dt/center processing problem fix

Dave,

The 4 August 2000 and earlier versions of HTML Tidy contain a bug
that causes a segmentation fault in the InsertNodeAfterElement
procedure when the specified element has no parent.  The problem
occurs when HTML Tidy parses an inferred definition list that
contains a center element, as illustrated in the following
segment of HTML code.

  <BODY>
   <CENTER><H1>Heading 1</H1></CENTER>
    <DT><IMG src="redball.gif"><B>Term 1</B></DT>
    <DT><IMG src="redball.gif"><B>Term 2</B><HR></DT>
   <CENTER><H1>Heading 2</H1></CENTER>

This problem was reported by Glenn Carroll as a "dt/center
processing problem" in an e-mail dated Wed, Apr 19 2000, but I
have found no record of a reported fix in the html-tidy@w3.org
mail archive.

Using the HTML source file and HTML Tidy configuration file
presented in sections 1 and 2 of the text file attached to
this letter, I traced the problem to the code block labeled
"/* center in a dt or a dl breaks the dl list in two */" in the
ParseDefList procedure (lines 1457 to 1475 in parser.c).

  1457         /* center in a dt or a dl breaks the dl list in two */
  1458         if (node->tag == tag_center)
  1459         {
  1460             if (list->content)
  1461                 InsertNodeAfterElement(list, node);
  1462             else /* trim empty dl list */
  1463             {
  1464                 InsertNodeBeforeElement(list, node);
  1465                 DiscardElement(list);
  1466             }
  1467
  1468             /* and parse contents of center */
  1469             ParseTag(lexer, node, mode);
  1470
  1471             /* now create a new dl element */
  1472             list = InferredTag(lexer, "dl");
  1473             InsertNodeAfterElement(node, list);
  1474             continue;
  1475         }

In the code block, ParseTag is called for the <CENTER> node
following the first set of <DT> elements which are not contained
in a <DL>...</DL> element.  When the <H1> node immediately after
the <CENTER> node is encountered by the ParseBlock procedure
(ParseTag procedure for center tag), the <CENTER> element is
discarded by the following block of code (lines 765 to 781 of
parser.c)

   765             else if (node->tag->model & CM_BLOCK)
   766             {
   767                 if (lexer->excludeBlocks)
   768                 {
   769                     if (!(element->tag->model & CM_OPT))
   770                         ReportWarning(lexer, element, node,
                                             MISSING_ENDTAG_BEFORE);
   771
   772                     UngetToken(lexer);
   773
   774                     if (element->tag->model & CM_OBJECT)
   775                         lexer->istackbase = istackbase;
   776
   777                     TrimSpaces(lexer, element);
   778                     TrimEmptyElement(lexer, element);
   779                     return;
   780                 }
   781             }

This block, extracted from ParseBlock, is executed because the
value of lexer->excludeBlocks is true.  By the time processing
returns from the ParseBlock (ParseTag) procedure, the center
element has been discarded, so the center element "node" passed
in the call to InsertNodeAfterElement for the inferred dl element
"list" no longer holds a valid pointer to a parent node.

The occurrence of a center element in the definition list causes
the definition list to be split into two lists around the center
element.  Consequently, the center element is no longer contained
in a definition list, and block elements are permitted inside it.
Therefore, based on my interpretation of HTML Tidy processing in
this case, I believe the lexer->excludeBlocks flag needs to be set
to no before the center node is parsed and then set back to yes
before ParseDefList processing continues with a new definition
list, as illustrated below.

  1466             }
  1467
  1468             /* and parse contents of center */
+                  lexer->excludeBlocks = no;
  1469             ParseTag(lexer, node, mode);
+                  lexer->excludeBlocks = yes;
  1470
  1471             /* now create a new dl element */


The text file "INFO_1.txt" provided as an attachment with this
letter contains the following sections which present information
to substantiate my findings.

  1. HTML Source File - coredump2_O.htm
  2. HTML Tidy Configuration File - coredump2.cfg
  3. Original HTML Tidy Execution
  4. Examination of Core Dump with gdb
  5. HTML Tidy Source Patches
  6. Patched HTML Tidy Execution

The HTML source file contains a condensed portion of an actual
web page which caused the segmentation fault and incorporates the
same HTML coding errors: missing <DL> and </DL> tags, needless
</DT> tags, missing <DD> tags, and incorrect use of UL tags
instead of DL tags.  I presume the web page author intended to
use a definition list to create custom bullets for an unordered
list instead of using CSS to define a list-style-image property
for unordered list elements.

Respectfully,
Gary Deschaines
gary.deschaines@netmechanic.com

FILE:  INFO_1.txt (attachment to MEMO_1.txt)
DATE:  10 AUG 2000

-------------------------------------
1. HTML Source File - coredump2_O.htm
-------------------------------------
<HTML>
 <HEAD>
  <TITLE>Core Dump Case 2</TITLE>
 </HEAD>
 <BODY>
  <CENTER><H1>Heading 1</H1></CENTER>
   <DT><IMG src="redball.gif"><B>Term 1</B></DT>
   <DT><IMG src="redball.gif"><B>Term 2</B><HR></DT>
  <CENTER><H1>Heading 2</H1></CENTER>
  <UL>
   <DT><IMG src="redball.gif"><B>Term 3</B></DT>
   <DT><IMG src="redball.gif"><B>Term 4</B><HR></DT>
  </UL>
 </BODY>
</HTML>

-----------------------------------------------
2. HTML Tidy Configuration File - coredump2.cfg
-----------------------------------------------
write-back: no
tidy-mark: no
quote-ampersand: no
show-warnings: yes
char-encoding: raw
markup: yes
show-acc-warnings: no
hide-endtags: no
uppercase-tags: no
uppercase-attributes: no
wrap-script-literals: no
numeric-entities: no
indent: auto
wrap: 0
logical-emphasis: no
clean: no
drop-font-tags: no

-------------------------------
3. Original HTML Tidy Execution
-------------------------------
../orig/tidy -e -config coredump2.cfg coredump2_O.htm

Tidy (vers 4th August 2000) Parsing "coredump2_O.htm"
line 7 column 4 - Warning: <dt> isn't allowed in <body> elements
line 7 column 4 - Warning: inserting implicit <dl>
line 7 column 8 - Warning: <img> lacks "alt" attribute
line 8 column 8 - Warning: <img> lacks "alt" attribute
line 8 column 44 - Warning: <hr> isn't allowed in <dt> elements
line 8 column 48 - Warning: trimming empty <dt>
line 9 column 11 - Warning: missing </center> before <h1>
line 9 column 11 - Warning: trimming empty <center>
Segmentation fault (core dumped)

------------------------------------
4. Examination of Core Dump with gdb
------------------------------------
gdb -nx ../orig/tidy -c core

GNU gdb 19991004
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `../orig/tidy -e -config coredump2.cfg coredump2_O.htm'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
#0  0x804a6bd in InsertNodeAfterElement (element=0x8071a88, node=0x8071a88) at parser.c:205
205         if (parent->last == element)

(gdb) where

#0  0x804a6bd in InsertNodeAfterElement (element=0x8071a88, node=0x8071a88) at parser.c:205
#1  0x804cc13 in ParseDefList (lexer=0x806f2c0, list=0x8071a88, mode=0) at parser.c:1473
#2  0x804ac47 in ParseTag (lexer=0x806f2c0, node=0x8071640, mode=0) at parser.c:432
#3  0x804f4ef in ParseBody (lexer=0x806f2c0, body=0x80714c0, mode=0) at parser.c:2883
#4  0x804ac47 in ParseTag (lexer=0x806f2c0, node=0x80714c0, mode=0) at parser.c:432
#5  0x804fe81 in ParseHTML (lexer=0x806f2c0, html=0x8071390, mode=0) at parser.c:3217
#6  0x804ffb9 in ParseDocument (lexer=0x806f2c0) at parser.c:3264
#7  0x80604a4 in main (argc=2, argv=0xbffff9f0) at tidy.c:956

(gdb) l 205

200         Node *parent;
201
202         parent = element->parent;
203         node->parent = parent;
204
205         if (parent->last == element)
206             parent->last = node;
207         else
208         {
209             node->next = element->next;

(gdb) p element->parent

$1 = (struct _node *) 0x0

---------------------------
5. HTML Tidy Source Patches
---------------------------
*** ./orig/parser.c     Fri Aug  4 12:21:05 2000
--- ./code/parser.c     Thu Aug 10 09:27:27 2000
***************
*** 1466,1472 ****
--- 1466,1474 ----
              }
  
              /* and parse contents of center */
+             lexer->excludeBlocks = no;
              ParseTag(lexer, node, mode);
+             lexer->excludeBlocks = yes;
  
              /* now create a new dl element */
              list = InferredTag(lexer, "dl");

------------------------------
6. Patched HTML Tidy Execution
------------------------------
../code/tidy -e -config coredump2.cfg coredump2_O.htm

Tidy (vers 4th August 2000) Parsing "coredump2_O.htm"
line 7 column 4 - Warning: <dt> isn't allowed in <body> elements
line 7 column 4 - Warning: inserting implicit <dl>
line 7 column 8 - Warning: <img> lacks "alt" attribute
line 8 column 8 - Warning: <img> lacks "alt" attribute
line 8 column 44 - Warning: <hr> isn't allowed in <dt> elements
line 8 column 48 - Warning: trimming empty <dt>
line 10 column 3 - Warning: trimming empty <dl>
line 11 column 4 - Warning: missing <li>
line 11 column 4 - Warning: inserting implicit <dl>
line 11 column 8 - Warning: <img> lacks "alt" attribute
line 12 column 8 - Warning: <img> lacks "alt" attribute
line 12 column 44 - Warning: <hr> isn't allowed in <dt> elements
line 12 column 48 - Warning: trimming empty <dt>
line 13 column 3 - Warning: missing </dl> before </ul>

coredump2_O.htm: Document content looks like HTML 3.2
14 warnings/errors were found!

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
  <head>
    <title>Core Dump Case 2</title>
  </head>

  <body>
    <center>
      <h1>Heading 1</h1>
    </center>

    <dl>
      <dt><img src="redball.gif"><b>Term 1</b></dt>

      <dt><img src="redball.gif"><b>Term 2</b></dt>

      <dd>
        <hr>
      </dd>
    </dl>

    <center>
      <h1>Heading 2</h1>
    </center>


    <div style="margin-left: 2em">
      <dl>
        <dt><img src="redball.gif"><b>Term 3</b></dt>

        <dt><img src="redball.gif"><b>Term 4</b></dt>

        <dd>
          <hr>
        </dd>
      </dl>
    </div>
  </body>
</html>

The alt attribute should be used to give a short description
of an image; longer descriptions should be given with the
longdesc attribute which takes a URL linked to the description.
These measures are needed for people using non-graphical browsers.

For further advice on how to make your pages accessible
see "http://www.w3.org/WAI/GL". You may also want to try
"http://www.cast.org/bobby/" which is a free Web-based
service for checking URLs for accessibility.

HTML & CSS specifications are available from http://www.w3.org/
To learn more about Tidy see http://www.w3.org/People/Raggett/tidy/
Please send bug reports to Dave Raggett care of <html-tidy@w3.org>
Lobby your company to join W3C, see http://www.w3.org/Consortium

Received on Thursday, 10 August 2000 14:24:44 UTC