
HTML/SGML Parser

From: Kim Liu <KLIU@us.oracle.com>
Date: 20 Sep 96 12:45:10 -0700
Message-Id: <199609201951.MAA24270@mailsun2.us.oracle.com>
To: www-lib@w3.org
I have written an HTML/SGML parser as part of a project but I have problems 
finding examples to prove that this is useful. Many so-called HTML parsers 
parse HTML documents by simply maintaining a stack of tags seen so far and do 
some ad-hoc tag matching when an end tag comes up. This is basically what the 
libwww parser does. On the other hand, my parser does take into account the 
nesting rules defined by the HTML 3.2 DTD. It knows what tags are/aren't 
allowed inside a particular tag. It knows about optional start/end tags and is 
able to infer these omitted tags. However, it doesn't enforce the "sequencing" 
(e.g. <X> must come before <Y>) or "at least one occurrence" (e.g. there must 
be one <title> in <head>) rules.  
 
I thought it should be very easy to find a perfectly legal (but possibly very 
complicated) HTML file that only a parser like mine can parse correctly. But I 
am now getting the impression that HTML is defined in such a way that even 
simple stack-based parsers can parse things correctly. For example, you can 
have a situation like this: 
 
Given the tags <X><Y><Z>, where <Y> has an optional end tag and <Z> is not 
allowed in <Y> but is allowed in <X>, the correct parse is 
<X><Y></Y><Z></X>. A simple stack-based parser will parse it as 
<X><Y><Z></Y></X>. However, given the tags in HTML 3.2 that have optional end 
tags (e.g. LI, DD, DT, TR, TD, TH, P), it's still very easy to parse the above 
sequence correctly, because the tags that enclose these tags (i.e. OL, UL, 
TABLE, etc.) only allow a very limited number of tags inside them. The only 
easy way to break such a simple parser is something like  
<P>abc<TABLE>....</TABLE> 
Since, according to the DTD, <TABLE> is not allowed in <P>, the above sequence 
should parse to <P>abc</P><TABLE>...</TABLE> instead of 
<P>abc<TABLE>...</TABLE></P>. But this case doesn't really mess up the 
rendering seriously. 
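
To make the difference concrete, here is a rough Python sketch of the two 
approaches. It is not my actual parser and not the libwww code: the ALLOWED 
and OPTIONAL_END tables are a tiny, hypothetical subset of the HTML 3.2 
content rules, the tokenizer ignores attributes, and the function names are 
made up for illustration. Both functions return the input with all end tags 
written out explicitly.

```python
import re

# Hypothetical, heavily simplified subset of the HTML 3.2 content rules.
# ALLOWED[parent] is the set of tags permitted directly inside parent.
ALLOWED = {
    "BODY": {"P", "TABLE", "UL"},
    "P": set(),                      # block elements like TABLE are not allowed in P
    "UL": {"LI"},
    "LI": {"P", "UL", "TABLE"},
    "TABLE": {"TR"},
    "TR": {"TD"},
    "TD": {"P", "UL", "TABLE"},
}
# Elements whose end tag may be omitted.
OPTIONAL_END = {"P", "LI", "TR", "TD"}

TOKEN = re.compile(r"<(/?)([A-Za-z]+)>|([^<]+)")


def tokenize(html):
    """Yield ("start", TAG), ("end", TAG) or ("text", str); no attributes."""
    for match in TOKEN.finditer(html):
        closing, tag, text = match.groups()
        if text is not None:
            if text.strip():
                yield ("text", text.strip())
        elif closing:
            yield ("end", tag.upper())
        else:
            yield ("start", tag.upper())


def close_through(stack, tag, out):
    """Pop and emit end tags up to and including the innermost open `tag`."""
    while True:
        top = stack.pop()
        out.append(f"</{top}>")
        if top == tag:
            return


def naive_parse(html):
    """Plain tag stack: an omitted end tag is closed only when forced to."""
    out, stack = [], []
    for kind, value in tokenize(html):
        if kind == "text":
            out.append(value)
        elif kind == "start":
            stack.append(value)
            out.append(f"<{value}>")
        elif value in stack:
            close_through(stack, value, out)
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


def dtd_parse(html):
    """Before opening a tag, close open elements that may not contain it."""
    out, stack = [], ["BODY"]        # implicit root, never emitted
    for kind, value in tokenize(html):
        if kind == "text":
            out.append(value)
        elif kind == "start":
            # Infer omitted end tags from the content rules.
            while (value not in ALLOWED.get(stack[-1], set())
                   and stack[-1] in OPTIONAL_END):
                out.append(f"</{stack.pop()}>")
            stack.append(value)
            out.append(f"<{value}>")
        elif value in stack[1:]:
            close_through(stack, value, out)
    while len(stack) > 1:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


if __name__ == "__main__":
    sample = "<P>abc<TABLE><TR><TD>x</TABLE>"
    print(naive_parse(sample))  # <P>abc<TABLE><TR><TD>x</TD></TR></TABLE></P>
    print(dtd_parse(sample))    # <P>abc</P><TABLE><TR><TD>x</TD></TR></TABLE>
```

The sketch does no error recovery: if a tag is not allowed in its parent and 
the parent's end tag is not optional, dtd_parse simply nests it anyway.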
 
So, could someone give me a few examples of (legal but possibly complicated) HTML 
that could mess up a simple stack-based parser (with simple hacks to recognize 
the boundary between <LI> elements and things like <TR>, <TD>, etc)? 
 
-Kim 
 
Received on Friday, 20 September 1996 15:48:05 GMT
