Re: HTML/SGML Parser

From: Don Park <donpark@telewise.com>
Date: Fri, 20 Sep 1996 19:18:29 -0700
Message-Id: <199609210219.TAA21489@gw.quake.net>
To: "Kim Liu" <KLIU@us.oracle.com>, <www-lib@w3.org>
Kim,

Being a consultant, I have had the misfortune of implementing both a
Yacc/Lex-based HTML parser and a hand-coded recursive descent one.  In both
cases, my real problem boiled down to guessing the intention of the HTML
author, since valid HTML pages are rather rare.  For example, the
www.netscape.com page starts with an </A> tag with no begin tag in sight.
This sort of sloppiness will throw you off if your parser logic is too
strict.  Better HTML tools are not improving the situation, since one small
mistake in a tool is now multiplied.  For example, the www.news.com page was
evidently written using a tool, since it had multiple instances of <FONT
FACE=\"fontname\"> tags.  It took me only a few minutes to allow for such
mistakes, but it certainly left a very bad taste in my mouth.
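
To make the stray </A> case concrete, here is a minimal sketch in Python of a
tag matcher that records such problems instead of failing on them.  It is not
the code of either parser mentioned above; the names tolerant_match, TAG_RE
and EMPTY_TAGS are made up for illustration, EMPTY_TAGS is only a small
illustrative subset of the elements that take no end tag, and the sketch does
no DTD-based inference of omitted end tags.

```python
import re

# Matches a start or end tag; attributes are skipped, not validated.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Illustrative subset of elements that never take an end tag (assumption).
EMPTY_TAGS = {"BR", "HR", "IMG"}


def tolerant_match(html):
    """Return (events, problems) for the tags found in html.

    events is a list of ("open" | "close", NAME) pairs; problems is a list
    of human-readable notes about markup that had to be repaired.
    """
    stack, events, problems = [], [], []
    for match in TAG_RE.finditer(html):
        closing = match.group(1) == "/"
        name = match.group(2).upper()
        if not closing:
            events.append(("open", name))
            if name in EMPTY_TAGS:
                events.append(("close", name))
            else:
                stack.append(name)
        elif name in stack:
            # Pop down to the matching start tag, closing anything left open.
            while stack:
                top = stack.pop()
                events.append(("close", top))
                if top == name:
                    break
                problems.append(f"implicitly closed <{top}> at </{name}>")
        else:
            # End tag with no begin tag in sight: note it and carry on.
            problems.append(f"ignored stray </{name}>")
    while stack:
        top = stack.pop()
        events.append(("close", top))
        problems.append(f"closed unterminated <{top}> at end of input")
    return events, problems
```

Calling tolerant_match('</A><B>hi</B>') reports the stray </A> in the problems
list and still yields a balanced open/close pair for <B>.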

A good HTML parser should treat the HTML specs as GUIDELINES only and allow
for maximum deviation from the spec.  For testing, I recommend that you
write a WebCrawler-like program that 'exercises' your parser by visiting as
many sites as possible and collecting the addresses of the pages it has
problems with.  Keep a database of the problem pages for parser test runs.
Without such a robot, you are just kidding yourself if you think your parser
is robust enough to take on all pages.
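
As a rough illustration of such a robot, here is a sketch in Python under
several assumptions: it takes a fixed list of URLs rather than following
links, it treats any exception raised by the parser as "a problem", and the
names fetch, exercise_parser and save_problems are hypothetical.  The "parse"
argument stands for whatever entry point your own parser exposes.

```python
import json
import urllib.request


def fetch(url, timeout=10):
    """Download a page and decode it leniently (latin-1 never fails)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("latin-1", errors="replace")


def exercise_parser(urls, parse, fetch_page=fetch):
    """Run parse() on every page and return {url: error} for the failures."""
    problems = {}
    for url in urls:
        try:
            page = fetch_page(url)
        except (OSError, ValueError):
            # Network or URL trouble, not a parser problem: skip the page.
            continue
        try:
            parse(page)
        except Exception as exc:
            problems[url] = f"{type(exc).__name__}: {exc}"
    return problems


def save_problems(problems, path):
    """Keep the problem pages in a JSON file for later regression runs."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(problems, fh, indent=2, sort_keys=True)
```

The saved file then serves as the list of addresses to re-run after each
parser change.  Passing a different fetch_page makes it possible to replay
pages from a local cache instead of the network.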

Sincerely,

Don Park

----------
> From: Kim Liu <KLIU@us.oracle.com>
> To: www-lib@w3.org
> Subject: HTML/SGML Parser
> Date: Friday, September 20, 1996 12:45 PM
> 
> I have written an HTML/SGML parser as part of a project, but I have problems
> finding examples to prove that it is useful. Many so-called HTML parsers
> parse HTML documents by simply maintaining a stack of tags seen so far and
> doing some ad-hoc tag matching when an end tag comes up. This is basically
> what the libwww parser does. On the other hand, my parser does take into
> account the nesting rules defined by the HTML 3.2 DTD. It knows what tags
> are/aren't allowed inside a particular tag. It knows about optional
> start/end tags and is able to infer these omitted tags. However, it doesn't
> enforce the "sequencing" (e.g. <X> must come before <Y>) and "at least one
> occurrence" (e.g. there must be one <title> in <head>) rules, etc.
>  
> I thought it should be very easy to find a perfectly legal (but possibly
> very complicated) HTML file that only a parser like mine can parse
> correctly. But I am now getting the impression that HTML is defined in such
> a way that even simple stack-based parsers can parse things correctly. For
> example, you can have a situation like this:
>
> Given the tags <X><Y><Z>,
> <Y> has an optional end tag and <Z> is not allowed in <Y> but is allowed in
> <X>. The right parsed result should be <X><Y></Y><Z></X>. A simple
> stack-based parser will parse it as <X><Y><Z></Y></X>. However, given the
> tags in HTML 3.2 that have optional end tags (e.g. LI, DD, DT, TR, TD, TH,
> P), it's still very easy to parse the above sequence correctly because the
> tags that enclose these tags (i.e. OL, UL, TABLE, etc.) only allow a very
> limited number of tags inside them. The only easy way to break such a
> simple parser is something like
> <P>abc<TABLE>....</TABLE>
> Since according to the DTD, <TABLE> is not allowed in <P>, the above
> sequence should parse to <P>abc</P><TABLE>...</TABLE> instead of
> <P>abc<TABLE>...</TABLE></P>. But this case doesn't really mess up the
> rendering seriously.
>  
> So, could someone give me a few examples of (legal but possibly complicated)
> HTML that could mess up a simple stack-based parser (with simple hacks to
> recognize the boundary between <LI> elements and things like <TR>, <TD>,
> etc.)?
>  
> -Kim 
>  
> 
Received on Friday, 20 September 1996 22:19:20 GMT
