WWWLIB Parser, is there going to be any update on it? from Markku Savela on 1996-08-14 (www-lib@w3.org from July to September 1996)

From: Markku Savela <msa@msa.tte.vtt.fi>
Date: Wed, 14 Aug 1996 11:09:37 +0300 (EET DST)
To: www-lib@w3.org
cc: msa@msa.tte.vtt.fi
Message-Id: <199608140809.LAA18939@msa.tte.vtt.fi>
I have raised this issue before, but I could not find my earlier
message in the archives (did someone delete it?).

I think "SGML.c" in the library tries to be too clever and trips over
itself. The control tables (HTMLPDTD.*) are not really sufficient for
full SGML parsing, and the SGML.c parser should not try to be a full
SGML parser.

My suggestion is that SGML.c be stripped down to a simple
"SGML tokenizer". It would produce technically the same output as it
does now (a structured stream of elements, content and entities), but
it would not attempt any "fixing" or "checking" of the HTML.
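
To make the idea concrete, here is a minimal sketch of such a tokenizer.
It is not the libwww code: the names TokKind, TokSink, tokenize, Trace
and trace_sink are all hypothetical, and the sketch assumes the whole
document is available as one NUL-terminated string. It reports start
tags, end tags and text runs in source order and never infers a missing
end tag. Attributes, entities, comments and quoted '>' characters are
deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical event kinds; these names are not part of libwww. */
typedef enum { TOK_START, TOK_END, TOK_TEXT } TokKind;

/* Hypothetical callback: receives a tag name or text run (not NUL-terminated). */
typedef void (*TokSink)(TokKind kind, const char *s, size_t len, void *ctx);

/* Walk the input once and report tokens exactly as they appear.
 * Nothing is "fixed": no end tags are assumed, no nesting is checked. */
static void tokenize(const char *in, TokSink sink, void *ctx)
{
    const char *p = in;
    assert(in != NULL && sink != NULL);
    while (*p) {
        if (*p == '<') {
            const char *gt = strchr(p, '>');
            const char *name = p + 1;
            const char *e;
            TokKind kind = TOK_START;
            if (gt == NULL) {              /* unterminated tag: pass through as text */
                sink(TOK_TEXT, p, strlen(p), ctx);
                return;
            }
            if (*name == '/') { kind = TOK_END; name++; }
            e = name;                      /* tag name ends at whitespace or '>' */
            while (e < gt && !isspace((unsigned char)*e)) e++;
            sink(kind, name, (size_t)(e - name), ctx);
            p = gt + 1;                    /* attributes are skipped in this sketch */
        } else {
            const char *lt = strchr(p, '<');
            size_t len = lt ? (size_t)(lt - p) : strlen(p);
            sink(TOK_TEXT, p, len, ctx);
            p += len;
        }
    }
}

/* Example sink: records "+NAME " for start tags and "-NAME " for end tags. */
typedef struct { char buf[256]; size_t used; } Trace;

static void trace_sink(TokKind kind, const char *s, size_t len, void *ctx)
{
    Trace *t = ctx;
    int n;
    if (kind == TOK_TEXT) return;          /* text runs are ignored by this sink */
    n = snprintf(t->buf + t->used, sizeof t->buf - t->used, "%c%.*s ",
                 kind == TOK_START ? '+' : '-', (int)len, s);
    if (n > 0) {                           /* clamp in case the output was truncated */
        size_t room = sizeof t->buf - t->used - 1;
        t->used += (size_t)n < room ? (size_t)n : room;
    }
}
```

Calling tokenize() with trace_sink and a zero-initialised Trace on an
inner table fragment yields one event per tag that actually occurs in
the source, so a single </TABLE> produces a single end event. Deciding
which TD and TR elements that end tag closes is left to whatever sits
above the tokenizer.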

The reason I ask is that the current parser goes astray and "fixes"
correct HTML into garbage. An example is the following TABLE within a
TABLE:

	<HTML><HEAD><TITLE>Table test</TITLE></HEAD><BODY>
	<H1>Testing a table structure</H1>
	Some filler text.
	<P>
	<TABLE BORDER=1>
	<TR><TD>r1 c1<TD>r1 c2<TD>r1 c3
	<TR><TD>r2 c1<TD>r2
	<TABLE BORDER=1><TR><TD>x1<TD>x2<TR><TD>y1<TD>y2</TABLE>
	c2<TD>r2 c3
	</TABLE>
	</P>
	Some filler text.
	</BODY></HTML>


The stream of element/content calls that the parser outputs is
incorrect; it loses track completely at the end of the inner table:

SGML Parser. Start <HTML>
SGML Parser. Start <HEAD>
SGML Parser. Start <TITLE>
SGML Parser. End   </TITLE>
SGML Parser. End   </HEAD>
SGML Parser. Start <BODY>
SGML Parser. Start <H1>
SGML Parser. End   </H1>
SGML Parser. Start <P>
SGML Parser. Start <TABLE>
SGML Parser. Start <TR>
SGML Parser. Start <TD>
SGML Parser. Start <TD>
SGML Parser. Start <TD>
SGML Parser. Start <TR>
SGML Parser. Start <TD>
SGML Parser. Start <TD>
SGML Parser. Start <TABLE>
SGML Parser. Start <TR>
SGML Parser. Start <TD>
SGML Parser. Start <TD>
SGML Parser. Start <TR>
SGML Parser. Start <TD>
SGML Parser. Start <TD>
SGML Parser. End   </TABLE>
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TR>. </TR> assumed.
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TR>. </TR> assumed.
SGML Parser. Start <TD>
SGML Parser. End   </TABLE>
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TR>. </TR> assumed.
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
SGML Parser. Found </TABLE> when expecting </TR>. </TR> assumed.
SGML Parser. End   </P>
SGML Parser. End   </BODY>
SGML Parser. End   </HTML>


Btw, HTMLPDTD.c includes the line

    { "LI"	, li_attr,	HTML_LI_ATTRIBUTES,	SGML_EMPTY },

LI is logically a container and should be SGML_MIXED instead. Similarly,
some other tags marked as "EMPTY" should be reconsidered (DD, DT,
NOTE?, ...). All of this becomes unnecessary, though, if SGML.c is
simplified into a plain SGML tokenizer (empty/mixed is then of no
interest to it, although the flag might still be useful for higher
levels).
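
As an illustration of keeping the empty/mixed flag available to higher
levels, the sketch below uses a small lookup table. It is independent
of the real HTMLPDTD tables: ContentModel, TagInfo, tag_table and
content_model are hypothetical names, and the table lists only a few
tags. Elements that never have content (BR, HR, IMG) are marked empty,
while LI, DD and DT are marked as containers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical content-model flag; not the libwww SGML_EMPTY/SGML_MIXED constants. */
typedef enum { MODEL_EMPTY, MODEL_MIXED } ContentModel;

typedef struct {
    const char  *name;
    ContentModel model;
} TagInfo;

/* Illustrative subset only: empty elements versus containers. */
static const TagInfo tag_table[] = {
    { "BR",  MODEL_EMPTY },
    { "HR",  MODEL_EMPTY },
    { "IMG", MODEL_EMPTY },
    { "LI",  MODEL_MIXED },
    { "DD",  MODEL_MIXED },
    { "DT",  MODEL_MIXED },
};

/* Case-insensitive comparison of two tag names. */
static int same_name(const char *a, const char *b)
{
    while (*a && *b) {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Look up a tag; tags not in the table are treated as containers. */
static ContentModel content_model(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof tag_table / sizeof tag_table[0]; i++)
        if (same_name(tag_table[i].name, name))
            return tag_table[i].model;
    return MODEL_MIXED;
}
```

A layer above the tokenizer could consult content_model() to decide
whether a start tag opens a scope that later has to be closed, without
the tokenizer itself knowing anything about it.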

--
Markku Savela (msa@hemuli.tte.vtt.fi),     Technical Research Centre of Finland
Multimedia Systems, P.O.Box 1203,FIN-02044 VTT,http://www.vtt.fi/tte/staff/msa/
Received on Wednesday, 14 August 1996 04:10:00 UTC