- From: Markku Savela <msa@msa.tte.vtt.fi>
- Date: Wed, 14 Aug 1996 11:09:37 +0300 (EET DST)
- To: www-lib@w3.org
- cc: msa@msa.tte.vtt.fi
I have talked about this issue earlier, but I couldn't find my message about it in the archives (did someone delete it?).

I think "SGML.c" in the library attempts to be too clever and trips over itself. The control tables (HTMLPDTD.*) are not really sufficient for full SGML parsing, and the SGML.c parser should not try to be a full SGML parser.

My suggestion is that SGML.c should be stripped down into a simple "SGML tokenizer". It would produce technically the same output as it does now (a structured stream with elements, content and entities), but it should not attempt any "fixing" or "checking" of the HTML.

The reason I ask is that the current parser goes astray and "fixes" correct HTML into garbage. An example is the following TABLE within a TABLE:

    <HTML><HEAD><TITLE>Table test</TITLE></HEAD><BODY>
    <H1>Testing a table structure</H1>
    Some filler text.
    <P>
    <TABLE BORDER=1>
    <TR><TD>r1 c1<TD>r1 c2<TD>r1 c3
    <TR><TD>r2 c1<TD>r2
    <TABLE BORDER=1><TR><TD>x1<TD>x2<TR><TD>y1<TD>y2</TABLE>
    c2<TD>r2 c3
    </TABLE>
    </P>
    Some filler text.
    </BODY></HTML>

The output stream of element/content calls from the parser is incorrect; it totally loses it at the end of the inner table:

    SGML Parser. Start <HTML>
    SGML Parser. Start <HEAD>
    SGML Parser. Start <TITLE>
    SGML Parser. End </TITLE>
    SGML Parser. End </HEAD>
    SGML Parser. Start <BODY>
    SGML Parser. Start <H1>
    SGML Parser. End </H1>
    SGML Parser. Start <P>
    SGML Parser. Start <TABLE>
    SGML Parser. Start <TR>
    SGML Parser. Start <TD>
    SGML Parser. Start <TD>
    SGML Parser. Start <TD>
    SGML Parser. Start <TR>
    SGML Parser. Start <TD>
    SGML Parser. Start <TD>
    SGML Parser. Start <TABLE>
    SGML Parser. Start <TR>
    SGML Parser. Start <TD>
    SGML Parser. Start <TD>
    SGML Parser. Start <TR>
    SGML Parser. Start <TD>
    SGML Parser. Start <TD>
    SGML Parser. End </TABLE>
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TR>. </TR> assumed.
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TR>. </TR> assumed.
    SGML Parser. Start <TD>
    SGML Parser. End </TABLE>
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TR>. </TR> assumed.
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TD>. </TD> assumed.
    SGML Parser. Found </TABLE> when expecting </TR>. </TR> assumed.
    SGML Parser. End </P>
    SGML Parser. End </BODY>
    SGML Parser. End </HTML>

Btw, HTMLPDTD.c includes the line

    { "LI" , li_attr, HTML_LI_ATTRIBUTES, SGML_EMPTY },

LI is logically a container, so it should be SGML_MIXED instead. Similarly, some other tags marked as "EMPTY" should be reconsidered (DD, DT, NOTE?, ...). All of this is unnecessary, though, if SGML.c is simplified to a plain SGML tokenizer (then empty/mixed is of no interest to it, although the flag might be useful for higher levels).

--
Markku Savela (msa@hemuli.tte.vtt.fi), Technical Research Centre of Finland
Multimedia Systems, P.O.Box 1203,FIN-02044 VTT,http://www.vtt.fi/tte/staff/msa/
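To illustrate the tokenizer-only approach suggested in this message, here is a rough sketch in C. It is not the libwww code or its API: the names TokKind, TokSink, tokenize, Trace and trace_sink are all made up for this example. It assumes a NUL-terminated input buffer, ignores entities, and does not handle a '>' inside a quoted attribute value. The point is only that it reports each tag and text run as found, keeps no element stack, and never infers a missing end tag.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds; these names are not from libwww. */
typedef enum { TOK_START, TOK_END, TOK_TEXT } TokKind;

/* Called once per token. `s` is NOT NUL-terminated; use `len`. */
typedef void (*TokSink)(TokKind kind, const char *s, size_t len, void *ctx);

/* Split `input` into start tags, end tags and text runs.
   No element stack is kept and no end tags are assumed. */
void tokenize(const char *input, TokSink sink, void *ctx)
{
    const char *p = input;
    assert(input != NULL && sink != NULL);
    while (*p) {
        if (*p == '<' && (isalpha((unsigned char)p[1]) ||
                          (p[1] == '/' && isalpha((unsigned char)p[2])))) {
            int is_end = (p[1] == '/');
            const char *name = p + (is_end ? 2 : 1);
            const char *q = name;
            while (isalnum((unsigned char)*q))
                q++;
            sink(is_end ? TOK_END : TOK_START, name, (size_t)(q - name), ctx);
            /* Skip any attributes up to the closing '>'. */
            while (*q && *q != '>')
                q++;
            p = *q ? q + 1 : q;
        } else {
            /* Text run: everything up to the next '<' (or end of input). */
            const char *q = p + 1;
            while (*q && *q != '<')
                q++;
            sink(TOK_TEXT, p, (size_t)(q - p), ctx);
            p = q;
        }
    }
}

/* Example sink: records "+NAME " / "-NAME " for tags and ignores text. */
typedef struct { char buf[256]; size_t n; } Trace;

void trace_sink(TokKind kind, const char *s, size_t len, void *ctx)
{
    Trace *t = (Trace *)ctx;
    if (kind == TOK_TEXT)
        return;
    if (t->n + len + 2 >= sizeof t->buf)
        return;                       /* drop the token if the buffer is full */
    t->buf[t->n++] = (kind == TOK_START) ? '+' : '-';
    memcpy(t->buf + t->n, s, len);
    t->n += len;
    t->buf[t->n++] = ' ';
    t->buf[t->n] = '\0';
}
```

Calling tokenize() with trace_sink on a nested table reports both </TABLE> tags exactly as they appear in the input. Deciding which open TD and TR elements each one closes is left to the layer above, which is where the empty/mixed flag could still be useful.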
Received on Wednesday, 14 August 1996 04:10:00 UTC