- From: Henrik Frystyk Nielsen <frystyk@w3.org>
- Date: Thu, 18 Jan 1996 11:15:44 -0500
- To: msa@hemuli.tte.vtt.fi (Markku Savela)
- Cc: www-lib@w3.org, msa@msa.tte.vtt.fi, mjk@hemuli.tte.vtt.fi
Markku Savela writes:

> I have started experimenting with the SGML (HTML DTD) parser provided
> by the libwww 4.0B version. I am using my own structured stream, but
> was hoping to be able to use the HTMLPDTD.* in the library.
>
> It seems that for many of the HTML tags, I get only a call to
> "start_element", but not to "end_element", unless the HTML
> explicitly includes it.
>
> And even more, if I have "<P> ... </P>", I get begin_element, but </P>
> is totally ignored. I can see the "why" from
>
>   { "P" , l_attr, HTML_L_ATTRIBUTES, SGML_EMPTY },
>
> but, I am wondering, shouldn't <P> already be a "container", that is,
> SGML_MIXED. A similar question arises for some other tags, such as
> "LI" (we can have nested lists).
>
> I guess this all comes from the fact that the library does not really
> have a full SGML parser, and the HTMLPDTD does not really define the
> full "DTD". It seems that a large part of the DTD structuring rules
> (which tags are allowed within and after which tag) must be
> implemented in the start_element/end_element calls.

The SGML/HTML/HText code has been very bad for a long time and we have
considered a new version based on the experience we have from Arena.
Some of the design goals are:

- Character set independence. The parser should be capable of handling
  8, 16, and 32 bit character sets
- Intelligent error recovery with the possibility of partial reparsing
  of certain data segments
- Event driven with incremental display
- Parse tree based for support of inline HTML editing

As mentioned, this is only on the drawing board, so input is appreciated!

--
Henrik Frystyk Nielsen, <frystyk@w3.org>
World-Wide Web Consortium, MIT/LCS NE43-356
545 Technology Square, Cambridge MA 02139, USA
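The behaviour described in the quoted message can be illustrated with a minimal sketch. This is not libwww's actual code: the `TagEntry` struct, the two-entry table, and the functions `find_tag` and `end_tag_is_reported` are hypothetical, and only the names SGML_EMPTY and SGML_MIXED come from the message. It assumes, as the message suggests, that an end tag is dropped when the element is declared SGML_EMPTY.

```c
#include <stddef.h>
#include <string.h>

/* Content models named in the message; this enum is a simplified stand-in. */
typedef enum { SGML_EMPTY, SGML_MIXED } ContentModel;

/* Hypothetical, cut-down tag entry (the real entry also carries attributes). */
typedef struct {
    const char  *name;
    ContentModel contents;
} TagEntry;

/* Illustrative table: P declared empty as in the quoted line, UL as a container. */
static const TagEntry tags[] = {
    { "P",  SGML_EMPTY },
    { "UL", SGML_MIXED },
};

/* Look up a tag by exact (upper-case) name; returns NULL if it is unknown. */
static const TagEntry *find_tag(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof tags / sizeof tags[0]; i++) {
        if (strcmp(tags[i].name, name) == 0)
            return &tags[i];
    }
    return NULL;
}

/* Returns 1 if an end tag for `name` would be passed on to end_element,
 * 0 if it would be silently dropped (unknown tag or SGML_EMPTY element). */
static int end_tag_is_reported(const char *name)
{
    const TagEntry *t = find_tag(name);
    return t != NULL && t->contents != SGML_EMPTY;
}
```

Under these assumptions, `end_tag_is_reported("P")` is 0, which matches the ignored `</P>`, and changing the P entry to SGML_MIXED would make it 1. Rules about which tags may nest inside which would still have to live in the start_element/end_element callbacks.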
Received on Thursday, 18 January 1996 11:16:30 UTC