Re: Question about libwww SGML/HTML parser...
Markku Savela writes:
> I have started experimenting with the SGML (HTML DTD) parser provided
> by the libwww 4.0B version. I am using my own structured stream, but
> was hoping to be able to use the HTMLPDTD.* in the library.
>
> It seems that for many of the HTML tags, I get only a call to
> "start_element", but not to "end_element", unless the HTML
> explicitly includes the end tag.
>
> And even more, if I have "<P> ... </P>", I get begin_element, but </P>
> is totally ignored. I can see the "why" from
>
> { "P" , l_attr, HTML_L_ATTRIBUTES, SGML_EMPTY },
>
> but, I am wondering, shouldn't <P> already be a "container", that is,
> SGML_MIXED. Similar question arises from some other tags, such as
> "LI" (we can have nested lists).
>
> I guess this all comes from the fact that the library does not really
> have a full SGML parser, and the HTMLPDTD does not really define the
> full "DTD". It seems that a large part of the DTD structuring rules
> (which tags are allowed within and after which tag) must be implemented
> in the start_element/end_element calls.
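
The effect described above can be sketched in a few lines of C. This is
only an illustration, assuming (as the quoted table entry suggests) that
an end tag is dropped whenever its element is marked as empty. The names
SketchTag, SKETCH_EMPTY, SKETCH_MIXED and end_tag_is_reported are
hypothetical and are not the libwww definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical content classes, named after the SGML_EMPTY and SGML_MIXED
   values quoted above. These are NOT the libwww definitions. */
typedef enum { SKETCH_EMPTY, SKETCH_MIXED } SketchContents;

typedef struct {
    const char    *name;      /* element name, upper case */
    SketchContents contents;  /* empty element or container */
} SketchTag;

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Variant 1: P marked empty, as in the quoted table entry. */
static const SketchTag tags_as_shipped[] = {
    { "P",  SKETCH_EMPTY },
    { "UL", SKETCH_MIXED },
};

/* Variant 2: P treated as a container, as the question proposes. */
static const SketchTag tags_p_container[] = {
    { "P",  SKETCH_MIXED },
    { "UL", SKETCH_MIXED },
};

/* Linear lookup by exact name; returns NULL for an unknown tag. */
static const SketchTag *find_tag(const SketchTag *table, size_t n,
                                 const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Returns 1 if an end tag for `name` would be passed on as an end_element
   event, 0 if it would be silently dropped (assumed behaviour). */
static int end_tag_is_reported(const SketchTag *table, size_t n,
                               const char *name)
{
    const SketchTag *t = find_tag(table, n, name);
    if (t == NULL)
        return 0;                        /* unknown tag: ignored */
    return t->contents != SKETCH_EMPTY;  /* empty elements get no end event */
}
```

With the first table, end_tag_is_reported(tags_as_shipped,
COUNT(tags_as_shipped), "P") yields 0, which matches the ignored </P>.
With the second table the same call yields 1. Changing the table entry
only makes the end tag visible; deciding which tags may nest inside or
follow which others would still be left to the stream's own callbacks.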
The SGML/HTML/HText code has been in bad shape for a long time, and we have
considered writing a new version based on our experience with Arena. Some of
the design goals are:
- Character set independence. The parser should be capable of handling
  8-, 16-, and 32-bit character sets
- Intelligent error recovery with the possibility of partial reparsing of
  certain data segments
- Event driven with incremental display
- Parse tree based for support of inline HTML editing
As mentioned, this is only on the drawing board, so input is appreciated!
--
Henrik Frystyk Nielsen, <frystyk@w3.org>
World-Wide Web Consortium, MIT/LCS NE43-356
545 Technology Square, Cambridge MA 02139, USA