
Re: Question about libwww SGML/HTML parser...



Markku Savela writes:
> I have started experimenting with the SGML (HTML DTD) parser provided
> by the libwww 4.0B version. I am using my own structured stream, but
> was hoping to be able to use the HTMLPDTD.* in the library.
> 
> It seems that for many of the HTML tags, I get only a call to
> "start_element", but not to "end_element", unless the HTML
> explicitly includes the end tag.
> 
> And even more, if I have "<P> ... </P>", I get begin_element, but </P>
> is totally ignored. I can see the "why" from
> 
>     { "P"	, l_attr,	HTML_L_ATTRIBUTES,	SGML_EMPTY },
> 
> but, I am wondering, shouldn't <P> already be a "container", that is,
> SGML_MIXED? A similar question arises for some other tags, such as
> "LI" (we can have nested lists).
> 
> I guess this all comes from the fact that the library does not really
> have a full SGML parser, and the HTMLPDTD does not really define the
> full "DTD". It seems that a large part of the DTD structuring rules
> (which tags are allowed within and after which tag) must be implemented
> in the start_element/end_element calls.
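
To make the quoted observation concrete, here is a minimal sketch of how a
content-model flag in a tag table can decide whether an explicit end tag is
reported at all. The names (ContentModel, TagEntry, tag_table, find_tag,
end_tag_is_reported) are hypothetical and only loosely modelled on the
HTMLPDTD entry quoted above; this is an illustration of the idea, not the
library's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical content models, loosely modelled on the SGML_EMPTY and
   SGML_MIXED values in the quoted table entry. */
typedef enum { CONTENT_EMPTY, CONTENT_MIXED } ContentModel;

typedef struct {
    const char  *name;
    ContentModel contents;
} TagEntry;

/* Illustrative table only: P is declared EMPTY, as in the quoted entry. */
static const TagEntry tag_table[] = {
    { "BR", CONTENT_EMPTY },
    { "P",  CONTENT_EMPTY },
    { "UL", CONTENT_MIXED },
};

/* Linear lookup by exact (case-sensitive) name; NULL if unknown. */
const TagEntry *find_tag(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof tag_table / sizeof tag_table[0]; i++)
        if (strcmp(tag_table[i].name, name) == 0)
            return &tag_table[i];
    return NULL;
}

/* Returns 1 if an explicit end tag for this element would be passed on
   to end_element, 0 if it would be dropped (unknown or EMPTY element). */
int end_tag_is_reported(const char *name)
{
    const TagEntry *tag = find_tag(name);
    return tag != NULL && tag->contents != CONTENT_EMPTY;
}
```

With this table, end_tag_is_reported("P") is 0 while
end_tag_is_reported("UL") is 1, which matches the behaviour described in
the quoted message. Changing the P entry to CONTENT_MIXED would make </P>
visible, but closing an open <P> implicitly would still be up to the
caller.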

The SGML/HTML/HText code has been in poor shape for a long time, and we have 
considered writing a new version based on our experience with Arena. Some of 
the design goals are:

      -	Character set independence. The parser should be capable of handling
	8-, 16-, and 32-bit character sets

      -	Intelligent error recovery, with the possibility of partially
	reparsing certain data segments

      -	Event-driven, with incremental display

      -	Parse-tree based, to support inline HTML editing
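
As a rough illustration of the event-driven, incremental goal only (not the
other goals, and not an actual design), the sketch below accepts input in
arbitrary chunks and emits start, end, and text events as soon as they are
complete, even when a tag is split across two chunks. All names here
(Handlers, Parser, parser_init, parser_feed, Log and the log_* callbacks) are
hypothetical; attributes, comments, entities, and error recovery are
deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callbacks supplied by the application (hypothetical interface). */
typedef struct {
    void (*start_element)(void *ctx, const char *name);
    void (*end_element)(void *ctx, const char *name);
    void (*characters)(void *ctx, const char *text, size_t len);
    void *ctx;
} Handlers;

/* Parser state that survives between chunks. */
typedef struct {
    Handlers h;
    int      in_tag;    /* non-zero while between '<' and '>' */
    char     name[32];  /* tag name collected so far */
    size_t   len;
} Parser;

void parser_init(Parser *p, Handlers h)
{
    p->h = h;
    p->in_tag = 0;
    p->len = 0;
}

/* Consume one chunk; events fire as soon as their input is complete. */
void parser_feed(Parser *p, const char *buf, size_t n)
{
    size_t i, text_start = 0;
    assert(p != NULL && buf != NULL);
    for (i = 0; i < n; i++) {
        char c = buf[i];
        if (!p->in_tag) {
            if (c == '<') {
                if (i > text_start)   /* flush text seen before the tag */
                    p->h.characters(p->h.ctx, buf + text_start, i - text_start);
                p->in_tag = 1;
                p->len = 0;
            }
        } else if (c == '>') {
            p->name[p->len] = '\0';
            if (p->name[0] == '/')
                p->h.end_element(p->h.ctx, p->name + 1);
            else
                p->h.start_element(p->h.ctx, p->name);
            p->in_tag = 0;
            text_start = i + 1;
        } else if (p->len + 1 < sizeof p->name) {
            p->name[p->len++] = c;    /* over-long names are truncated */
        }
    }
    if (!p->in_tag && n > text_start) /* flush trailing text of this chunk */
        p->h.characters(p->h.ctx, buf + text_start, n - text_start);
}

/* Example handlers that record events as "S:name;", "E:name;", "T:text;". */
typedef struct { char text[128]; size_t len; } Log;

static void log_event(void *ctx, char kind, const char *s, size_t n)
{
    Log *log = ctx;
    if (log->len + n + 4 > sizeof log->text)
        return;                       /* drop events that do not fit */
    log->text[log->len++] = kind;
    log->text[log->len++] = ':';
    memcpy(log->text + log->len, s, n);
    log->len += n;
    log->text[log->len++] = ';';
    log->text[log->len] = '\0';
}

void log_start(void *ctx, const char *name) { log_event(ctx, 'S', name, strlen(name)); }
void log_end(void *ctx, const char *name)   { log_event(ctx, 'E', name, strlen(name)); }
void log_text(void *ctx, const char *t, size_t n) { log_event(ctx, 'T', t, n); }
```

Feeding "<P>Hel", then "lo</", then "P>" produces a start event for P, two
text events ("Hel" and "lo"), and an end event for P. Text is reported per
chunk rather than buffered, which is what allows a display to be updated
before the whole document has arrived.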
 
This is still only on the drawing board, so input is appreciated!

-- 

Henrik Frystyk Nielsen, <frystyk@w3.org>
World-Wide Web Consortium, MIT/LCS NE43-356
545 Technology Square, Cambridge MA 02139, USA