Question about libwww SGML/HTML parser...

I have started experimenting with the SGML (HTML DTD) parser provided
by libwww version 4.0B. I am using my own structured stream, but
was hoping to be able to use the HTMLPDTD.* files from the library.

It seems that for many of the HTML tags I get only a call to
"start_element" and no call to "end_element", unless the HTML
explicitly includes the end tag.

What is more, if I have "<P> ... </P>", I get the start_element call,
but </P> is totally ignored. I can see the "why" from the DTD entry

    { "P"	, l_attr,	HTML_L_ATTRIBUTES,	SGML_EMPTY },

but I am wondering: shouldn't <P> already be a "container", that is,
SGML_MIXED? A similar question arises for some other tags, such as
"LI" (we can have nested lists).

I guess this all comes from the fact that the library does not really
have a full SGML parser, and HTMLPDTD does not really define the full
"DTD". It seems that a large part of the DTD structuring rules (which
tags are allowed within and after which tag) must be implemented in
the start_element/end_element calls.

The question: should there be a structured stream very much like
what you get with the SGML + HTMLPDTD.c combination, but which would
supply the missing rules, so that an application could rely on
getting *all* end_element calls, whether the original HTML had them
or not? For example, the markup

	<UL>
	  <LI> text
	  <LI> <UL> <LI> <P>text</P><P>text</P> </UL>
	</UL>

would instead of

	begin UL
	begin LI
	begin LI
	begin UL
	begin LI
	begin P
	begin P
	end UL
	end UL

give
	begin UL
	begin LI
	end LI
	begin LI
	begin UL
	begin LI
	begin P
	end P
	begin P
	end P
	end LI
	end UL
	end LI
	end UL

With this, at least everyone would agree consistently on which tag
implicitly ends which.
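
To make the idea concrete, here is a minimal sketch in C of what such
a stream could do: it keeps a stack of open elements and emits the
implied end calls itself. Everything in it is hypothetical and not
part of libwww: the names (Balancer, balancer_start, balancer_end),
the element numbers, and in particular the rules in implicitly_ends(),
which are only my assumption of which tag ends which for UL, LI and P.
A real version would sit between the SGML parser and the application's
structured stream instead of writing to a text buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical element numbers; NOT the libwww HTML_* constants. */
enum { EL_UL, EL_LI, EL_P, EL_COUNT };
static const char *const el_name[EL_COUNT] = { "UL", "LI", "P" };

#define MAX_DEPTH 64

typedef struct {
    int  stack[MAX_DEPTH];  /* currently open elements, innermost last */
    int  depth;
    char out[1024];         /* event log standing in for the downstream stream */
} Balancer;

/* Append one "begin X" or "end X" line to the event log. */
static void emit(Balancer *b, const char *what, int el)
{
    size_t used = strlen(b->out);
    snprintf(b->out + used, sizeof b->out - used, "%s %s\n", what, el_name[el]);
}

/* Close the innermost open element and report its end. */
static void pop_and_end(Balancer *b)
{
    emit(b, "end", b->stack[--b->depth]);
}

/* Assumed rule table: does starting `next` implicitly end the open `open`? */
static int implicitly_ends(int next, int open)
{
    switch (next) {
    case EL_P:  return open == EL_P;
    case EL_LI: return open == EL_P || open == EL_LI;
    case EL_UL: return open == EL_P;
    default:    return 0;
    }
}

void balancer_init(Balancer *b)
{
    b->depth = 0;
    b->out[0] = '\0';
}

/* Start tag seen: first end whatever the new element implicitly closes. */
void balancer_start(Balancer *b, int el)
{
    while (b->depth > 0 && implicitly_ends(el, b->stack[b->depth - 1]))
        pop_and_end(b);
    assert(b->depth < MAX_DEPTH);
    b->stack[b->depth++] = el;
    emit(b, "begin", el);
}

/* End tag seen: end everything nested inside `el`, then `el` itself. */
void balancer_end(Balancer *b, int el)
{
    int i = b->depth - 1;
    while (i >= 0 && b->stack[i] != el)
        i--;
    if (i < 0)
        return;             /* stray end tag: nothing open to close */
    while (b->depth > i)
        pop_and_end(b);
}
```

Feeding this the start and end tags of the UL example above, in
document order, should produce the second, fully balanced listing.
The application then never has to know whether an end tag was
actually present in the HTML.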

Or is there something like this in the library already? As far as I
can see, the HText module goes much further and already interprets
more than some might want.

P.S. I am not on this list (www-lib), so please CC any replies to me.

--
Markku Savela (msa@hemuli.tte.vtt.fi),     Technical Research Centre of Finland
Multimedia Systems, P.O.Box 1203,FIN-02044 VTT,http://www.vtt.fi/tte/staff/msa/

Received on Monday, 15 January 1996 09:44:57 UTC