RE: Question about libwww SGML/HTML parser...

Hi!

> I have started experimenting with the SGML (HTML DTD) parser provided
> by the libwww 4.0B version. I am using my own structured stream, but
> was hoping to be able to use the HTMLPDTD.* in the library.
> 
> It seems that for many of the HTML tags, I get only call to the
> "start_element", but not to the "end_element", unless the HTML
> explicitly includes it.
> 
> And even more, if I have "<P> ... </P>", I get begin_element, but </P>
> is totally ignored. I can see the "why" from
>
>   { "P"	, l_attr,	HTML_L_ATTRIBUTES,	SGML_EMPTY },
>
> but, I am wondering, shouldn't <P> already be a "container", that is,
> SGML_MIXED. Similar question arises from some other tags, such as
> "LI" (we can have nested lists).

[ 
L a s t  N e w s:
Rainer Klute wrote:
> There's no such thing as an SGML parser in the library. There's
> just some code that has been called that way ages ago. Nobody knows
> the reason why.
With this in mind, while reading the following please change all occurrences of
"SGML parser" to "just some code" and "SGML parsing" to "just some coding".
   ;-)
]

I'm not an SGML expert, but perhaps my answer will help you.

I had a lot of problems with the SGML parsing (SGML.c and HTMLPDTD.c),
and I made some changes in my copy of the library code. Unfortunately
none of them has made it into the distribution version (that's partly my
fault: I sent one patch to Henrik, got no result, and haven't sent any more).
If you are interested, I can share my ideas with you. These include:

1. Better error recovery from ill-formed documents
2. Special handling for <P> tags (consider the HTML 2.0 construct <P ALIGN=...>)
3. Some other minor changes

For now I can say that I don't see anything wrong with only start_element
being called for some tags. This doesn't preclude using nested lists. For
example, my simple browser (written using the wwwlib) handles nested lists
very well (although it took me a lot of time to make everything work correctly).
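
To illustrate the idea, here is a minimal sketch of how an application could
track open elements itself when it only gets start_element for <P> and <LI>.
This is not the code from my browser or from the library: the element codes
and the names OpenStack, on_start, on_end and list_level are made up for the
example, and only the few block elements shown are handled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical element codes, not the numbering used in HTMLPDTD.h. */
enum { EL_UL, EL_OL, EL_LI, EL_P };

#define MAX_DEPTH 64

typedef struct {
    int open[MAX_DEPTH];  /* elements the application considers open */
    int depth;            /* number of valid entries in open[] */
} OpenStack;

static void stack_init(OpenStack *s)
{
    memset(s, 0, sizeof *s);
}

static void pop(OpenStack *s)
{
    assert(s->depth > 0);
    s->depth--;           /* a real application would finish the element here */
}

/* Call from the start_element handler. */
static void on_start(OpenStack *s, int el)
{
    /* An open <P> cannot hold another block element, so end it first. */
    if (s->depth > 0 && s->open[s->depth - 1] == EL_P)
        pop(s);
    /* A new <LI> ends the previous <LI> of the same list. */
    if (el == EL_LI && s->depth > 0 && s->open[s->depth - 1] == EL_LI)
        pop(s);
    if (s->depth < MAX_DEPTH)      /* deeper nesting is dropped in this sketch */
        s->open[s->depth++] = el;
}

/* Call from the end_element handler, which only sees explicit end tags. */
static void on_end(OpenStack *s, int el)
{
    int i = s->depth - 1;
    while (i >= 0 && s->open[i] != el)
        i--;
    if (i < 0)
        return;                    /* stray end tag: ignore it */
    while (s->depth > i)           /* close the match and everything inside it */
        pop(s);
}

/* Number of <UL>/<OL> currently open, i.e. the list nesting level. */
static int list_level(const OpenStack *s)
{
    int i, n = 0;
    for (i = 0; i < s->depth; i++)
        if (s->open[i] == EL_UL || s->open[i] == EL_OL)
            n++;
    return n;
}
```

With something like this, a nested <UL> inside an <LI> simply raises
list_level, and the missing </LI> and </P> calls no longer matter because the
next start or the enclosing end tag closes them.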

> I guess this all comes the fact that the library does not really have
> full SGML parser, and the HTMLPDTD does not really define the full
> "DTD". It seems that large part of the DTD structuring rules (which
> tags are allowed within and after which tag) must be implemented in
> the start_element/end_element calls.
>
> The question? I am wondering if there should be a structured stream
> very much like what you get with SGML + HTMLPDTD.c combination, but
> which would provide the missing rules and application could rely on
> getting *all* end_element calls, whether original HTML had them or
> not?

Hmmm... HTML documents found on the Internet are sometimes so ill-formed
that correcting them would be a really difficult task.
An application using the current output of the SGML parser should
not rely on anything. For example, you know that you can get <LI> without
</LI>, but do you know that you can also get <LI> without any <UL>, <OL>,
etc.?
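
As a small sketch of what "not relying on anything" can mean in practice, the
code below picks the marker for an <LI> and falls back to a plain bullet when
no list is open at all. The names ListContext, list_open, list_close and
li_marker are invented for this example and are not part of the library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LISTS 32

typedef enum { LIST_UL, LIST_OL } ListKind;

typedef struct {
    ListKind kind[MAX_LISTS];
    int      count[MAX_LISTS];  /* items seen so far in each open list */
    int      depth;             /* 0 means no list is open */
} ListContext;

static void lists_init(ListContext *c)
{
    memset(c, 0, sizeof *c);
}

static void list_open(ListContext *c, ListKind kind)
{
    if (c->depth < MAX_LISTS) {  /* deeper nesting is dropped in this sketch */
        c->kind[c->depth] = kind;
        c->count[c->depth] = 0;
        c->depth++;
    }
}

static void list_close(ListContext *c)
{
    if (c->depth > 0)            /* a stray </UL> or </OL> is ignored */
        c->depth--;
}

/* Write the marker for one <LI> into buf. */
static void li_marker(ListContext *c, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    if (c->depth > 0 && c->kind[c->depth - 1] == LIST_OL)
        snprintf(buf, size, "%d. ", ++c->count[c->depth - 1]);
    else
        snprintf(buf, size, "* ");  /* <UL>, or <LI> with no list at all */
}
```

The point is only that every handler checks the depth before looking at the
enclosing list, so a document that opens with a bare <LI> or closes a list
twice does no harm.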

> [snip...snip...]
> Or, is there such already in the library (as far as I can see the
> HText module goes much further, already interprets more than some
> might want...).

There is no HText module in the library (only the interface declaration).
Perhaps you are thinking of the HTML.c module. I agree, it sometimes does
strange things. This is the module I have changed the most.
The original version handles only HTML 1.0 and is designed more for
character-mode displays. My extensions make it capable of displaying
HTML documents in a graphical environment pretty well (e.g. it handles
nested styles). Unfortunately, I haven't added any of the complex HTML 2.0
features (e.g. forms, tables etc.). If you are interested, I can give you
the code and explanations.


Maciej Puzio
puzio@laser.mimuw.edu.pl

Received on Monday, 15 January 1996 11:34:01 UTC