
Re: More about structured streams, SGML/HTML parser...

From: Rainer Klute <klute@nads.de>
Date: Wed, 17 Jan 1996 15:49:50 +0100
Message-Id: <199601171449.PAA15796@heike.nads.de>
To: msa@hemuli.tte.vtt.fi (Markku Savela)
Cc: www-lib@w3.org, "Rainer Klute" <klute@nads.de>, "Jan Wedekind" <jan@todonix.ping.de>
>I wanted support for something that would parse HTML into a
>structured stream, and which would also give the missing end_element
>calls in the proper places (so that the actual browser part could
>concentrate on the presentation instead of having to deal with
>syntactic things that belong in the DTD). That is, separate the
>sometimes heuristic methods of dealing with invalid HTML, omitted end
>tag rules or different versions of HTML into a separate "unifier" stream.
>
>I am considering writing a structured stream that would be activated
>with something like this (to give you an idea of the context):
>
>	SGML_new(&HTMLP_dtd,
>	  HTML_Normalize(request, NULL,
>			 input_format, output_format,
>			 output_stream));

If you have a real SGML parser you won't need any special HTML
normalizer because the SGML parser would give you everything you
need. That is (would be :-() a tree structure of the elements of
the document in question including those elements where one or both
tags have been omitted due to the rules specified in the HTML DTD.
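
To make the end tag inference concrete, here is a minimal sketch in C
of what happens to omitted end tags in a <ul>/<li>/<p> fragment. It is
neither SP nor libwww code: the Normalizer type, the function names and
the hard-coded rules in implies_end() are hypothetical stand-ins for
what a real SGML parser would derive from the DTD.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 32

typedef struct {
    const char *open[MAX_DEPTH]; /* stack of currently open element names */
    int depth;
    char *out;                   /* output buffer, kept NUL-terminated */
    size_t cap, len;
} Normalizer;

/* Toy omitted-end-tag rules (assumption, not read from a DTD):
   a new <p> or <li> ends an open <p>; a new <li> ends an open <li>. */
static int implies_end(const char *open, const char *next_start)
{
    if (strcmp(open, "p") == 0)
        return strcmp(next_start, "p") == 0 || strcmp(next_start, "li") == 0;
    if (strcmp(open, "li") == 0)
        return strcmp(next_start, "li") == 0;
    return 0;
}

static void emit(Normalizer *nz, const char *pre, const char *s, const char *post)
{
    int n = snprintf(nz->out + nz->len, nz->cap - nz->len, "%s%s%s", pre, s, post);
    assert(n >= 0 && (size_t)n < nz->cap - nz->len); /* sketch: no overflow recovery */
    nz->len += (size_t)n;
}

static void start_element(Normalizer *nz, const char *name)
{
    /* Close every open element whose end tag is implied by this start tag. */
    while (nz->depth > 0 && implies_end(nz->open[nz->depth - 1], name))
        emit(nz, "</", nz->open[--nz->depth], ">");
    assert(nz->depth < MAX_DEPTH);
    nz->open[nz->depth++] = name;
    emit(nz, "<", name, ">");
}

static void end_element(Normalizer *nz, const char *name)
{
    int i = nz->depth - 1;
    while (i >= 0 && strcmp(nz->open[i], name) != 0)
        i--;
    if (i < 0)
        return; /* stray end tag: ignored in this sketch */
    /* Close everything opened inside the element, then the element itself. */
    while (nz->depth > i)
        emit(nz, "</", nz->open[--nz->depth], ">");
}

/* Feeds "<ul><li><p>foo<li><p>bar</ul>" through the normalizer. */
void normalize_list_example(char *out, size_t cap)
{
    Normalizer nz = { {0}, 0, out, cap, 0 };
    out[0] = '\0';
    start_element(&nz, "ul");
    start_element(&nz, "li");
    start_element(&nz, "p");
    emit(&nz, "", "foo", "");
    start_element(&nz, "li");
    start_element(&nz, "p");
    emit(&nz, "", "bar", "");
    end_element(&nz, "ul");
}
```

Called with a 128-byte buffer, normalize_list_example() leaves
<ul><li><p>foo</p></li><li><p>bar</p></li></ul> in it. A real parser
would build a tree rather than a string, but the stack discipline is
the same.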

The question is what an SGML parser will do when it encounters bad
HTML or markup not specified in the DTD. (I think there are still
HTML authors *not* using SGML editing tools, right? Oh, and those
HTML "extensions", which are so popular these days. You'll never
have a chance to catch them all in a DTD, not to speak of
extensions you cannot bring into an SGML form.) I think the SGML
parser should leave these constructs untouched as far as possible. 
It should not even insert closing tags into the output stream if
they are not present in the input. There is at least one very
popular browser that renders

	<ul>
	 <li><p>foo
	 <li><p>bar
	</ul>

and

	<ul>
	 <li><p>foo</p></li>
	 <li><p>bar</p></li>
	</ul>

differently.
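
As a rough illustration of that conservative behaviour, the following
C sketch copies its input through, lower-casing only the names of
elements it knows and leaving everything else byte for byte as it
found it. It inserts no end tags at all. The known_elements table and
the passthrough() function are assumptions made up for this example;
a real converter would consult the DTD instead, and comments or
declarations are simply counted as unknown tags here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the element declarations of a DTD. */
static const char *const known_elements[] = { "ul", "li", "p" };

/* Case-insensitive lookup of name[0..len) in the table above. */
static int is_known(const char *name, size_t len)
{
    size_t i, k;
    for (i = 0; i < sizeof known_elements / sizeof known_elements[0]; i++) {
        const char *e = known_elements[i];
        if (strlen(e) != len)
            continue;
        for (k = 0; k < len; k++)
            if (tolower((unsigned char)name[k]) != e[k])
                break;
        if (k == len)
            return 1;
    }
    return 0;
}

/* Copies src to dst (truncating at cap - 1 bytes). Tags of known
   elements get their names lower-cased; unknown tags are copied
   verbatim. Returns the number of unknown tags seen. */
size_t passthrough(const char *src, char *dst, size_t cap)
{
    size_t unknown = 0, o = 0;
    assert(src != NULL && dst != NULL && cap > 0);
    while (*src != '\0') {
        const char *gt = (*src == '<') ? strchr(src, '>') : NULL;
        if (gt != NULL) {
            const char *name = src + 1;
            size_t len = 0;
            int known;
            if (*name == '/')
                name++;                     /* end tag */
            while (name + len < gt && isalnum((unsigned char)name[len]))
                len++;
            known = is_known(name, len);
            if (!known)
                unknown++;
            for (; src <= gt; src++) {      /* copy the tag including '>' */
                char c = *src;
                if (known && src < name + len)
                    c = (char)tolower((unsigned char)c);
                if (o + 1 < cap)
                    dst[o++] = c;
            }
        } else {
            if (o + 1 < cap)
                dst[o++] = *src;            /* character data, or '<' without '>' */
            src++;
        }
    }
    dst[o] = '\0';
    return unknown;
}
```

For the input <UL><LI><BLINK>foo</BLINK></UL> this writes
<ul><li><BLINK>foo</BLINK></ul> and returns 2: the two BLINK tags are
reported as unknown but survive untouched, and no </li> is invented.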

We are currently considering experimenting with Jim Clark's SGML
parser SP in order to munge it into a libwww converter and to do
some other things with it. Unfortunately it is still without
documentation. Does anyone have experience with it? You are invited
to join us in guessing around!

  Dipl.-Inform. Rainer Klute        NADS - Advertising on nets
  NADS GmbH
  Emil-Figge-Str. 80                Tel.: +49 231 9742570
  D-44227 Dortmund                  Fax:  +49 231 9742573

            <http://www.nads.de/~klute/>
Received on Wednesday, 17 January 1996 09:50:00 GMT
