- From: Rainer Klute <klute@nads.de>
- Date: Wed, 17 Jan 1996 15:49:50 +0100
- To: msa@hemuli.tte.vtt.fi (Markku Savela)
- Cc: www-lib@w3.org, "Rainer Klute" <klute@nads.de>, "Jan Wedekind" <jan@todonix.ping.de>
>I wanted a support for something that would parse HTML as into
>structured stream, and which would also give the missing end_element
>calls, in proper places (so that actual browser part could concentrate
>on the presentation instead of having to deal with syntactic things
>that belong into the DTD). That is, separate the sometimes heuristic
>methods of dealing with invalid HTML, omitted end tag rules or
>different versions of HTML into a separate "unifier" stream.
>
>I am considering writing a structured stream that would be activated
>with something like (to give you idea of the context)..
>
>    SGML_new(&HTMLP_dtd,
>             HTML_Normalize(request, NULL,
>                            input_format, output_format,
>                            output_stream));

If you have a real SGML parser you won't need any special HTML
normalizer because the SGML parser would give you everything you need.
That is (would be :-() a tree structure of the elements of the document
in question, including those elements where one or both tags have been
omitted due to the rules specified in the HTML DTD.

The question is what an SGML parser will do when it encounters bad HTML
or markup not specified in the DTD. (I think there are still HTML
authors *not* using SGML editing tools, right? Oh, and those HTML
"extensions", which are so popular these days. You'll never have a
chance to catch them all in a DTD, not to speak of extensions you cannot
bring into an SGML form.)

I think the SGML parser should leave these constructs untouched as far
as possible. It should not even insert closing tags into the output
stream if they are not present in the input. There is at least one very
popular browser that renders

    <ul>
    <li><p>foo
    <li><p>bar
    </ul>

and

    <ul>
    <li><p>foo</p></li>
    <li><p>bar</p></li>
    </ul>

differently.

We are currently considering experimenting with Jim Clark's SGML parser
SP in order to munge it into a libwww converter and to do some other
things with it. Unfortunately it is still without documentation. Does
anyone have experience with it?
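To make the omitted-tag point concrete, here is a minimal sketch in Python of how implied end tags could be reported without being written back into the output. It is not SP and not libwww: the IMPLIED_END and SCOPE tables are a hypothetical, heavily simplified stand-in for the omitted-tag rules of the HTML DTD, and EventCollector and events() are names made up for this illustration. Each event carries an "implied" flag, so a consumer can see the full element structure and still leave the original markup untouched.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified stand-in for the omitted-end-tag rules of
# the HTML DTD: a start tag (key) implies the end of these open elements.
IMPLIED_END = {"li": {"li", "p"}, "p": {"p"}}

# Implied closing never crosses these elements, so an <li> cannot close
# an <li> that belongs to an enclosing list.
SCOPE = {"ul", "ol"}


class EventCollector(HTMLParser):
    """Collect (kind, value, implied) events, marking inferred end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.events = []  # (kind, value, implied) tuples

    def _pop(self, implied):
        self.events.append(("end", self.stack.pop(), implied))

    def handle_starttag(self, tag, attrs):
        closes = IMPLIED_END.get(tag, set())
        cut = None
        # Walk from the innermost open element outwards, up to the scope
        # boundary, remembering the outermost element this tag closes.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] in SCOPE:
                break
            if self.stack[i] in closes:
                cut = i
        if cut is not None:
            while len(self.stack) > cut:
                self._pop(implied=True)
        self.stack.append(tag)
        self.events.append(("start", tag, False))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: simply ignored in this sketch
        # Everything opened inside the element ends implicitly with it.
        while self.stack[-1] != tag:
            self._pop(implied=True)
        self._pop(implied=False)

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(("data", data, False))


def events(html):
    """Return the event list for an HTML fragment."""
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close whatever is still open at end of input
        parser._pop(implied=True)
    return parser.events
```

Calling events() on the two list fragments above yields the same sequence of start, data and end events apart from trailing whitespace in the text; only the implied flags differ. The sketch does not know about empty elements such as <br>, and it makes no attempt to handle markup outside its two small tables.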
You are invited to join us guessing around!

Dipl.-Inform. Rainer Klute      NADS - Advertising on nets
NADS GmbH
Emil-Figge-Str. 80              Tel.: +49 231 9742570
D-44227 Dortmund                Fax:  +49 231 9742573
<http://www.nads.de/~klute/>
Received on Wednesday, 17 January 1996 09:50:00 UTC