
Re: More about structured streams, SGML/HTML parser...



Hi,

>>I wanted support for something that would parse HTML into a
>>structured stream, and which would also give the missing end_element
>>calls, in proper places (so that actual browser part could concentrate
>>on the presentation instead of having to deal with syntactic things
>>that belong into the DTD). That is, separate the sometimes heuristic
>>methods of dealing with invalid HTML, omitted end tag rules or
>>different versions of HTML into a separate "unifier" stream.
>>
>>I am considering writing a structured stream that would be activated
>>with something like (to give you an idea of the context):
>>
>>	SGML_new(&HTMLP_dtd,
>>	  HTML_Normalize(request, NULL,
>>			 input_format, output_format,
>>			 output_stream));
>
>If you have a real SGML parser you won't need any special HTML
>normalizer because the SGML parser would give you everything you
>need. That is (would be :-() a tree structure of the elements of
>the document in question including those elements where one or both
>tags have been omitted due to the rules specified in the HTML DTD.
>
>The question is what an SGML parser will do when it encounters bad
>HTML or markup not specified in the DTD. (I think there are still
>HTML authors *not* using SGML editing tools, right? Oh, and those
>HTML "extensions", which are so popular these days. You'll never
>have a chance to catch them all in a DTD, not to speak of
>extensions you cannot bring into an SGML form.) I think the SGML
>parser should leave these constructs untouched as far as possible. 
>It should not even insert closing tags into the output stream if
>they are not present in the input. There is at least one very
>popular browser that renders
>
>	<ul>
>	 <li><p>foo
>	 <li><p>bar
>	</ul>
>
>and
>
>	<ul>
>	 <li><p>foo</p></li>
>	 <li><p>bar</p></li>
>	</ul>
>
>differently.
>
>We are currently considering experimenting with Jim Clark's SGML
>parser SP in order to munge it into a libwww converter and to do
>some other things with it. Unfortunately it is still without
>documentation. Does anyone have experience with it? You are invited
>to join us guessing around!

We have the same kind of need, but we would rather use:

SGML_new(&HTMLP_dtd,
         HTML_Canonize(request, NULL,
			 input_format, output_format,
			 output_stream));

This just canonizes the SGML stream on output, after which we do our own
processing of the data. For example, this canonizer would add the omitted
end tags.

I do not agree that
	<ul>
	 <li><p>foo
	 <li><p>bar
	</ul>
and
	<ul>
	 <li><p>foo</p></li>
	 <li><p>bar</p></li>
	</ul>
should be rendered differently, because the content is exactly the same.
The parsing or canonizing process should, however, be tolerant enough to let
a faulty stream pass in some situations. It should generate warnings so that
the browser can decide what to do.

We are also considering using SGMLS, NSGMLS, SP (...) to canonize on the fly.
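
To make the idea concrete, here is a rough sketch in C of what such a
canonizing step could do for the list example above. It is not libwww code
and not the HTML_Canonize stream itself: the names canonize() and Canon are
made up for illustration, the omitted-tag rules are hard-coded for <ul>,
<li> and <p> instead of being read from the DTD, and attributes, empty
elements such as <br> and comments are not handled. An end tag with no
matching open element is passed through unchanged and counted as a warning.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 32
#define MAX_NAME  16

/* Hypothetical canonizer state: stack of open elements plus output buffer. */
typedef struct {
    char   open[MAX_DEPTH][MAX_NAME];
    int    depth;
    int    warnings;
    char  *out;
    size_t cap, len;
} Canon;

/* Append n bytes to the output, truncating silently if the buffer is full. */
static void emit(Canon *c, const char *s, size_t n)
{
    if (c->len + n >= c->cap) n = c->cap - 1 - c->len;
    memcpy(c->out + c->len, s, n);
    c->len += n;
    c->out[c->len] = '\0';
}

/* Pop the innermost open element and write its end tag. */
static void close_top(Canon *c)
{
    const char *name = c->open[--c->depth];
    emit(c, "</", 2); emit(c, name, strlen(name)); emit(c, ">", 1);
}

/* Index of the innermost open element called name, or -1 if none. */
static int find_open(const Canon *c, const char *name)
{
    for (int i = c->depth - 1; i >= 0; i--)
        if (strcmp(c->open[i], name) == 0) return i;
    return -1;
}

/* Close every open element from the top of the stack down to index i. */
static void close_down_to(Canon *c, int i)
{
    while (c->depth > i) close_top(c);
}

/* Copy in to out, adding omitted end tags. Returns the warning count. */
int canonize(const char *in, char *out, size_t cap)
{
    Canon c = {0};
    assert(out != NULL && cap > 0);
    c.out = out; c.cap = cap;
    out[0] = '\0';

    while (*in) {
        const char *gt = (*in == '<') ? strchr(in, '>') : NULL;
        if (!gt) {                      /* text run (or a '<' with no '>') */
            size_t n = strcspn(in + 1, "<") + 1;
            emit(&c, in, n);
            in += n;
            continue;
        }
        int is_end = (in[1] == '/');
        char name[MAX_NAME];
        size_t k = 0;
        for (const char *p = in + 1 + is_end;
             p < gt && isalnum((unsigned char)*p) && k < MAX_NAME - 1; p++)
            name[k++] = (char)tolower((unsigned char)*p);
        name[k] = '\0';

        if (is_end) {
            int i = find_open(&c, name);
            if (i >= 0) {
                close_down_to(&c, i);   /* also closes inner elements */
            } else {                    /* stray end tag: warn, pass through */
                c.warnings++;
                emit(&c, in, (size_t)(gt - in) + 1);
            }
        } else {
            if (strcmp(name, "li") == 0) {
                /* A new <li> ends the previous <li> of the same list. */
                int li = find_open(&c, "li"), ul = find_open(&c, "ul");
                if (li > ul) close_down_to(&c, li);
            } else if (strcmp(name, "p") == 0 && c.depth > 0 &&
                       strcmp(c.open[c.depth - 1], "p") == 0) {
                close_top(&c);          /* a new <p> ends an open <p> */
            }
            if (c.depth < MAX_DEPTH) strcpy(c.open[c.depth++], name);
            else c.warnings++;          /* nesting too deep to track */
            emit(&c, in, (size_t)(gt - in) + 1);
        }
        in = gt + 1;
    }
    close_down_to(&c, 0);               /* close whatever is still open */
    return c.warnings;
}
```

With this sketch both forms of the list come out identical, so a browser
reading the canonized stream has no reason to treat them differently. The
warning count is the hook for the tolerant behaviour: the faulty markup
still passes, and the caller can decide what to do about it.
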
|-----------------------------------------------------------------|
| Laurent Vinesse                        |  JOUVE                 |
| Ingenieur logiciel / Software engineer |  12, rue des landelles |
| Telephone/Phone  : (+33) 99 86 98 12   |  Immeuble Hercule II   |
| Fax              : (+33) 99 86 98 01   |  35510 Cesson-Sevigne  |
| E-mail           : lvinesse@jouve.fr   |  FRANCE                |
|-----------------------------------------------------------------|


