- From: Arkin <arkin@trendline.co.il>
- Date: Thu, 25 Feb 1999 11:25:56 -0500
- To: Oliver Becker <obecker@informatik.hu-berlin.de>
- CC: www-dom@w3.org
> Strictly spoken is a HTML processor at present a specific SGML processor.
> That means e.g. (according to the HTML DTD) some start or end tags of
> elements may be omitted.

If you look at the HTML DTD you'll notice that it is a valid SGML DTD,
but not a valid XML DTD. Optional start and end tags are one of the
major differences.

> If we have a HTML DTD in XML then all tags must appear. Omitting tags
> is not allowed any longer. For browsers this is again a theoretical
> demand: what to do if an author doesn't play the game by the rules?

That's why an XML parser cannot read HTML, unless its purpose is to
complain about the missing structure.

In HTML many elements are assumed to exist. For example, the HEAD
element always exists, even if the tag is not in the document. If there
is no HTML, BODY or HEAD, everything goes inside the BODY. P can enclose
or just terminate a paragraph. LI begins a list item, and everything
following it up to the next LI or the closing UL/OL is a child of that
LI. Free-floating text inside a table is considered a row or a cell,
depending on its context, and so on.

All these strange rules exist because browsers are not expected to
report parsing errors to users, and Web masters are expected to produce
invalid documents. HTML is not an information structure and need not be
as strict or well-formed as XML.

> > 2. PRE, STYLE and SCRIPT are specific cases in HTML, unlike other
> > elements. They are whitespace preserving and do not process elements in
> > their content.
>
> Sorry, that's not correct. E.g. PRE may contain special elements like A
> or IMG, phrase elements like EM and STRONG, and even form control elements.

I stand corrected on that one. PRE may contain elements (STYLE and
SCRIPT do not), but has special processing rules for dealing with
whitespace. This is the only case in which tab, newline and space are
treated differently.

> > 6. Without a validating XML processor, XML elements should attempt to
> > ignore as much whitespace as possible, regarding it as human readable
> > whitespace.
>
> I agree.
> But as I see from other postings the opinions, if whitespaces should be
> reported or not, are quite different.

Reporting back to the application is an interesting issue. SAX parsers
tend to report redundant whitespace as such to the application, so the
application can choose whether to discard it or not. However, many
applications prefer to work with a full DOM tree rather than build one
themselves from the parser's output. So applications either have to
skip redundant whitespace between elements themselves, or not.

Applications may prefer not to use a validating parser if they assume
the document is valid and would prefer faster parsing. In that case,
the non-validating processor should behave reasonably well.

Arkin

> I should think about it a little while ...
>
> Cheers,
> Oliver
>
> /-------------------------------------------------------------------\
> |  ob|do    Dipl.Inf. Oliver Becker                                 |
> |  --+--    E-Mail: obecker@informatik.hu-berlin.de                 |
> |  op|qo    WWW:    http://www.informatik.hu-berlin.de/~obecker     |
> \-------------------------------------------------------------------/
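As a rough illustration of the whitespace point above, here is a minimal
sketch of an application skipping redundant whitespace in a DOM tree that
was built without validation. It uses Python's xml.dom.minidom only as a
stand-in for "a non-validating processor plus DOM"; the helper name
strip_ignorable_whitespace and the sample document are hypothetical, not
something from this thread.

```python
from xml.dom import minidom, Node


def strip_ignorable_whitespace(node):
    """Remove whitespace-only text nodes below `node`, recursively.

    Assumption: with no DTD available, any text node made up solely of
    whitespace is treated as human-readable indentation.
    """
    # Iterate over a copy, because children are removed during the walk.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_ignorable_whitespace(child)


# Hypothetical sample document, indented for human readers.
doc = minidom.parseString(
    "<list>\n  <item>one</item>\n  <item>two words</item>\n</list>"
)
root = doc.documentElement

before = len(root.childNodes)   # elements plus indentation text nodes
strip_ignorable_whitespace(root)
after = len(root.childNodes)    # elements only

print(before, after)
```

Note that this heuristic is only a guess on the application's part: in
mixed content a whitespace-only text node between two inline elements can
be significant, and without the DTD the processor has no way to tell the
two cases apart.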
Received on Thursday, 25 February 1999 11:32:31 UTC