Re: "Empty" Text Nodes from Arkin on 1999-02-25 (www-dom@w3.org from January to March 1999)

From: Arkin <arkin@trendline.co.il>
Date: Thu, 25 Feb 1999 11:25:56 -0500
To: Oliver Becker <obecker@informatik.hu-berlin.de>
CC: www-dom@w3.org
Message-ID: <36D57994.50C2B77F@trendline.co.il>
> Strictly speaking, an HTML processor is at present a specific SGML
> processor. That means e.g. (according to the HTML DTD) some start or
> end tags of elements may be omitted.

If you look at the HTML DTD you'll notice that it is a valid SGML DTD,
but not a valid XML DTD. Optional start and end tags are one of the
major differences.

> If we have an HTML DTD in XML then all tags must appear. Omitting tags
> is not allowed any longer. For browsers this is again a theoretical
> demand: what to do if an author doesn't play the game by the rules?

That's why an XML parser cannot read HTML, unless its purpose is to
complain about the missing structure. In HTML many elements are assumed
to exist. For example, the HEAD element always exists, even if the tag
is not in the document. If there is no HTML, BODY or HEAD, everything
goes inside the BODY. P can enclose or just terminate a paragraph. LI
begins a list item, and everything that follows, up to the next LI or
the closing UL/OL, is a child of that LI. Free floating text inside a
table is considered a row or a cell, depending on its context, and so on.
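
To make the point concrete, here is a minimal Python sketch (the
standard library modules are real; HTML_SOURCE and the helper names are
made up for illustration). It feeds the same tag-omitting HTML to an XML
parser, which rejects it, and to a tolerant HTML tokenizer, which
accepts it. Note that html.parser only tokenizes; it does not build the
implied HEAD/BODY structure described above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Illustrative input: no HTML/HEAD/BODY tags, no end tags for P and LI.
# Acceptable as HTML, but not well-formed XML.
HTML_SOURCE = "<title>Demo</title><p>First<p>Second<ul><li>One<li>Two</ul>"


def xml_accepts(source):
    """Return True if an XML parser can read the text, False otherwise."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        # All the XML parser can do here is complain about the structure.
        return False


class TagLogger(HTMLParser):
    """Record start tags in the order a tolerant HTML tokenizer sees them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(source):
    """Tokenize HTML without raising on omitted tags."""
    logger = TagLogger()
    logger.feed(source)
    logger.close()
    return logger.start_tags
```

Turning those tokens into a tree with implied elements is the job of a
real HTML parser and is deliberately left out of the sketch.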

All these strange rules exist because browsers are not expected to
report parsing errors to the users, and Web masters are expected to
produce invalid documents. HTML is not an information structure and need
not be as strict or well formed as XML.

> > 2. PRE, STYLE and SCRIPT are specific cases in HTML, unlike other
> > elements. They are whitespace preserving and do not process elements in
> > their content.
> 
> Sorry, that's not correct. E.g. PRE may contain special elements like A
> or IMG, phrase elements like EM and STRONG, and even form control elements.

I stand corrected on that one. PRE may contain elements (STYLE and
SCRIPT do not), but it has special processing rules for dealing with
whitespace. This is the only occurrence in which tab, newline and space
are treated differently.
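
As a rough illustration of that difference, the following Python sketch
collapses whitespace the way ordinary HTML text is usually rendered and
leaves it untouched inside PRE. The function name and the inside_pre
flag are assumptions for the example, and the trimming at both ends is a
simplification of what browsers really do.

```python
import re


def render_text(text, inside_pre):
    """Return text as it would roughly appear when rendered.

    Inside PRE, tab, newline and space are kept exactly as written.
    Elsewhere, any run of them collapses to a single space (simplified:
    leading and trailing whitespace is dropped as well).
    """
    if inside_pre:
        return text
    return re.sub(r"[ \t\r\n]+", " ", text).strip()
```

For example, render_text("a\t b\n\n c", False) gives "a b c", while the
same call with inside_pre=True returns the string unchanged.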

> > 6. Without a validating XML processor, XML elements should attempt to
> > ignore as much whitespace as possible, regarding it as human readable
> > whitespace.
> 
> I agree.
> But as I see from other postings, opinions on whether whitespace
> should be reported or not are quite different.

Reporting back to the application is an interesting issue. SAX parsers
tend to report redundant whitespace as such to the application, so the
application can choose whether to discard it or not. However, more
applications prefer to work with a full DOM tree than to build one
themselves from what the parser reports.
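
A small sketch of the SAX side, in Python, under the assumption that the
application decides for itself what counts as redundant. The handler
class and the parse_with_whitespace_log helper are hypothetical names.
Python's default SAX driver is non-validating, so it hands whitespace to
characters(); ignorableWhitespace() is the callback a validating parser
would use instead, and the handler treats both the same way.

```python
import xml.sax


class WhitespaceAwareHandler(xml.sax.ContentHandler):
    """Keep real text and whitespace-only chunks in separate lists."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.whitespace_only = []

    def characters(self, content):
        # A non-validating parser cannot tell which whitespace is
        # ignorable, so the application classifies each chunk itself.
        if content.strip() == "":
            self.whitespace_only.append(content)
        else:
            self.text.append(content)

    def ignorableWhitespace(self, whitespace):
        # Called only by parsers that know the content model.
        self.whitespace_only.append(whitespace)


def parse_with_whitespace_log(source):
    """Parse an XML string and return the populated handler."""
    handler = WhitespaceAwareHandler()
    xml.sax.parseString(source.encode("utf-8"), handler)
    return handler
```

The parser may deliver character data in several chunks, so the lists
should be joined before comparing them with anything.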

So applications either have to skip redundant whitespace in between
elements, or not. Applications may prefer not to use a validating parser
if they assume the document is valid and would prefer faster parsing. In
that case, the non-validating processor should behave reasonably well.
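
For the DOM case, this is roughly what an application ends up writing
when the parser does not skip the whitespace for it. It is a sketch
using Python's xml.dom.minidom; strip_whitespace_nodes and parse_compact
are illustrative names. It assumes every whitespace-only text node is
redundant, which is wrong for mixed content, and that is exactly the
knowledge a DTD would supply.

```python
from xml.dom import Node, minidom


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, since children are removed during the walk.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and child.data.strip() == "":
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)


def parse_compact(source):
    """Parse without validation, then drop whitespace-only text nodes."""
    document = minidom.parseString(source)
    strip_whitespace_nodes(document.documentElement)
    return document
```

After parse_compact("<a>\n  <b>hi</b>\n  <c/>\n</a>") the root element
has just the two element children b and c, and the text "hi" inside b is
left alone.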

Arkin

> 
> I should think about it a little while ...
> 
> Cheers,
> Oliver
> 
> /-------------------------------------------------------------------\
> |  ob|do        Dipl.Inf. Oliver Becker                             |
> |  --+--        E-Mail: obecker@informatik.hu-berlin.de             |
> |  op|qo        WWW:    http://www.informatik.hu-berlin.de/~obecker |
> \-------------------------------------------------------------------/
Received on Thursday, 25 February 1999 11:32:31 UTC