Re: "Empty" Text Nodes from Arkin on 1999-02-26 (www-dom@w3.org from January to March 1999)

From: Arkin <arkin@trendline.co.il>
Date: Fri, 26 Feb 1999 11:32:22 -0500
CC: www-dom@w3.org
Message-ID: <36D6CC96.C6BC60C6@trendline.co.il>
Oliver Becker wrote:
> 
> Ok, let's summarize:
> 
> The XML spec states that "An XML processor must always pass all
> characters in a document that are not markup through to the
> application".
> 
> My question: on which level the DOM is aligned?
> Is it attached to a XML parser, or is it more
> attached to a XML application?

I think this is one point where the XML/DOM specification totally blew
it off. The above specification works well if your application is
responsible for constructing the document tree from a parsed document.
That's how the SAX parser API works, and you application can accept or
reject whitespaces. The parser will let your application know when text
is just whitespace.

In real life, most applications will just call some parser or other
black box function and get a full DOM document tree in return. They will
then iterate through the tree and either meet or won't meet whitespaces.
I believe this specific use of the DOM is going to be very common, and
is not clearly defined.

Assume that you write an application that retrieves a DOM document tree
in some way (say, from a CORBA server). You then start iterating through
it looking for interesting information. When testing the application,
you use a validating parser, so whitespaces are removed from the
document. You then run the application but notice a degradation in
performance, so you turn the validation off, figuring all documents are
correct anyway. Now it breaks, because the DOM document tree contains
whitespaces that previously did not exist, and your application does not
skip them.

It could even break if you opted to use a different parser, or
reformatted the document.

It's not big deal to either transform the DOM tree to lose whitespaces,
use a suitable iterator to skip them (org.w3c.dom.fi.NodeIterator), or
simply expect whitespaces. Only problem is, it does not appear in the
specs, and so will most likely not appear in any books, articles or code
examples we're likely to read, so we'll never know about it until the
first time we write code.

I think the W3C has the obligation to strictly clarify this point in the
specification, so that: a) all parsers behave consistently, b) any
undefined behavior is left as an option, not the default, c)
applications can be written with more confidence. From the specification
the issue will likely propogate to books, articles and knowledge bases,
where even those not technically inclined to reading terse RFCs will
find it.

Arkin


> In the latter case (validating parser with known DTD) the object
> structure should not contain text nodes with whitespaces if the
> content model of the element doesn't contain #PCDATA and xml:space
> is not "preserve".
> 
> On the other hand it's no big effort transforming a DOM tree with
> whitespace text nodes into another one without them ...
> 
> Regards,
> Oliver
> 
> /-------------------------------------------------------------------\
> |  ob|do        Dipl.Inf. Oliver Becker                             |
> |  --+--        E-Mail: obecker@informatik.hu-berlin.de             |
> |  op|qo        WWW:    http://www.informatik.hu-berlin.de/~obecker |
> \-------------------------------------------------------------------/
Received on Friday, 26 February 1999 11:38:23 UTC