Re: "Empty" Text Nodes

> I think this is one point where the XML/DOM specification totally blew
> it off. The above specification works well if your application is
> responsible for constructing the document tree from a parsed document.
> That's how the SAX parser API works, and you application can accept or
> reject whitespaces. The parser will let your application know when text
> is just whitespace.
> 
> In real life, most applications will just call some parser or other
> black box function and get a full DOM document tree in return. They will
> then iterate through the tree and either meet or won't meet whitespaces.
> I believe this specific use of the DOM is going to be very common, and
> is not clearly defined.
> 
> Assume that you write an application that retrieves a DOM document tree
> in some way (say, from a CORBA server). You then start iterating through
> it looking for interesting information. When testing the application,
> you use a validating parser, so whitespaces are removed from the
> document. You then run the application but notice a degradation in
> performance, so you turn the validation off, figuring all documents are
> correct anyway. Now it breaks, because the DOM document tree contains
> whitespaces that previously did not exist, and your application does not
> skip them.
> 
> It could even break if you opted to use a different parser, or
> reformatted the document.
> 
> It's not big deal to either transform the DOM tree to lose whitespaces,
> use a suitable iterator to skip them (org.w3c.dom.fi.NodeIterator), or
> simply expect whitespaces. Only problem is, it does not appear in the
> specs, and so will most likely not appear in any books, articles or code
> examples we're likely to read, so we'll never know about it until the
> first time we write code.
> 
> I think the W3C has the obligation to strictly clarify this point in the
> specification, so that: a) all parsers behave consistently, b) any
> undefined behavior is left as an option, not the default, c)
> applications can be written with more confidence. From the specification
> the issue will likely propogate to books, articles and knowledge bases,
> where even those not technically inclined to reading terse RFCs will
> find it.

Well put.  I have in the past brought up the white-space issue, and others 
relating to server-side DOM utilization, including CORBA concerns, and none of 
the DOM dons in residence have ever deigned to comment or answer.

Microsoft and Netscape will adopt the DOM as their political agenda allows, 
but we server-side implementors have a strong interest in being strictly 
DOM-compliant ASAP.  That is why the DOM group's persistence in blowing off 
this important audience is rather puzzling.

In developing 4DOM, we had to adopt proprietary asumptions for a variety of 
points on which the spec was not clear, and it seems that the other main DOM 
libraries, such as that in xml4j, have made their own assumptions as well.  
Probably not ideal from the standardization point of view.

-- 
Uche Ogbuji
FourThought LLC, IT Consultants
uche.ogbuji@fourthought.com	(970)481-0805
Software engineering, project management, Intranets and Extranets
http://FourThought.com		http://OpenTechnology.org

Received on Saturday, 27 February 1999 01:06:19 UTC