Is it right to remove whitespace nodes stemming from CDATA sections? (No, I think!) from Nils Klarlund on 2000-02-08 (xsl-editors@w3.org from January to March 2000)

From: Nils Klarlund <klarlund@research.att.com>
Date: Tue, 8 Feb 2000 15:32:03 -0500
To: <xsl-editors@w3.org>
Cc: "klarlund" <klarlund@research.att.com>
Message-ID: <04bb01bf7273$90095740$b2e3cf87@research.att.com>

I believe that the way CDATA sections are treated in XPath/XSLT is not
compatible with the latest Errata to XML 1.0
(http://www.w3.org/XML/xml-19980210-errata).


Moreover, the way CDATA sections are treated makes it impossible to
adopt a simple view of XML, namely to remove all whitespace nodes,
without a provable loss of expressive power!  This radical pruning
view is desirable for many applications, especially for database
applications, but also for document-oriented processing, where the
usual semantics, which introduces tons of whitespace nodes, is an
aesthetic and practical problem.

The problem is that even a very explicitly marked whitespace such as

<![CDATA[ ]]>

is eaten up unless it is accompanied by non-whitespace characters.
So, I can't insert spaces between nodes!
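
A minimal sketch of the problem, assuming Python's
xml.etree.ElementTree as a stand-in for a data model that, like
XPath's, does not record where CDATA sections began or ended.  The
helper strip_whitespace_text is hypothetical and only imitates the
"prune all whitespace nodes" view; it is not how any particular XSLT
processor is implemented.

```python
import xml.etree.ElementTree as ET

# The characters matched by the XML nonterminal S.
XML_WS = " \t\r\n"

def strip_whitespace_text(elem):
    """Remove whitespace-only text from elem and its descendants, in place."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip(XML_WS):
            node.text = None
        if node.tail is not None and not node.tail.strip(XML_WS):
            node.tail = None
    return elem

plain = ET.fromstring("<p><b>a</b> <i>b</i></p>")
marked = ET.fromstring("<p><b>a</b><![CDATA[ ]]><i>b</i></p>")

# Both documents parse to the same tree: the space is the tail of <b>
# in each case, so the CDATA marking is no longer visible.
same_before = plain[0].tail == marked[0].tail == " "

strip_whitespace_text(plain)
strip_whitespace_text(marked)

# After pruning, the explicitly marked space is gone as well.
lost = marked[0].tail is None
text_after = "".join(marked.itertext())
```

Once the two inputs are indistinguishable in the tree, no pruning
rule applied afterwards can keep one space and drop the other.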

In other words, assuming that it is unreasonable that a DTD or
application should make decisions about which whitespace nodes are for
real and which are not, I'm in trouble: I want to prune all whitespace
nodes, except those that I mark as important.

Clearly, as indicated in the section below, XML 1.0 makes a semantic
distinction between ' ' and <![CDATA[ ]]>.  Thus, XSLT cannot be used
to determine whether some content is "element content".  Does it
not seem a mistake to water down XPath to that point?

I suggest that the stripping of whitespace nodes explicitly exclude
nodes obtained from or involving CDATA sections.
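
The suggested rule can be sketched as follows, assuming a tree that
still remembers CDATA sections.  Python's xml.dom.minidom is used
here only because it keeps CDATA sections as separate nodes; the
function prune_whitespace is a hypothetical illustration of the
proposal, not an existing XSLT or XPath facility.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# The characters matched by the XML nonterminal S.
XML_WS = " \t\r\n"

def prune_whitespace(node):
    """Drop whitespace-only text nodes, but never touch CDATA sections."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip(XML_WS):
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            prune_whitespace(child)

doc = minidom.parseString("<p>\n  <b>a</b><![CDATA[ ]]><i>b</i>\n</p>")
root = doc.documentElement
prune_whitespace(root)

# The indentation is gone; the space marked with CDATA survives.
kinds = [child.nodeType for child in root.childNodes]
result = root.toxml()
```

With this rule, ordinary indentation is pruned while whitespace that
the author marked with a CDATA section is left in place.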

Thanks

/Nils

From Errata:

Section 3

Change item number 2 of the list of valid cases for the "Element Valid" VC
to read:

The declaration matches children and the sequence of child elements
belongs to the language generated by the regular expression in the
content model, with optional white space (characters matching the
nonterminal S) between the start tag and the first child element,
between child elements or between the last child element and the end
tag. Note that a CDATA section containing only white space does not
match the nonterminal S, and hence cannot appear in these positions.

Received on Tuesday, 8 February 2000 15:37:50 UTC