[DM] white space

From: David Carlisle <davidc@nag.co.uk>
Date: Fri, 5 Dec 2003 16:30:26 GMT
Message-Id: <200312051630.QAA17058@penguin.nag.co.uk>
To: public-qt-comments@w3.org

The doc() function in F&O (and indirectly the document() function in
XSLT) specifies that if the representation of a resource returned from
some URI is an XML file then the input tree should be constructed as
specified in DM, modulo some specific implementation-dependent features
such as which URI schemes are supported.

In DM it says:

  6.7.3 Construction from an Infoset

  Applications may construct text nodes in the data model to represent
  insignificant white space. This decision is considered outside the scope
  of the data model, consequently the data model makes no attempt to
  control or identify if any or all insignificant white space is ignored 

This appears to be contradictory. Unless the document has been validated
(and so some element is known not to have mixed content) all white space
is significant. But this passage describes building a data model from the
infoset, not from the PSVI, so the document has not been schema validated
at least, and I'm not sure whether the DM as currently written really
takes note of DTD validation.

The only occurrence of the word "significant" in the infoset document is

    White space within start-tags (other than significant white space in
    attribute values) and end-tags.

which clearly is irrelevant here.

In current XSLT1 applications more or less the only significant
incompatibility between implementations (barring bugs) is msxsl's
tendency to drop spaces. (If called from an API a more conforming
behaviour can be specified, but notably _not_ if called via the
xml-stylesheet PI.) This means that the (in most ways excellent) msxsl
implementation will render an XML fragment such as
<p><b>Bold</b> <span>words</span> <i>italic</i></p>
with the words run together if given an "identity transform" to HTML, as
it will decide that the inter-word spaces are insignificant. Arguably
this is conformant (if confusing) behaviour, as XSLT/XPath 1 said
essentially nothing about how the tree should be built. I believe that
in version 2 of the language the wording should be clarified so that
this unfortunate loss of interoperability (and usability) is clearly not
allowed without some specific user option that requests it.
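
To make the effect concrete, here is a minimal sketch in Python (using
the standard xml.etree.ElementTree module, not msxsl). It parses the
fragment above twice and, in one copy, discards every whitespace-only
text node. The helper names text_content and
strip_whitespace_only_text are hypothetical, and the stripping rule is
only my approximation of a space-dropping tree builder, not msxsl's
actual logic.

```python
import xml.etree.ElementTree as ET

SRC = "<p><b>Bold</b> <span>words</span> <i>italic</i></p>"


def text_content(elem):
    """Concatenate all character data below elem, in document order."""
    return "".join(elem.itertext())


def strip_whitespace_only_text(elem):
    """Discard text that consists solely of white space.

    Hypothetical approximation of a tree builder that treats such
    text nodes as "insignificant"; ElementTree keeps them by default.
    """
    for node in elem.iter():
        if node.text is not None and node.text.strip() == "":
            node.text = None
        if node.tail is not None and node.tail.strip() == "":
            node.tail = None


preserved = ET.fromstring(SRC)   # white space kept as parsed
stripped = ET.fromstring(SRC)    # same input, then stripped
strip_whitespace_only_text(stripped)

print(repr(text_content(preserved)))  # 'Bold words italic'
print(repr(text_content(stripped)))   # 'Boldwordsitalic'
```

Without a DTD or schema nothing in the input says that <p> lacks mixed
content, so the second result silently changes the text of the document.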

I fear that the wording in 6.7.3 was intended to authorise the dropping
of the inter-word spaces in my <p> example. It fails to do that, as
it refers to a term, "insignificant white space", that is apparently
undefined; however, I believe that the comment should be deleted rather
than fixed. It is an unnecessary optional clause that undermines
interoperability. Systems storing documents in efficient database
storage forms can construct the data model instance in any way they
like; there is no need to allow systems that are parsing explicit XML
documents to have the same flexibility.

