[DM] white space

The doc() function in F&O (and, indirectly, the document() function in
XSLT) specifies that if the representation of a resource returned from
some URI is an XML file then the input tree should be constructed as
specified in DM, modulo some specific implementation-dependent features
such as which URI schemes are supported.

In DM it says:

  6.7.3 Construction from an Infoset

  Applications may construct text nodes in the data model to represent
  insignificant white space. This decision is considered outside the scope
  of the data model, consequently the data model makes no attempt to
  control or identify if any or all insignificant white space is ignored 


This appears to be contradictory. Unless the document has been validated
(and so some element is known not to have mixed content) all white space
is significant.  But this section describes building a data model from the
infoset, not from the PSVI, so the document has not been schema validated
at least, and I'm not sure that the DM, as currently written, really takes
note of DTD validation.


The only occurrence of the word "significant" in the infoset document is

    White space within start-tags (other than significant white space in
    attribute values) and end-tags.

which clearly is irrelevant here.


In current XSLT 1 applications, more or less the only significant
incompatibility between implementations (barring bugs) is msxsl's
tendency to drop spaces. (If called from an API a more conforming
behaviour can be specified, but notably _not_ if called via the
xml-stylesheet PI.) This means that the (in most ways excellent) msxsl
implementation will render an XML fragment such as
<p><b>Bold</b> <span>words</span> <i>italic</i></p>
as
Boldwordsitalic
if given an "identity transform" to HTML, as it will decide that the
inter-word spaces are insignificant. Arguably this is conformant (if
confusing) behaviour, as XSLT/XPath 1 said essentially nothing about how the
tree should be built. I believe that in version 2 of the language the
wording should be clarified so that this unfortunate loss
of interoperability (and usability) is clearly not allowed without some
specific user option that requests it.
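
To make the difference concrete, here is a minimal sketch in Python (not
XSLT, and not msxsl itself) using the standard xml.dom.minidom parser. The
helper names text_of and strip_whitespace_only_text are my own, and the
stripping step is only my assumption of what "deciding the spaces are
insignificant" amounts to: removing every whitespace-only text node.

```python
from xml.dom import minidom

SRC = "<p><b>Bold</b> <span>words</span> <i>italic</i></p>"


def text_of(node):
    """Concatenate all descendant text nodes in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def strip_whitespace_only_text(node):
    """Remove whitespace-only text nodes in place (hypothetical stripping rule)."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and child.data.strip() == "":
            node.removeChild(child)
        else:
            strip_whitespace_only_text(child)


p = minidom.parseString(SRC).documentElement

# Tree built with every text node kept, as a non-validating parse gives it.
preserved = text_of(p)

# Same tree after dropping the whitespace-only text nodes between elements.
strip_whitespace_only_text(p)
stripped = text_of(p)

print(repr(preserved))  # 'Bold words italic'
print(repr(stripped))   # 'Boldwordsitalic'
```

The two single-space text nodes between the inline elements are the only
thing separating the words, so a tree builder that discards them changes
the rendered text.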


I fear that the wording in 6.7.3 was intended to authorise the dropping
of the inter-word spaces in my <p> example. It fails to do that, as
it refers to a term, "insignificant white space", that is apparently
undefined. However, I believe that the comment should be deleted rather
than fixed. It is an unnecessary optional clause that hinders
interoperability. Systems storing documents in efficient database
storage forms can construct the data model instance in any way they
like; there is no need to allow systems that are parsing explicit XML
documents to have the same flexibility.


There is some discussion of this on xml-dev:

http://lists.xml.org/archives/xml-dev/200307/msg00148.html

(and any number of posts on xsl-list where users have fallen into this
trap and asked where their spaces went, or why some node count that
went 1,2,3 on msxsl goes 2,4,6 on every other processor).
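
The 1,2,3 versus 2,4,6 effect can be reproduced with a small sketch, again
in Python with xml.dom.minidom rather than an XSLT processor. The <list>
document and the element_positions helper are hypothetical illustrations
of my own; the count is the position of each element among all the child
nodes of its parent, which is my reading of the kind of node count users
report.

```python
from xml.dom import minidom

# An indented document: each <item/> is preceded by a whitespace-only text node.
SRC = "<list>\n  <item/>\n  <item/>\n  <item/>\n</list>"


def element_positions(parent, keep_whitespace):
    """Return the 1-based position of each element among the parent's children.

    When keep_whitespace is False, whitespace-only text nodes are dropped
    before counting, mimicking a tree builder that discards them.
    """
    children = [
        child
        for child in parent.childNodes
        if keep_whitespace
        or not (child.nodeType == child.TEXT_NODE and child.data.strip() == "")
    ]
    return [
        index
        for index, child in enumerate(children, start=1)
        if child.nodeType == child.ELEMENT_NODE
    ]


root = minidom.parseString(SRC).documentElement

with_ws = element_positions(root, keep_whitespace=True)
without_ws = element_positions(root, keep_whitespace=False)

print(with_ws)     # [2, 4, 6]
print(without_ws)  # [1, 2, 3]
```

The same source document gives different positions depending only on how
the tree was built, which is why a stylesheet that counts nodes behaves
differently from one processor to the next.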


David


Received on Friday, 5 December 2003 11:30:48 UTC