Whitespace facet and validation from Thomson, Martin on 2010-11-17 (xmlschema-dev@w3.org from November 2010)

From: Thomson, Martin <Martin.Thomson@andrew.com>
Date: Wed, 17 Nov 2010 08:27:41 +0800
To: "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
Message-ID: <8B0A9FCBB9832F43971E38010638454F03F33EA382@SISPE7MB1.commscope.com>

I know that whitespace is an area that is filled with pitfalls, but I have been asked a very simple question and I don't know what the right answer is.

  Does the value space for xs:token include U+A, U+9, U+D, or consecutive U+20 characters?

Superficially at least, [1] seems to say no.  But [1] refers to [2] in its definition.  The text and examples in [2] clearly show that the product of normalization can include (at least some of) these characters by using character references (&#---;) as a form of escaping.  
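The escaping behaviour described in [2] can be observed with a small sketch. It assumes Python's standard xml.etree.ElementTree module (a non-validating, expat-based parser) and a made-up attribute named "a"; neither comes from the original question. The attribute value contains one character reference to U+A and one literal line feed.

```python
import xml.etree.ElementTree as ET

# The attribute value holds a character reference (&#xa;) followed later by a
# literal line feed (the "\n" in this Python string becomes a real newline in
# the XML text handed to the parser).
doc = '<x a="p&#xa;q\nr"/>'

# Attribute-value normalization is applied by the XML parser itself, before
# any schema processing would see the value.
value = ET.fromstring(doc).get("a")
print(repr(value))
```

With this parser the literal line feed should come back as a space, while the character reference should survive as an actual U+A character, which is the "form of escaping" referred to above.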

Reading this, I had assumed that "<x>&#x20;a&#xa;</x>" would produce a PSVI containing " a\n" even if the whitespace facet is set to "collapse".
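For comparison, here is a rough sketch of the other reading, in which whiteSpace="collapse" is applied to the parsed character data regardless of how the characters were written in the document. The collapse() helper is hypothetical and only follows the prose description of the facet in [1]; xml.etree.ElementTree stands in for the XML parser and does no schema validation.

```python
import re
import xml.etree.ElementTree as ET


def collapse(value: str) -> str:
    """Hypothetical helper mimicking whiteSpace="collapse" as described in [1]."""
    # "replace" step: each tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    # "collapse" step: runs of spaces shrink to one, then leading and
    # trailing spaces are removed.
    return re.sub(r" +", " ", replaced).strip(" ")


# Character data as the parser reports it: the character references have
# already been resolved to U+20 and U+A.
raw = ET.fromstring("<x>&#x20;a&#xa;</x>").text
collapsed = collapse(raw)
print(repr(raw), repr(collapsed))
```

Under this reading the character references make no difference: the parser reports " a\n", and collapsing that leaves just "a".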

After being reminded that schema validation works on the infoset, I discovered remnants of whitespace handling in [3].  I find that description somewhat puzzling, but I assume that in referring to xml:space [4] it defers whitespace processing to the "application".  The description of xml:space="default" could be read in any number of ways [5].

[4] requires that a validating processor pass information on whitespace to applications.  That, along with [2], could suggest that the answer to the question was originally intended to be 'yes'.

On the other hand, I've not found an implementation that either passes whitespace information or passes any of the above characters to an application.  So the de facto answer appears to be 'no'.

I'm seeking some clarification on this question.  What is intended [6]?

Cheers,
Martin

[1] http://www.w3.org/TR/xmlschema11-2/#rf-whiteSpace

[2] http://www.w3.org/TR/xml/#AVNormalize

[3] http://www.w3.org/TR/xml-infoset/#infoitem.character

[4] http://www.w3.org/TR/xml/#sec-white-space

[5] XML and XML Infoset define different whitespace handling for attribute values and element content.  Fun.

[6] From a purely pragmatic standpoint, expecting behaviour other than what existing implementations exhibit would be foolish.  But that doesn't stop me from thinking that this could somehow be useful.  The attribute-value normalization behaviour is complex, but being able to represent any character is a useful characteristic.

Received on Wednesday, 17 November 2010 00:28:33 UTC