Re: Whitespace facet and validation from Michael Kay on 2010-11-17 (xmlschema-dev@w3.org from November 2010)

From: Michael Kay <mike@saxonica.com>
Date: Wed, 17 Nov 2010 09:27:50 +0000
To: "Thomson, Martin" <Martin.Thomson@andrew.com>
CC: "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
Message-ID: <4CE3A016.8060705@saxonica.com>

Yes, this is a complicated area, and it's not very clearly described in 
the specs.

The processing model is as follows:

(1) XML processing normalizes attribute values and expands entity and 
character references. You're talking about an element here so I'm not 
sure why you cited attribute normalization; but either way, entity and 
character references are expanded before schema processing starts. The 
output of this stage is an infoset, which acts as the input to the 
schema processor.

(2) The schema processor preprocesses whitespace in the element or 
attribute according to the whitespace facet for the relevant type. For 
xs:token and most other data types, this has the value "collapse", which 
removes leading and trailing whitespace, and reduces each internal run 
of whitespace to a single space (x20) character.

(3) The value after whitespace preprocessing must be in the lexical 
space of the data type. In the case of xs:token (a data type whose name 
is singularly inappropriate and misleading, by the way), the definition 
of the lexical space is such that the result of whitespace preprocessing 
will always be a value in the lexical space.

(4) This lexical value is then converted to a value in the value space 
of the data type. For xs:token and other subtypes of xs:string, this is 
essentially a null operation.
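
As an illustration of step (2), here is a minimal Python sketch of the 
three whitespace facet values as I read them in XSD Part 2. The function 
name apply_whitespace_facet is made up for this example and is not taken 
from any schema processor; treat it as a sketch, not as how any 
particular implementation does it.

```python
import re


def apply_whitespace_facet(value: str, mode: str = "collapse") -> str:
    """Hypothetical helper: apply an XSD whiteSpace facet value to a string.

    Only #x9, #xA, #xD and #x20 count as whitespace here, which is
    narrower than Python's own notion of whitespace.
    """
    if mode == "preserve":
        # No normalization at all.
        return value
    # "replace": every tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if mode == "replace":
        return replaced
    if mode == "collapse":
        # "collapse": do "replace", then squeeze runs of spaces to one
        # space and drop leading and trailing spaces.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet value: {mode!r}")


print(repr(apply_whitespace_facet(" a\n")))                # 'a'
print(repr(apply_whitespace_facet("a\t b", "replace")))    # 'a  b'
print(repr(apply_whitespace_facet(" a\n", "preserve")))    # ' a\n'
```

Note that "replace" keeps the length of the string unchanged, whereas 
"collapse" can shorten it; that is why xs:normalizedString and xs:token 
behave differently on the same input.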

So the answer to your question is that when you use xs:token in a 
schema, values encountered in instance documents will always be valid 
regardless of what whitespace they contain, and regardless of whether it 
is written using entity or character references. The value space does 
not allow certain combinations of whitespace, but this will never cause 
validation errors, because the whitespace facet ensures that such 
sequences will never occur after whitespace preprocessing.
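
To make that concrete for the "<x>&#x20;a&#xa;</x>" example from the 
original question, here is a rough Python sketch. It uses the standard 
library's xml.etree.ElementTree purely as a stand-in for the XML parsing 
stage; it is not a schema processor, and the helpers collapse and 
in_token_value_space are hypothetical names written for this example.

```python
import re
import xml.etree.ElementTree as ET

# XSD whitespace characters only: space, tab, line feed, carriage return.
XSD_WS_RUN = re.compile(r"[ \t\n\r]+")


def collapse(value: str) -> str:
    """Hypothetical helper: the effect of whiteSpace="collapse"."""
    return XSD_WS_RUN.sub(" ", value).strip(" ")


def in_token_value_space(value: str) -> bool:
    """Hypothetical check of the xs:token constraints: no tab, LF or CR,
    no leading or trailing space, no two consecutive spaces."""
    return (
        re.search(r"[\t\n\r]", value) is None
        and "  " not in value
        and not value.startswith(" ")
        and not value.endswith(" ")
    )


# Stage 1: the XML parser expands the character references, so the
# text handed on to later stages already contains the raw characters.
infoset_text = ET.fromstring("<x>&#x20;a&#xa;</x>").text

# Stage 2: whitespace preprocessing, as a schema processor would do
# for a type whose whitespace facet is "collapse".
collapsed = collapse(infoset_text)

print(repr(infoset_text))   # ' a\n'
print(repr(collapsed))      # 'a'
print(in_token_value_space(infoset_text), in_token_value_space(collapsed))
# False True
```

The parser output still contains the leading space and the line feed, 
which on its own would not be in the value space, but the collapsed 
string always is, so the element validates.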

Michael Kay
Saxonica

On 17/11/2010 00:27, Thomson, Martin wrote:
> I know that whitespace is an area that is filled with pitfalls, but I have been asked a very simple question and I don't know what the right answer is.
>
>    Does the value space for xs:token include U+A, U+9, U+D, or consecutive U+20 characters?
>
> Superficially at least, [1] seems to say no.  But [1] refers to [2] in its definition.  The text and examples in [2] clearly show that the product of normalization can include (at least some of) these characters by using character references (&#---;) as a form of escaping.
>
> Reading this, I had assumed that "<x>&#x20;a&#xa;</x>" would produce PSVI containing " a\n" even if the whitespace facet is set to "collapse".
>
> After being reminded that schema validation works on the infoset, I discovered remnants of whitespace handling in [3].  I find that description somewhat puzzling, but I assume that in referring to xml:space [4] it defers whitespace processing to the "application".  The description of xml:space="default" could be read in any number of ways [5].
>
> [4] requires that a validating processor pass information on whitespace to applications.  That, along with [2], could suggest that the answer to the question was originally intended to be 'yes'.
>
> On the other hand, I've not found an implementation that either passes whitespace information or passes any of the above characters to an application.  So the de facto answer appears to be 'no'.
>
> I'm seeking some clarification on this question.  What is intended [6]?
>
> Cheers,
> Martin
>
> [1] http://www.w3.org/TR/xmlschema11-2/#rf-whiteSpace
> [2] http://www.w3.org/TR/xml/#AVNormalize
> [3] http://www.w3.org/TR/xml-infoset/#infoitem.character
> [4] http://www.w3.org/TR/xml/#sec-white-space
> [5] XML and XML Infoset define different whitespace handling for attribute values and element content.  Fun.
> [6] From a purely pragmatic standpoint, expecting behaviour other than what existing implementations exhibit would be foolish.  But that doesn't stop me from thinking that this could somehow be useful.  The attribute-value normalization behaviour is complex, but being able to represent any character is a useful characteristic.
Received on Wednesday, 17 November 2010 09:28:18 UTC