- From: Michael Kay <mike@saxonica.com>
- Date: Wed, 17 Nov 2010 09:27:50 +0000
- To: "Thomson, Martin" <Martin.Thomson@andrew.com>
- CC: "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
Yes, this is a complicated area, and it's not very clearly described in the specs. The processing model is as follows:

(1) XML processing normalizes attribute values and expands entity and character references. You're talking about an element here so I'm not sure why you cited attribute normalization; but either way, entity and character references are expanded before schema processing starts. The output of this stage is an infoset, which acts as the input to the schema processor.

(2) The schema processor preprocesses whitespace in the element or attribute according to the whitespace facet for the relevant type. For xs:token and most other data types, this has the value "collapse", which removes leading and trailing whitespace, and reduces each internal run of whitespace to a single space (x20) character.

(3) The value after whitespace preprocessing must be in the lexical space of the data type. In the case of xs:token (a data type whose name is singularly inappropriate and misleading, by the way), the definition of the lexical space is such that the result of whitespace preprocessing will always be a value in the lexical space.

(4) This lexical value is then converted to a value in the value space of the data type. For xs:token and other subtypes of xs:string, this is essentially a null operation.

So the answer to your question is that when you use xs:token in a schema, values encountered in instance documents will always be valid regardless of what whitespace they contain, and regardless of whether it is written using entity or character references. The value space does not allow certain combinations of whitespace, but this will never cause validation errors, because the whitespace facet ensures that such sequences will never occur after whitespace preprocessing.

Michael Kay
Saxonica

On 17/11/2010 00:27, Thomson, Martin wrote:
> I know that whitespace is an area that is filled with pitfalls, but I have been asked a very simple question and I don't know what the right answer is.
>
> Does the value space for xs:token include U+A, U+9, U+D, or consecutive U+20 characters?
>
> Superficially at least, [1] seems to say no. But [1] refers to [2] in its definition. The text and examples in [2] clearly show that the product of normalization can include (at least some of) these characters by using character references (&#---;) as a form of escaping.
>
> Reading this, I had assumed that "<x> a&#xA;</x>" would produce PSVI containing " a\n" even if the whitespace facet is set to "collapse".
>
> After being reminded that schema validation works on the infoset, I discovered remnants of whitespace handling in [3]. I find that description somewhat puzzling, but I assume that in referring to xml:space [4] it defers whitespace processing to the "application". The description of xml:space="default" could be read in any number of ways [5].
>
> [4] requires that a validating processor pass information on whitespace to applications. That, along with [2], could suggest that the answer to the question was originally intended to be 'yes'.
>
> On the other hand, I've not found an implementation that either passes whitespace information or passes any of the above characters to an application. So the de facto answer appears to be 'no'.
>
> I'm seeking some clarification on this question. What is intended [6]?
>
> Cheers,
> Martin
>
> [1] http://www.w3.org/TR/xmlschema11-2/#rf-whiteSpace
> [2] http://www.w3.org/TR/xml/#AVNormalize
> [3] http://www.w3.org/TR/xml-infoset/#infoitem.character
> [4] http://www.w3.org/TR/xml/#sec-white-space
> [5] XML and XML Infoset define different whitespace handling for attribute values and element content. Fun.
> [6] From a purely pragmatic standpoint, expecting behaviour other than what existing implementations exhibit would be foolish. But that doesn't stop me from thinking that this could somehow be useful. The attribute-value normalization behaviour is complex, but being able to represent any character is a useful characteristic.
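
For readers who want to try steps (1) to (3) of the reply, here is a minimal Python sketch. It is not a schema processor: Python's standard XML parser stands in for step (1), and the helper names `apply_whitespace_facet` and `in_token_lexical_space` are hypothetical, written for this illustration from the whitespace-facet rules described above. The sample input is an element whose content ends in a newline written as a character reference.

```python
import re
import xml.etree.ElementTree as ET


def apply_whitespace_facet(value: str, facet: str = "collapse") -> str:
    """Hypothetical helper: whitespace preprocessing for one facet value."""
    if facet == "preserve":
        return value
    # "replace": every tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # "collapse": after "replace", squeeze runs of spaces to one space
        # and drop leading and trailing spaces.
        return re.sub(r" {2,}", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet value: {facet!r}")


def in_token_lexical_space(value: str) -> bool:
    """Hypothetical check: no tab/LF/CR, no edge spaces, no double spaces."""
    return (
        re.search(r"[\t\n\r]", value) is None
        and not value.startswith(" ")
        and not value.endswith(" ")
        and "  " not in value
    )


# Step (1): the XML parser expands the character reference before any
# schema-level processing sees the value.
raw = ET.fromstring("<x> a&#xA;</x>").text

# Step (2): whitespace preprocessing with the facet value used by xs:token.
collapsed = apply_whitespace_facet(raw, "collapse")

print(repr(raw))        # the value as it appears in the parsed document
print(repr(collapsed))  # the value after "collapse"

# Step (3): the collapsed value is always in the lexical space.
print(in_token_lexical_space(collapsed))
```

Under these assumptions the parser hands over the leading space and the newline intact, and "collapse" then removes both, which matches the conclusion in the reply that xs:token content never fails validation because of its whitespace.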
Received on Wednesday, 17 November 2010 09:28:18 UTC