Re: Surprising tokens! from Henry S. Thompson on 2001-10-22 (xmlschema-dev@w3.org from October 2001)

From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
Date: 22 Oct 2001 11:21:01 +0100
To: Eric van der Vlist <vdv@dyomedea.com>
Cc: xmlschema-dev@w3.org
Message-ID: <f5bsncckmb6.fsf@cogsci.ed.ac.uk>
Eric van der Vlist <vdv@dyomedea.com> writes:

> I find the definition of the token datatype highly confusing:
> 
> http://www.w3.org/TR/xmlschema-2/#token
> 
> [Definition:]   token represents tokenized strings. The ·value space· of token
> is the set of strings that do not contain the line feed (#xA) nor tab (#x9)
> characters, that have no leading or trailing spaces (#x20) and that have no
> internal sequences of two or more spaces. The ·lexical space· of token is the
> set of strings that do not contain the line feed (#xA) nor tab (#x9)
> characters, that have no leading or trailing spaces (#x20) and that have no
> internal sequences of two or more spaces.
> 
> 
> and
> 
> <xs:simpleType name="token" id="token">
>   <xs:annotation>
>    <xs:documentation
>          source="http://www.w3.org/TR/xmlschema-2/#token"/>
>   </xs:annotation>
>   <xs:restriction base="xs:normalizedString">
>    <xs:whiteSpace value="collapse" id="token.whiteSpace"/>
>   </xs:restriction>
> </xs:simpleType>
> 
> What's the point of mentioning that "the ·value space· of token is
> the set of strings that do not contain the line feed (#xA) nor tab
> (#x9) characters, that have no leading or trailing spaces (#x20) and
> that have no internal sequences of two or more spaces" since
> xs:token has a whitespace behavior set to "collapse" which means
> that #xA, #x9 (and also #xD) will have been replaced by #x20,
> that leading and trailing spaces will have been trimmed and that any
> occurrence of more than a single #x20 will have been replaced by a
> single #x20?

Um, I agree that the nature of the lexical and value spaces follows
from the definition of 'collapse', but we still need to say what they
are for completeness.

> Then, do we really want to give the same constraint on the lexical
> space?

Yes, because whitespace processing applies to _create_ the [schema
normalized value], which is the string constrained by any constraints
on lexical value.
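
A minimal sketch of that processing step, in Python rather than any
schema processor's actual code. The function names replace_ws and
collapse_ws are made up for illustration; the behaviour follows the
description of 'replace' and 'collapse' given earlier in this thread.

```python
import re


def replace_ws(s: str) -> str:
    """whiteSpace="replace": each #x9, #xA and #xD becomes a single #x20."""
    return re.sub(r"[\t\n\r]", " ", s)


def collapse_ws(s: str) -> str:
    """whiteSpace="collapse": replace, squeeze runs of #x20, trim both ends."""
    s = replace_ws(s)
    s = re.sub(r" {2,}", " ", s)  # runs of spaces become one space
    return s.strip(" ")           # drop leading and trailing spaces


raw = "  a\tsurprising\n\n token "
normalized = collapse_ws(raw)
print(repr(normalized))  # 'a surprising token'
```

Constraints on the lexical value are then checked against the
normalized string, not against the raw one.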

> Why do we have a special treatment for #xD? If I read all this correctly,
> "t&#x20;&#xD;&#x20;oken" is a valid xs:token. Is this expected?

That's a bug/inconsistency, in that 'collapse' specifies that #xD
characters are gone too.
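
The mismatch can be sketched in Python. This is an illustration, not
normative: in_value_space_as_worded is a hypothetical literal reading of
the quoted prose definition, collapse_ws is a hypothetical rendering of
'collapse', and the sample assumes the second character reference in
the quoted example was meant to be a space.

```python
import re


def in_value_space_as_worded(s: str) -> bool:
    """Literal reading of the prose: no #xA, no #x9, no leading or
    trailing #x20, no run of two or more #x20. #xD is not mentioned."""
    return (
        "\n" not in s
        and "\t" not in s
        and not s.startswith(" ")
        and not s.endswith(" ")
        and "  " not in s
    )


def collapse_ws(s: str) -> str:
    """'collapse' as described in this thread: #x9, #xA and #xD become
    #x20, runs of #x20 are squeezed, both ends are trimmed."""
    s = re.sub(r"[\t\n\r]", " ", s)
    s = re.sub(r" {2,}", " ", s)
    return s.strip(" ")


sample = "t \r oken"  # t, space, carriage return, space, oken
print(in_value_space_as_worded(sample))  # True
print(repr(collapse_ws(sample)))         # 't oken'
```

The prose definition admits the string, yet 'collapse' would never
leave the #xD in place.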

> And, if we want to restrict the lexical value, wouldn't it have been possible to
> do it through a pattern? Or is this something that cannot be expressed by the
> pattern syntax?

Sure, but what would that buy you?  The point is that the constraints
are simply a reflection of the reality of 'collapse' processing.  Even
if you built an infoset 'by hand' and then validated it, whitespace
processing still happens.  The statement of the constraints is there
so downstream applications know what invariants they can count on.
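
For illustration only, here is roughly what such a pattern and such an
invariant look like in Python. The regular expression uses Python's re
syntax, not XML Schema pattern syntax; TOKEN_RE,
satisfies_token_invariants and words are hypothetical names; and
excluding #xD is an assumption based on what 'collapse' does rather
than on the wording of the definition.

```python
import re

# Hypothetical pattern: either empty, or whitespace-free chunks joined
# by exactly one space each.
TOKEN_RE = re.compile(r"(?:[^\t\n\r ]+(?: [^\t\n\r ]+)*)?")


def satisfies_token_invariants(s: str) -> bool:
    """True if s already looks like the result of 'collapse'."""
    return TOKEN_RE.fullmatch(s) is not None


def words(value: str) -> list:
    """Downstream use of the invariant: on a collapsed value, splitting
    on a single space never yields an empty piece."""
    return value.split(" ") if value else []


print(satisfies_token_invariants("a surprising token"))  # True
print(satisfies_token_invariants("a  surprising token"))  # False
print(words("a surprising token"))  # ['a', 'surprising', 'token']
```

A pattern like this only restates what whitespace processing already
guarantees, which is why it would add nothing to the type definition.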

> Finally, if the purpose of xs:token is to represent "tokenized strings",
> wouldn't it have been better named "xs:tokenized" or "xs:tokenizedString" to
> avoid the confusion with the "real" tokens (xs:NMTOKEN)?

Perhaps -- we didn't go for longer names than necessary, and I guess I 
really do think NMTOKENs are a subset of tokens in general.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/
Received on Monday, 22 October 2001 06:36:45 UTC