RE: normalizedString and its subtypes from Kay, Michael on 2002-07-18 (www-xml-schema-comments@w3.org from July to September 2002)

From: Kay, Michael <Michael.Kay@softwareag.com>
Date: Thu, 18 Jul 2002 18:15:16 +0200
To: ht@cogsci.ed.ac.uk, "Kay, Michael" <Michael.Kay@softwareag.com>
Cc: www-xml-schema-comments@w3.org
Message-ID: <DFF2AC9E3583D511A21F0008C7E621060453D9EA@daemsg02.software-ag.de>
Thanks for the response. I had missed the fact that the lexical space
represents the value after applying the whiteSpace normalization. This is a
useful insight that I think we need to take account of in defining the casts
and constructors for XQuery and XPath: these are currently defined to
require a value from the lexical space as input.

Michael Kay

> -----Original Message-----
> From: ht@cogsci.ed.ac.uk [mailto:ht@cogsci.ed.ac.uk] 
> Sent: 18 July 2002 15:41
> To: Kay, Michael
> Cc: www-xml-schema-comments@w3.org
> Subject: Re: normalizedString and its subtypes
> 
> 
> "Kay, Michael" <Michael.Kay@softwareag.com> writes:
> 
> > I am confused by the definitions of the built-in types 
> > normalizedString and its subtypes, in Schema Part 2.
> > 
> > (1). The value space of normalizedString allows all characters except
> > xD, xA, and x9. The lexical space allows all characters except xD and
> > x9. What is the mapping from the lexical space to the value space:
> > what happens to an xA character in the lexical space (is it removed?
> > replaced by an x20?). The canonical lexical representation,
> > presumably, is the same as the string in the value space: I think we
> > should be told.
> 
> The mapping from the lexical to the value space is 1-to-1 (I 
> think), so I think this is in fact a bug.  The builtin 
> derived type normalizedString is defined as having the value 
> 'replace' for its whiteSpace facet, which in turn means that 
> all strings offered for validation as normalizedStrings will 
> have had "[a]ll occurrences of #x9 (tab), #xA (line feed) and 
> #xD (carriage return) . . . replaced with #x20 (space)" [1].
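
As an illustration of the whiteSpace facet described above, here is a
minimal Python sketch of the three facet values defined in Schema Part 2
(preserve, replace, collapse). The function name apply_whitespace_facet
is hypothetical and only for illustration; this is not taken from any
schema processor.

```python
import re


def apply_whitespace_facet(raw: str, facet: str) -> str:
    """Normalize a string according to an XML Schema whiteSpace facet value.

    Hypothetical helper for illustration only.
    """
    if facet == "preserve":
        # preserve: no normalization at all
        return raw
    if facet not in ("replace", "collapse"):
        raise ValueError(f"unknown whiteSpace facet: {facet!r}")
    # replace: every #x9 (tab), #xA (line feed) and #xD (carriage return)
    # becomes #x20 (space)
    replaced = raw.translate({0x9: " ", 0xA: " ", 0xD: " "})
    if facet == "replace":
        return replaced
    # collapse: after the replace step, squeeze runs of spaces to one
    # space and remove leading and trailing spaces
    return re.sub(" +", " ", replaced).strip(" ")
```

With facet "replace" (the value used by normalizedString), a string
offered for validation never contains tab, line feed or carriage return
by the time it is checked against the type.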
> 
> > Presumably the lexical space represents the value after the XML parser
> > has done its normalization.
> 
> No, after that _and_ the _further_ normalization specified by 
> its whiteSpace facet.
> 
> > So in practice, a tab character is allowed in an
> > attribute of type normalizedString (because the XML parser will turn
> > it to a space), but a tab character is not allowed in an element of
> > type normalizedString (because the XML parser will leave it
> > unchanged). Is this interpretation correct?
> 
> No, because the reference quoted above takes care of the 
> attribute/element difference.
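
The attribute/element point can be sketched in Python, assuming the
standard-library ElementTree parser stands in for "the XML parser" and a
hypothetical schema_replace helper stands in for the whiteSpace
'replace' step. A literal tab in an attribute value is already turned
into a space by the parser, while one in element content is not; after
the schema-level replacement both end up the same.

```python
import xml.etree.ElementTree as ET

# Characters affected by whiteSpace="replace": tab, line feed, carriage return
WS_REPLACE = {0x9: " ", 0xA: " ", 0xD: " "}


def schema_replace(text: str) -> str:
    """Hypothetical stand-in for the schema-level 'replace' normalization."""
    return text.translate(WS_REPLACE)


# The same string, containing a literal tab, as attribute and as content
doc = ET.fromstring('<e a="x\ty">x\ty</e>')

attr_after_parser = doc.attrib["a"]  # parser has normalized the tab
elem_after_parser = doc.text         # parser leaves the tab in place

# The schema-level step removes the remaining difference
attr_value = schema_replace(attr_after_parser)
elem_value = schema_replace(elem_after_parser)
```

Under these assumptions the tab is acceptable in both places, since the
check against normalizedString only happens after the second step.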
> 
> > I find it hard to understand why the lexical space doesn't allow any
> > string, with a mapping to the value space achieved by normalizing
> > whitespace characters. Alternatively, the lexical space should be
> > identical to the value space. The current definition seems
> > nonsensical.
> 
> I agree there's a bug.  I believe the 2nd alternative is 
> correct. There is a residual problem here to do with the 
> attempt to make the Datatypes REC usable independent of the 
> Structures REC, and the WG probably needs to step up to some 
> clarification here.
> 
> > (2). The type "token" ("tokens" would have been a better name) says
> > that the value space allows all characters except xA or x9. But since
> > it is a restriction of normalizedString, it actually appears to allow
> > all characters except xA, xD, or x9. If the restriction is going to be
> > restated here, it should be restated in full.
> 
> There's an erratum pending [2] which will say precisely this.
> 
> > (3). The three subtypes of "token" do not allow any whitespace
> > characters in the value.  Why is there no supertype for these ("token"
> > would have been a good name) that allows any string containing no
> > whitespace characters? I would have thought this type would be vastly
> > more useful than most of the other built-in subtypes of string.
> 
> Good idea -- perhaps we'll add this in 1.1
> 
> ht
> 
> [1] http://www.w3.org/TR/xmlschema-1/#section-White-Space-Normalization-during-Validation
> [2] http://www.w3.org/2001/05/xmlschema-rec-comments.html/#pfitoken
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2002, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/
 [mail really from me _always_ has this .sig -- mail without it is forged spam]
Received on Thursday, 18 July 2002 12:15:25 UTC