Re: normalizedString and its subtypes from Henry S. Thompson on 2002-07-18 (www-xml-schema-comments@w3.org from July to September 2002)

From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
Date: 18 Jul 2002 15:40:57 +0100
To: "Kay, Michael" <Michael.Kay@softwareag.com>
Cc: www-xml-schema-comments@w3.org
Message-ID: <f5by9c93xqe.fsf@cogsci.ed.ac.uk>

"Kay, Michael" <Michael.Kay@softwareag.com> writes:

> I am confused by the definitions of the built-in types normalizedString and
> its subtypes, in Schema Part 2.
> 
> (1). The value space of normalizedString allows all characters except xD,
> xA, and x9. The lexical space allows all characters except xD and x9. What
> is the mapping from the lexical space to the value space: what happens to an
> xA character in the lexical space (is it removed? replaced by an x20?). The
> canonical lexical representation, presumably, is the same as the string in
> the value space: I think we should be told.

The mapping from the lexical to the value space is 1-to-1 (I think),
so this is in fact a bug.  The built-in derived type
normalizedString is defined as having the value 'replace' for its
whiteSpace facet, which in turn means that all strings offered for
validation as normalizedStrings will have had "[a]ll occurrences of
#x9 (tab), #xA (line feed) and #xD (carriage return) . . . replaced
with #x20 (space)" [1].
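
As a rough illustration of the whiteSpace facet described in [1], here is a
minimal Python sketch. The function name apply_whitespace_facet is
hypothetical and not part of any schema processor; it only mirrors the
'preserve', 'replace' and 'collapse' values of the facet.

```python
import re


def apply_whitespace_facet(value: str, mode: str) -> str:
    """Normalize a string the way the whiteSpace facet describes.

    Hypothetical helper for illustration only.
    """
    if mode == "preserve":
        return value
    if mode not in ("replace", "collapse"):
        raise ValueError(f"unknown whiteSpace value: {mode!r}")
    # 'replace': every #x9, #xA and #xD becomes a single #x20.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if mode == "replace":
        return replaced
    # 'collapse': after 'replace', squeeze runs of #x20 and trim both ends.
    return re.sub(r" +", " ", replaced).strip(" ")


normalized = apply_whitespace_facet("a\tb\nc\rd", "replace")
collapsed = apply_whitespace_facet("  a \t b ", "collapse")
```

With 'replace' (the value used by normalizedString) no #x9, #xA or #xD can
survive to the point where the string is checked against the type.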

> Presumably the lexical space represents the value after the XML parser has
> done its normalization. 

No, after that _and_ the _further_ normalization specified by its
whiteSpace facet.

> So in practice, a tab character is allowed in an
> attribute of type normalizedString (because the XML parser will turn it to a
> space), but a tab character is not allowed in an element of type
> normalizedString (because the XML parser will leave it unchanged). Is this
> interpretation correct?

No, because the reference quoted above takes care of the
attribute/element difference.
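
To make the attribute/element point concrete, here is a small Python
sketch using the standard library's xml.etree.ElementTree. It assumes the
usual expat-based parser, which applies XML 1.0 attribute-value
normalization; the element name "item", the attribute name "code" and the
helper replace_ws are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same tab-containing string appears once as an attribute value and
# once as element content ("\t" below is a literal tab in the XML text).
doc = '<item code="a\tb">a\tb</item>'
elem = ET.fromstring(doc)

attr_after_parse = elem.get("code")   # parser has already normalized the tab
text_after_parse = elem.text          # parser leaves the tab untouched


def replace_ws(value: str) -> str:
    """Hypothetical stand-in for whiteSpace='replace' applied during validation."""
    return value.translate(str.maketrans("\t\n\r", "   "))


attr_for_validation = replace_ws(attr_after_parse)
text_for_validation = replace_ws(text_after_parse)
```

After the further 'replace' step both strings are identical, so the tab is
acceptable in either position.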

> I find it hard to understand why the lexical space doesn't allow any string,
> with a mapping to the value space achieved by normalizing whitespace
> characters. Alternatively, the lexical space should be identical to the
> value space. The current definition seems nonsensical.

I agree there's a bug.  I believe the 2nd alternative is correct.
There is a residual problem here to do with the attempt to make the
Datatypes REC usable independent of the Structures REC, and the WG
probably needs to step up to some clarification here.

> (2). The type "token" ("tokens" would have been a better name) says that the
> value space allows all characters except xA or x9. But since it is a
> restriction of normalizedString, it actually appears to allow all characters
> except xA, xD, or x9. If the restriction is going to be restated here, it
> should be restated in full.

There's an erratum pending [2] which will say precisely this.
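
For illustration, the fully restated constraint might be sketched in Python
as below. The function name in_token_value_space is hypothetical, and the
extra conditions on leading, trailing and doubled spaces follow my reading
of the definition of token in Part 2 rather than the erratum text itself.

```python
def in_token_value_space(value: str) -> bool:
    """Check a string against the restated 'token' value space (sketch)."""
    # Inherited from normalizedString: no #xD, #xA or #x9 at all.
    if any(ch in value for ch in ("\r", "\n", "\t")):
        return False
    # Added by token: no leading or trailing #x20 ...
    if value.startswith(" ") or value.endswith(" "):
        return False
    # ... and no internal run of two or more #x20.
    return "  " not in value


ok = in_token_value_space("a b c")
has_cr = in_token_value_space("a\rb")
```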

> (3). The three subtypes of "token" do not allow any whitespace characters in
> the value.  Why is there no supertype for these ("token" would have been a
> good name) that allows any string containing no whitespace characters? I
> would have thought this type would be vastly more useful than most of the
> other built-in subtypes of string.

Good idea -- perhaps we'll add this in 1.1

ht

[1] http://www.w3.org/TR/xmlschema-1/#section-White-Space-Normalization-during-Validation
[2] http://www.w3.org/2001/05/xmlschema-rec-comments.html/#pfitoken
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2002, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/
 [mail really from me _always_ has this .sig -- mail without it is forged spam]
Received on Thursday, 18 July 2002 10:42:37 UTC