Re: Whitespace normalization for union types

Consider:

 <xsd:simpleType name="fooType">
  <xsd:union memberTypes="xsd:string xsd:token"/>
 </xsd:simpleType>

 <xsd:simpleType name="fooSubType">
  <xsd:restriction base="fooType">
   <xsd:pattern value="[a-z]"/>
  </xsd:restriction>
 </xsd:simpleType>

 <xsd:element name="foo" type="fooSubType"/>

Wrt this instance:
<foo> a   </foo>

I think on balance Kasimier and Xan are _both_ right, and therefore
none of the processors are wrong.

Here's my reasoning:

[First, it has to be noted that the definition of Datatype Valid [1]
is broken -- it implies that if there's a *pattern* facet, the string
being checked need not be in the lexical space of the type!]

On the one hand the process of validation of a restricted union
could be understood in two steps -- checking the union, and then
enforcing the restriction.  This is because without checking the
union, we don't know what the string _is_, because the only way we get
a string to check is by using a type defn with a whitespace facet.

On this account (Kasimier's too, I guess) things go like this:

1) Check Datatype Valid for the pre-lexical form wrt each member of
   the union in turn:
     " a   " against xs:string -- whiteSpace is preserve, so
                                  lexical form is " a   ", which
                                  _is_ in the lexical space of
                                  xs:string, and the corresponding
                                  value is in the value space of
                                  xs:string, so we win

2) Check the facets of the union:
     " a   " against [a-z] -- fails

So, invalid.
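
A minimal Python sketch of this two-step reading, for illustration
only.  The names (MEMBERS, normalize, validate_two_step) are
hypothetical, the member types are modelled by their whiteSpace facet
alone, and re.fullmatch stands in for XSD pattern matching, which is
implicitly anchored at both ends:

```python
import re

# fooType's members in union order, modelled only by their whiteSpace facet.
MEMBERS = [("xs:string", "preserve"), ("xs:token", "collapse")]

def normalize(raw, white_space):
    """Apply the XSD whiteSpace facet: preserve, replace or collapse."""
    if white_space == "preserve":
        return raw
    replaced = re.sub(r"[\t\n\r]", " ", raw)
    if white_space == "replace":
        return replaced
    return re.sub(r" +", " ", replaced).strip(" ")

def validate_two_step(raw, pattern):
    """Step 1 picks a member; step 2 checks the union's own facets."""
    # Step 1: the first member whose lexical space contains the normalized
    # string wins.  Simplification (an assumption of this sketch): every
    # normalized string is accepted by xs:string, so it is always chosen.
    name, white_space = MEMBERS[0]
    lexical = normalize(raw, white_space)
    # Step 2: the pattern is checked against that one lexical form only.
    return name, lexical, re.fullmatch(pattern, lexical) is not None

print(validate_two_step(" a   ", "[a-z]"))
# -> ('xs:string', ' a   ', False)
```

Because xs:string comes first and accepts anything, xs:token is never
consulted on this reading, and the pattern sees the unnormalized form.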

The alternative reading is that the facets on the union are
distributed into the member types of the union, in which case Xan's
analysis is correct and things go like this:

1) Check Datatype Valid for the pre-lexical form wrt each member of
   the union, plus the facets on the union itself, in turn:

1a)  " a   " against xs:string -- whiteSpace is preserve, so
                                  lexical form is " a   ", which
                                  _is_ in the lexical space of
                                  xs:string, and the corresponding
                                  value is in the value space of
                                  xs:string, so we check the facets
                                  check  " a   " against [a-z] -- fails
1b)  " a   " against xs:token  -- whiteSpace is collapse, so
                                  lexical form is "a", which
                                  _is_ in the lexical space of
                                  xs:token, and the corresponding
                                  value is in the value space of
                                  xs:token, so we check the facets
                                  check  "a" against [a-z] -- succeeds

So, valid.

I don't believe it's actually at all clear which is correct.
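
A minimal Python sketch of the distributed reading (Xan's), for
illustration only.  The names (MEMBERS, normalize,
validate_distributed) are hypothetical, the member types are modelled
by their whiteSpace facet alone, and re.fullmatch stands in for XSD
pattern matching:

```python
import re

# fooType's members in union order, modelled only by their whiteSpace facet.
MEMBERS = [("xs:string", "preserve"), ("xs:token", "collapse")]

def normalize(raw, white_space):
    """Apply the XSD whiteSpace facet: preserve, replace or collapse."""
    if white_space == "preserve":
        return raw
    replaced = re.sub(r"[\t\n\r]", " ", raw)
    if white_space == "replace":
        return replaced
    return re.sub(r" +", " ", replaced).strip(" ")

def validate_distributed(raw, pattern):
    """Try each member with the union's pattern pushed down onto it."""
    for name, white_space in MEMBERS:
        lexical = normalize(raw, white_space)
        # The pattern is applied to this member's own lexical form.
        if re.fullmatch(pattern, lexical):
            return name, lexical
    return None  # no member accepts the string together with the facets

print(validate_distributed(" a   ", "[a-z]"))
# -> ('xs:token', 'a')
```

Here the failure against xs:string is not final: the loop moves on to
xs:token, whose collapsed form "a" satisfies the pattern.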

This actually interacts with an existing issue, regarding the
semantics of a type allowed as the type of e.g. an attribute as part
of a complex type derived by restriction from a base type with a
restricted union for that attribute (whew!) -- example:

 <xs:complexType name="base">
  <xs:attribute name="foo" type="fooSubType"/>
 </xs:complexType>

 <xs:complexType name="restr">
  <xs:complexContent>
   <xs:restriction base="base">
    <xs:attribute name="foo" type="xs:token"/>
   </xs:restriction>
  </xs:complexContent>
 </xs:complexType>

Currently this is a) allowed but b) means that the restricted type
allows _more_ than the base type, which is not supposed to happen.
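
To make b) concrete, here is a rough Python sketch.  The function
names and the sample value "hello" are hypothetical, xs:token is
simplified to "accepts any string once collapsed", and fooSubType is
given its most permissive reading (the pattern tried against both the
preserved and the collapsed form):

```python
import re

def collapse(raw):
    """whiteSpace = collapse: fold whitespace runs, trim both ends."""
    return re.sub(r"[ \t\n\r]+", " ", raw).strip(" ")

def accepts_token(raw):
    # Simplification: every string is a valid xs:token after collapsing.
    return True

def accepts_foo_sub_type(raw):
    # Most permissive reading: [a-z] may match either lexical form.
    return any(re.fullmatch("[a-z]", form) for form in (raw, collapse(raw)))

for raw in ("a", "hello"):
    print(repr(raw), accepts_foo_sub_type(raw), accepts_token(raw))
# 'a' True True
# 'hello' False True
```

Even on the reading most generous to the base type, "hello" is
rejected for fooSubType but accepted for xs:token.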

We should probably solve both these problems together (and the latter
issue suggests we'll go in Xan's direction, that is, we'll push the
facets down onto all the member types. . .)

ht

[1] http://www.w3.org/TR/xmlschema-2/#defn-validation-rules
-- 
 Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
                     Half-time member of W3C Team
    2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                   URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]

Received on Thursday, 2 June 2005 12:10:12 UTC