Re: white space in xsd:hexBinary from C. M. Sperberg-McQueen on 2012-01-17 (www-xml-schema-comments@w3.org from January to March 2012)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Mon, 16 Jan 2012 19:25:38 -0700
To: Henry Story <henry.story@bblfish.net>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, liam@w3.org, www-xml-schema-comments@w3.org
Message-Id: <1DE43497-5C21-4EAA-A082-48B732790514@blackmesatech.com>
On Jan 16, 2012, at 1:37 PM, Henry Story wrote:
> 
> Here is my question to your group. How is an xsd:hexBinary with a white space
> meant to be interpreted currently?

There are two ways to answer this simple question.

The simple answer is:  "an xsd:hexBinary with a white space" is a
contradiction in terms.  No literal with any whitespace is a member of
the lexical space of xsd:hexBinary.  An element or attribute of type
xsd:hexBinary, whose initial value has internal white space, is 
invalid.
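
For readers who prefer code to prose, the lexical space can be written
down as a regular expression: zero or more pairs of hex digits and
nothing else.  The following is a minimal sketch in Python, not text
from the XSD spec; the function name is made up for illustration.

```python
import re

# Lexical space of xsd:hexBinary: zero or more pairs of hex digits,
# with no white space anywhere in the literal.
HEX_BINARY = re.compile(r"(?:[0-9a-fA-F]{2})*")


def in_hexbinary_lexical_space(literal):
    """Hypothetical helper: True if the whole literal is a hexBinary form."""
    return HEX_BINARY.fullmatch(literal) is not None


print(in_hexbinary_lexical_space("0FB7"))    # True
print(in_hexbinary_lexical_space("0F B7"))   # False
print(in_hexbinary_lexical_space(" 0F B7"))  # False
```

Note that the empty string is accepted: it is the lexical form of the
zero-length octet sequence.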

A more complicated answer is necessary if you want to understand the
details -- in particular, if you want to understand what parts of the
journey from an input data stream to a validity judgement are defined
by the XSD spec and which are out of scope and unconstrained by XSD.

I'll try to explain this as simply and fully as I can, identifying the
various places at which the XSD spec leaves some room for differences
of behavior and at which a language lawyer can identify some wiggle
room that might allow you to achieve your goals.

Readers who find language lawyering, casuistry, and hairsplitting
tedious and irritating will want to drink some herb tea before
continuing.

For concreteness, let us consider the literal ' 0F B7'.

The details of what happens with this literal depend upon just where
and how the literal is encountered.  There are several cases to
consider: validation of an XML document using XSD or another schema
language that uses XSD datatypes, and validation of an isolated
literal outside the context of XML validation.

Case 1.  In the context of schema validation of an XML document using
an XSD schema, if ' 0F B7' is the initial value of an attribute or
element assigned the type xsd:hexBinary by the schema, then

  1 First the XML parser parses the XML document.  Let us imagine the
    XML document contains the element

       <cert:key cert:modulus="
          0F
          B7" 
          cert:exponent=" 65537"/>

    or the element

       <cert:key>
         <rdfs:label>made on 23 November 2011 on my
           laptop</rdfs:label>
         <cert:modulus> 0F <!--* hi, mom! *-->B7</cert:modulus>
         <cert:exponent> 65537 </cert:exponent>
       </cert:key>

    Then (to use the vocabulary of the XML Information Set spec) the
    string ' 0F B7' is the [normalized value] of the attribute
    information item for the cert:modulus attribute.  (Or rather,
    since I'm in language-lawyer mode:  it might be.  Whether the
    [normalized value] is ' 0F B7' or '           0F           B7'
    or '0F B7' depends on whether attribute cert:modulus is declared
    in the DTD, and how.)  

    And in the second form, the six characters ' ', '0', 'F', ' ',
    'B', and '7' are the sole character children of the cert:modulus
    element (the comment is not a character child), and the
    cert:modulus element has no element children.

  2 The validation software creates a representation of an information
    set to validate.  

    In the normal course of events, the infoset to be validated is the
    infoset generated by parsing the XML document we started with.
    But there is nothing to prohibit an XSD validator or other
    software from offering to perform certain modifications to the
    infoset before validating it.  (I believe that some XSD validators
    do some infoset fixup on the output of XInclude processing, before
    validating it, for example.  But it's not a prominent feature of 
    most validators.)

    If, in the input infoset, the string ' 0F B7' occurs as the
    [normalized value] of an attribute, or the six characters in that
    string occur, in order, as character children of an element which
    has no element children and no other character children, then in
    either case the string ' 0F B7' is the 'initial value' handled by
    the XSD processor.
   
  3 To keep things simple, I'll assume that the element or attribute 
    we are dealing with is assigned the type xsd:hexBinary by the 
    schema being used for validation.

    Then in schema validation as defined by the XSD spec, whitespace
    normalization is performed on the initial value.  The whitespace
    facet of xsd:hexBinary has the fixed value 'collapse', so the
    result of whitespace normalization is '0F B7', which is the
    'normalized value' for XSD purposes.

  4 The validation rule Datatype Valid defined by XSD part 2 is
    applied to the normalized value.  That validation rule says (in
    XSD 1.1; 1.0 has a more complicated procedural formulation that is
    intended to amount to the same thing):

        A ·literal· is datatype-valid with respect to a Simple Type
        Definition if and only if it is a member of the ·lexical
        space· of the corresponding datatype.
    
    The literal '0F B7' is not a member of the lexical space of
    xsd:hexBinary, so it's not datatype-valid with respect to that
    simple type definition.

  5 A conforming XSD validator will report, using whatever interface
    it defines, that the element or attribute we started with is not
    schema-valid.

    XSD validators are not required to expose all parts of the
    post-validation infoset, so there is no guarantee that the
    validity of individual elements and attributes will be exposed.
    In practice, however, you'll usually at least get an error message
    pointing to the offending literal, here '0F B7', and identifying
    the type it's not an instance of.

  6 The consuming application will do whatever it chooses to do with
    the information that the input document is not schema-valid.  

    XSD is carefully designed to make it feasible for the consuming
    application to recover gracefully from isolated problems in the
    input.  But most application designers treat validity as an all or
    nothing property and will abort if the input is not valid.  That's
    a choice they make.
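
Steps 1 through 4 can be imitated in a few lines of Python.  This is
only a sketch under stated assumptions: xml.dom.minidom stands in for
the XML parser (it is not a schema validator), the namespace prefixes
of the examples are dropped to keep the input short, and the helper
names initial_value, collapse, and datatype_valid are invented here.

```python
import re
from xml.dom import minidom

HEX_BINARY = re.compile(r"(?:[0-9a-fA-F]{2})*")


def initial_value(element):
    """Step 2: concatenate the character children, skipping comments."""
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)


def collapse(value):
    """Step 3: white space normalization for the value 'collapse'."""
    value = re.sub(r"[\t\n\r]", " ", value)   # tab, LF, CR become spaces
    value = re.sub(r" +", " ", value)         # runs of spaces become one
    return value.strip(" ")                   # drop leading and trailing


def datatype_valid(literal):
    """Step 4: is the literal in the lexical space of xsd:hexBinary?"""
    return HEX_BINARY.fullmatch(literal) is not None


# Step 1: parse.  Element form, with a comment between the two halves.
elem_doc = minidom.parseString("<modulus> 0F <!--* hi, mom! *-->B7</modulus>")
elem_initial = initial_value(elem_doc.documentElement)

# Attribute form, not declared in any DTD: the parser turns each
# line break into a space but leaves the other spaces alone.
attr_doc = minidom.parseString('<key modulus="\n  0F\n  B7"/>')
attr_initial = attr_doc.documentElement.getAttribute("modulus")

for initial in (elem_initial, attr_initial):
    normalized = collapse(initial)
    print(repr(initial), repr(normalized), datatype_valid(normalized))
```

Both forms end up with the normalized value '0F B7', and in both cases
the check in step 4 fails.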

Case 2.  If we are validating an XML document using a Relax NG schema
which validates '0F B7' against xsd:hexBinary, then I think steps 2
and 5 may drop out (you should check with a Relax NG expert if it
matters), but steps 1, 3, 4, and 6 apply as before.  (And even though
Relax NG doesn't define specific validity annotations for elements and
attributes as part of its output, you'll usually get an error message
identifying the literal and the datatype where a problem was
encountered.)

In either case 1 or case 2, the best opportunities for making ' 0F B7'
or '0F B7' be accepted as a lexical representation of the two-octet
string 00001111 10110111 are probably in step 2 and (if you have
access to a 1.1 processor like Saxon that provides a suitable
pre-lexical facet) step 3.
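
A rough sketch of the step 2 route, in Python: a fixup pass removes
all white space from the value before the lexical check is applied.
The function names are hypothetical, and the fixup is an assumption of
this sketch; it happens before XSD validation and is not defined by
the XSD spec.

```python
import re

HEX_BINARY = re.compile(r"(?:[0-9a-fA-F]{2})*")
XML_WHITESPACE = re.compile(r"[ \t\n\r]+")


def fix_up(initial_value):
    """Hypothetical pre-validation fixup: remove all XML white space."""
    return XML_WHITESPACE.sub("", initial_value)


def decode_hexbinary(literal):
    """Map a valid hexBinary literal to its octets; reject anything else."""
    if HEX_BINARY.fullmatch(literal) is None:
        raise ValueError("not in the lexical space of xsd:hexBinary: %r"
                         % literal)
    return bytes.fromhex(literal)


octets = decode_hexbinary(fix_up(" 0F B7"))
print(" ".join(format(octet, "08b") for octet in octets))
# prints: 00001111 10110111
```

Without the call to fix_up, decode_hexbinary rejects both ' 0F B7' and
'0F B7', which is the behavior of the unmodified pipeline.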

(A cynic might say that there is little difference between a processor
that decides to allow white space in the lexical space of
xsd:hexBinary and thus fails to conform to the XSD spec and introduces
an incompatibility of the kind Liam Quin warns against, and an
XSD-conformant processor which exploits the unconstrained nature of
step 2 and removes whitespace from certain items in the infoset before
validating, except that one defines what it does in simple, clear
terms and the other covers it up with mumbo jumbo.  But far be it from
me to agree with such a cynic.  I am almost never that cynical.)

Case 3.  We are not validating an XML document, so XSD Part 1
(Structures) does not apply.  We are in some other context where
literals are identified and checked against simple types.  Logically,
the following steps apply:

  1 The literal to be validated is identified.  

    Usually this is going to be the sequence of characters found in
    the input (whatever that is), with no funny business.  But XSD
    Part 2 doesn't say anything about that, and funny business is
    certainly feasible here.

    For purposes of our example, I'll assume that the literal
    identified is ' 0F B7'.

  2 If the controlling spec says to apply whitespace normalization as
    determined by the whitespace facet, then that's done.  If the
    controlling spec says not to apply whitespace normalization, then
    it's not done.  (If the controlling spec doesn't say, then it
    probably should be made clearer.)

    I believe I was told some time ago that the relevant RDF specs are
    clear that whitespace normalization is not applied.  That was some
    time ago, and I might have misunderstood, but I don't think so: I
    argued that it was user-friendlier to perform the whitespace
    normalization, but was told the WG had carefully decided not to do
    so.

    In XSD 1.1, any other pre-lexical facets are also applied at this
    time.  (XSD defines no other pre-lexical facets, but other specs
    may.)

    At this point, our literal is either '0F B7' or ' 0F B7'.

  3 Either way, it's not datatype-valid, because neither of those
    forms is a member of the lexical space for xsd:hexBinary.

Here, an external spec can specify whatever pre-processing it likes as
part of step 1.  And other specs that use XSD 1.1 can also define
further pre-lexical facets for xsd:hexBinary that could have the
effect of getting rid of the whitespace.  Step 2 is also a potential
source of help in this situation, if the controlling spec refers to
XSD 1.1 and not to 1.0.  (But if the RDF specs really do forbid the
application of the whitespace facet, the responsible working groups
are probably not going to be eager to define new pre-lexical facets.
Still, you probably know more about the politics of the RDF working
groups than I do.)
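
The steps of case 3 can be sketched in Python as follows.  The flag
apply_whitespace_facet stands for whatever the controlling spec says
about whitespace normalization; it and the function names are
inventions of this sketch, not terms from XSD or RDF.

```python
import re

HEX_BINARY = re.compile(r"(?:[0-9a-fA-F]{2})*")


def collapse(value):
    """White space normalization for the value 'collapse'."""
    value = re.sub(r"[\t\n\r]", " ", value)
    value = re.sub(r" +", " ", value)
    return value.strip(" ")


def check_literal(literal, apply_whitespace_facet):
    """Return the literal as checked, and whether it is datatype-valid."""
    if apply_whitespace_facet:
        literal = collapse(literal)
    return literal, HEX_BINARY.fullmatch(literal) is not None


for literal in (" 0F B7", " 0FB7 "):
    for flag in (True, False):
        print(repr(literal), flag, check_literal(literal, flag))
```

For ' 0F B7' the result is invalid whichever way the flag is set.  The
flag matters only for a literal like ' 0FB7 ', whose white space is
all leading or trailing: it is valid if normalization is applied and
invalid if it is not.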

I hope this helps.



-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************
Received on Tuesday, 17 January 2012 02:26:15 UTC