- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Mon, 16 Jan 2012 19:25:38 -0700
- To: Henry Story <henry.story@bblfish.net>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, liam@w3.org, www-xml-schema-comments@w3.org
On Jan 16, 2012, at 1:37 PM, Henry Story wrote:

> Here is my question to your group. How is an xsd:hexBinary with a white space
> meant to be interpreted currently?

There are two ways to answer this simple question.

The simple answer is: "an xsd:hexBinary with a white space" is a contradiction in terms. No literal with any whitespace is a member of the lexical space of xsd:hexBinary. An element or attribute of type xsd:hexBinary, whose initial value has internal white space, is invalid.

A more complicated answer is necessary if you want to understand the details -- in particular, if you want to understand what parts of the journey from an input data stream to a validity judgement are defined by the XSD spec and which are out of scope and unconstrained by XSD. I'll try to explain this as simply and fully as I can, identifying the various places at which the XSD spec leaves some room for differences of behavior and at which a language lawyer can identify some wiggle room that might allow you to achieve your goals.

Readers who find language lawyering, casuistry, and hairsplitting tedious and irritating will want to drink some herb tea before continuing.

For concreteness, let us consider the literal ' 0F B7'. The details of what happens with this literal depend upon just where and how the literal is encountered. There are several cases to consider: validation of an XML document using XSD or another schema language that uses XSD datatypes, and validation of an isolated literal outside the context of XML validation.

Case 1. In the context of schema validation of an XML document using an XSD schema, if ' 0F B7' is the normalized value of an attribute or element assigned the type xsd:hexBinary by the schema, then

1 First the XML parser parses the XML document.
Let us imagine the XML document contains the element

    <cert:key cert:modulus=" 0F B7" cert:exponent=" 65537"/>

or the element

    <cert:key>
      <rdfs:label>made on 23 November 2011 on my laptop</rdfs:label>
      <cert:modulus> 0F <!--* hi, mom! *-->B7</cert:modulus>
      <cert:exponent> 65537 </cert:exponent>
    </cert:key>

Then (to use the vocabulary of the XML Information Set spec) the string ' 0F B7' is the [normalized value] of the attribute information item for the cert:modulus attribute. (Or rather, since I'm in language-lawyer mode: it might be. Whether the [normalized value] is ' 0F B7' or ' 0F B7' or '0F B7' depends on whether attribute cert:modulus is declared in the DTD, and how.) And the six characters ' ', '0', 'F', ' ', 'B', and '7' are the sole character children of the cert:modulus element, and the cert:modulus element has no element children.

2 The validation software creates a representation of an information set to validate.

In the normal course of events, the infoset to be validated is the infoset generated by parsing the XML document we started with. But there is nothing to prohibit an XSD validator or other software from offering to perform certain modifications to the infoset before validating it. (I believe that some XSD validators do some infoset fixup on the output of XInclude processing, before validating it, for example. But it's not a prominent feature of most validators.)

If, in the input infoset, the string ' 0F B7' occurs as the [normalized value] of an attribute, or the six characters in that string occur, in order, as character children of an element which has no element children and no other character children, then in either case the string ' 0F B7' is the 'initial value' handled by the XSD processor.

3 To keep things simple, I'll assume that the element or attribute we are dealing with is assigned the type xsd:hexBinary by the schema being used for validation.
Then in schema validation as defined by the XSD spec, whitespace normalization is performed on the initial value. The whitespace facet of xsd:hexBinary has the fixed value 'collapse', so the result of whitespace normalization is '0F B7', which is the 'normalized value' for XSD purposes.

4 The validation rule Datatype Valid defined by XSD part 2 is applied to the normalized value. That validation rule says (in XSD 1.1; 1.0 has a more complicated procedural formulation that is intended to amount to the same thing):

    A ·literal· is datatype-valid with respect to a Simple Type Definition if and only if it is a member of the ·lexical space· of the corresponding datatype.

The literal '0F B7' is not a member of the lexical space of xsd:hexBinary, so it's not datatype-valid with respect to that simple type definition.

5 A conforming XSD validator will report, using whatever interface it defines, that the element or attribute we started with is not schema-valid. XSD validators are not required to expose all parts of the post-validation infoset, so there is no guarantee that the validity of individual elements and attributes will be exposed. In practice, however, you'll usually at least get an error message pointing to the offending literal, here '0F B7', and identifying the type it's not an instance of.

6 The consuming application will do whatever it chooses to do with the information that the input document is not schema-valid. XSD is carefully designed to make it feasible for the consuming application to recover gracefully from isolated problems in the input. But most application designers treat validity as an all or nothing property and will abort if the input is not valid. That's a choice they make.

Case 2. If we are validating an XML document using a Relax NG schema which validates '0F B7' against xsd:hexBinary, then I think steps 2 and 5 may drop out (you should check with a Relax NG expert if it matters), but steps 1, 3, 4, and 6 apply as before.
(And even though Relax NG doesn't define specific validity annotations for elements and attributes as part of its output, you'll usually get an error message identifying the literal and the datatype where a problem was encountered.)

In either case 1 or case 2, the best opportunities for making ' 0F B7' or '0F B7' be accepted as a lexical representation of the two-octet string 00001111 10110111 are probably in step 2 and (if you have access to a 1.1 processor like Saxon that provides a suitable pre-lexical facet) step 3.

(A cynic might say that there is little difference between a processor that decides to allow white space in the lexical space of xsd:hexBinary and thus fails to conform to the XSD spec and introduces an incompatibility of the kind Liam Quin warns against, and an XSD-conformant processor which exploits the unconstrained nature of step 2 and removes whitespace from certain items in the infoset before validating, except that one defines what it does in simple, clear terms and the other covers it up with mumbo jumbo. But far be it from me to agree with such a cynic. I am almost never that cynical.)

Case 3. We are not validating an XML document, so XSD Part 1 (Structures) does not apply. We are in some other context where literals are identified and checked against simple types. Logically, the following steps apply:

1 The literal to be validated is identified. Usually this is going to be the sequence of characters found in the input (whatever that is), with no funny business. But XSD Part 2 doesn't say anything about that, and funny business is certainly feasible here. For purposes of our example, I'll assume that the literal identified is ' 0F B7'.

2 If the controlling spec says to apply whitespace normalization as determined by the whitespace facet, then that's done. If the controlling spec says not to apply whitespace normalization, then it's not done. (If the controlling spec doesn't say, then it probably should be made clearer.)
I believe I was told some time ago that the relevant RDF specs are clear that whitespace normalization is not applied. That was some time ago, and I might have misunderstood, but I don't think so: I argued that it was user-friendlier to perform the whitespace normalization, but was told the WG had carefully decided not to do so.

In XSD 1.1, any other pre-lexical facets are also applied at this time. (XSD defines no other pre-lexical facets, but other specs may.)

At this point, our literal is either '0F B7' or ' 0F B7'.

3 Either way, it's not datatype-valid, because neither of those forms is a member of the lexical space for xsd:hexBinary.

Here, an external spec can specify whatever pre-processing it likes as part of step 1. And other specs that use XSD 1.1 can also define further pre-lexical facets for xsd:hexBinary that could have the effect of getting rid of the whitespace. Step 2 is also a potential source of help in this situation, if the controlling spec refers to XSD 1.1 and not to 1.0.

(But if the RDF specs really do forbid the application of the whitespace facet, the responsible working groups are probably not going to be eager to define new pre-lexical facets. Still, you probably know more about the politics of the RDF working groups than I do.)

I hope this helps.

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************
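For readers who prefer code to prose, the whitespace normalization and Datatype Valid steps described in the message (steps 3 and 4 of case 1, and the three steps of case 3) can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names (`collapse`, `is_datatype_valid_hex_binary`, `strip_all_whitespace`, `check`) are hypothetical and come from no XSD library, and `strip_all_whitespace` merely stands in for the kind of pre-processing or additional pre-lexical facet that an external spec or processor might choose to define. XSD itself defines no such facet.

```python
import re

# Lexical space of xsd:hexBinary: zero or more pairs of hexadecimal digits.
_HEX_BINARY = re.compile(r"(?:[0-9a-fA-F]{2})*")


def collapse(initial_value: str) -> str:
    """Apply whiteSpace='collapse': turn tab, LF and CR into spaces,
    merge runs of spaces into one, and trim leading/trailing spaces."""
    replaced = re.sub(r"[\t\n\r]", " ", initial_value)
    return re.sub(r" +", " ", replaced).strip(" ")


def is_datatype_valid_hex_binary(literal: str) -> bool:
    """Datatype Valid: the literal must be in the lexical space."""
    return _HEX_BINARY.fullmatch(literal) is not None


def strip_all_whitespace(literal: str) -> str:
    """HYPOTHETICAL pre-processing (not defined by XSD): drop all whitespace."""
    return re.sub(r"[\t\n\r ]", "", literal)


def check(literal: str, normalize: bool = True, preprocess=None):
    """Return (literal as finally checked, validity verdict).

    preprocess -- optional hook applied first (assumption: supplied by a
                  controlling spec or processor, never by XSD itself)
    normalize  -- whether the controlling spec asks for the whiteSpace facet
    """
    if preprocess is not None:
        literal = preprocess(literal)
    if normalize:
        literal = collapse(literal)
    return literal, is_datatype_valid_hex_binary(literal)


print(check(" 0F B7"))                                   # collapse applied
print(check(" 0F B7", normalize=False))                  # no normalization
print(check(" 0F B7", preprocess=strip_all_whitespace))  # hypothetical fixup
```

With the plain `collapse` step the literal becomes '0F B7' and is rejected; without normalization it stays ' 0F B7' and is rejected as well. Only the hypothetical whitespace-stripping hook yields '0FB7', which is in the lexical space and denotes the two octets 00001111 10110111.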
Received on Tuesday, 17 January 2012 02:26:15 UTC