- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Tue, 4 Sep 2007 11:06:55 -0600
- To: "Wilson, Kirk D" <Kirk.Wilson@ca.com>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, <public-sml@w3.org>
On 3 Sep 2007, at 14:16 , Wilson, Kirk D wrote:

> I recently came across an article in which the author was being
> somewhat critical of XML Schema. One of the arguments he adduced was
> the following. Consider the following schema:
> ...
> And the admittedly bogus document:
> ...
> According to this author, every XML Schema processor (he tried)
> reports the document to be valid. After your explanation of schema
> validity, I doubt that this author placed a correct interpretation
> on the result of his processors.

I'm inclined to agree with you. Also, the result reported can depend
on how the processor is invoked. XML Schema 1.0 and 1.1 describe
several different ways of starting validation, which would lead to
different results in this case. Using the terms defined in the XML
Schema 1.1 spec, these are described below, under the subheading
"Details". As noted there, some common processors use 'lax wildcard
validation' as their default, which results in [validity] = 'notKnown'
on the document's root element.

> I suspect that validity=”unknown” is actually reported in the
> PSVI, and the processors didn’t quite report that fact. But I’m not
> clear exactly what the reason for validity being unknown is. (I can
> imagine several reasons why this might be so, but I’m not sure which
> of my imaginings is the correct one.)

Well, in fact, depending on the question one asks, the element <bar/>
may have a result with a [validity] property of 'valid', 'invalid', or
'notKnown'. Full details are given below, for the studious, but the
short version is:

The element <bar/> is valid against xsd:anyType, xsd:string, and a
number of other type definitions present in the schema. If you ask a
processor to validate the element against any of those types, it
should report that the element is valid.

The element <bar/> is invalid against xsd:decimal and several other
types in the schema. If you ask a processor "is this element valid
against this type?" and mention one of those types, the processor
should report that it's invalid.

No element declarations are present in the schema described; if you
ask "is this element valid against the relevant element declaration,
if any?", the result will be [validity]='notKnown'. Since there is no
element declaration to validate against, no processor can know whether
the element is valid against 'the relevant' element declaration or
not.

> The author concludes “If an application was relying on the W3C XML
> Schema validation to screen out incorrect input, it would be in
> serious trouble.” I believe your point was that this
> conclusion is unfair. The conclusion might better read, “If an
> application was relying on the Schema validation to screen out
> “incorrect” [sic] input, the application should have a more
> profound understanding of schema validity than this author
> apparently has. In particular, the application must have specific
> knowledge of the results of the schema validation process.”

Yes. One might add that an application planning to rely on XSDL
validation to screen its input needs to know with some precision what
question it wants the processor to answer ("is this document valid?"
is not sufficiently precise, and is no more answerable than "am I
going to regret trying to process this document?"), and make sure
that that is the question it asks, by ensuring it uses the appropriate
method of invocation.

[Those not interested in detailed analysis may stop reading here.]

Details

What validation result is returned depends in part on what question is
asked; different invocation patterns ask different questions. The XSDL
spec identifies several different ways of invoking a validator;
processors may support any or all of them, and may support others as
well.

Note: the descriptions below talk about 'the validation root'; this is
the element or attribute at which validation starts.
It need not be the root element of a document, although that's a
convenient default, and many validators do in fact start validation
there by default.

I should also note that the labels 'type-driven validation' and so on
are supplied by XSDL 1.1; they are not present in 1.0, so you probably
won't see them in documentation for XSDL 1.0 processors.

type-driven validation

At invocation time, the user or application identifies a type
definition in the schema, and the validation root is validated against
that type.

The schema you describe contains all of the built-in types, plus the
type {http://www.example.com}foo. One of them needs to be specified,
for this method of invocation to work.

The element has no attributes, and no content. When validated against
the various types available in the schema, it should be valid in some
cases. This is true both for the special types (xs:anyType,
xs:anySimpleType, and xs:anyAtomicType) and for those built-in simple
types whose lexical spaces include the empty string. This second
group includes xs:string, xs:hexBinary, xs:base64Binary, xs:anyURI,
xs:normalizedString, and (surprisingly, perhaps) the misnamed
xs:token.

The element would be invalid for all the other built-in types, because
their lexical spaces don't include the empty string. I won't list
them all here.

element-driven validation

At invocation time, the user or application identifies an element or
attribute declaration in the schema, checks that the expanded name of
the validation root matches the {name} and {target namespace}
properties of the element declaration, and then validates against the
declaration in the usual way.

This method of invocation cannot be used with the schema and instance
you describe, since the schema contains no element declarations.

attribute-driven validation

Analogous to the preceding, but using an attribute declaration, not an
element declaration, and using an attribute not an element as
validation root.
This method of invocation cannot be used with the schema and instance
you describe, since the schema contains no attribute declarations and
the document instance contains no attribute instances.

lax wildcard validation

This is not infrequently used as the default method of invocation, and
I suspect this is the method actually used by the author of the
article.

The processor validates the validation root as if it had matched a lax
wildcard in the governing type definition of its parent. In practice,
that means looking for a top-level element declaration (or a top-level
attribute declaration, if the validation root is an attribute) whose
target namespace and local name properties match the validation root's
expanded name. If a declaration is found, the validation root is
validated against that declaration; otherwise, it's marked
[validity]='notKnown'.

In the example, that means the processor looks for an element
declaration to match the unqualified name 'bar'. There are no element
declarations in the schema, so none is found. The 'bar' element is
laxly assessed (this is quick, because it has no children and no
attributes to validate), and the PSVI shows the 'bar' element as
[validity]='notKnown' and [validation attempted]='none'.

strict wildcard validation

Just like lax wildcard validation, except that the validation root is
validated as if it had matched a strict wildcard, not a lax wildcard.

The PSVI for elements and attributes which match strict wildcards is
exactly the same as the PSVI for elements and attributes which match
lax wildcards. When real wildcards are concerned, the difference
shows up in the validity of the parent. When the two different
methods of invocation are concerned, the difference shows up in the
behavior of the invoker.
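
For readers who find code easier to follow than prose, here is a rough
Python sketch of the lookup just described. It is a toy model, not a
schema processor: the names assess_root and Outcome, the dictionary
standing in for the schema's top-level element declarations, and the
'acceptable' flag standing in for the invoker's reaction are all
invented for illustration and come from no real library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-in for a schema's top-level element declarations:
# (target namespace, local name) -> check applied to the element's content.
Declarations = Dict[Tuple[Optional[str], str], Callable[[str], bool]]


@dataclass
class Outcome:
    validity: str              # 'valid', 'invalid', or 'notKnown'
    validation_attempted: str  # 'full' or 'none' (simplified)
    acceptable: bool           # how the invoker reacts; not a PSVI property


def assess_root(decls: Declarations, namespace: Optional[str],
                local_name: str, content: str, strict: bool) -> Outcome:
    """Assess a validation root as if it had matched a wildcard."""
    decl = decls.get((namespace, local_name))
    if decl is None:
        # No matching declaration: the recorded properties are the same
        # in both modes; only the invoker's reaction differs.
        return Outcome("notKnown", "none", acceptable=not strict)
    ok = decl(content)
    return Outcome("valid" if ok else "invalid", "full", acceptable=ok)


# The schema under discussion declares no elements at all.
no_declarations: Declarations = {}
lax = assess_root(no_declarations, None, "bar", "", strict=False)
strict = assess_root(no_declarations, None, "bar", "", strict=True)
print(lax)
print(strict)
```

In this sketch both calls record [validity]='notKnown' and
[validation attempted]='none'; they differ only in whether the caller
treats the missing declaration as a problem.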
(If the invoker describes its invocation of the validator as lax or
strict wildcard validation, then in the lax case it's signaling that
it's all right if no declaration is found; in the strict case, it's
signaling that it's NOT all right if no declaration is found.)

In practice, if I invoke Xerces C on the sample you describe, I get
the following results:

  $ ~/bin/runxercesc foobar.xml
  Error at file /Users/cmsmcq/2007/schema/exx/foobar.xml, line 4, char 3
    Message: Schema in foobar.xsd has a different target namespace
    from the one specified in the instance document
    http://www.example.com/myschema.
  Error at file /Users/cmsmcq/2007/schema/exx/foobar.xml, line 4, char 3
    Message: Unknown element 'bar'

When I invoke Xerces J, I get:

  $ ~/bin/runxercesj foobar.xml
  [Error] foobar.xml:4:3: cvc-elt.1: Cannot find the declaration of
  element 'bar'.

When I invoke Saxon SA, I get:

  $ ~/bin/saxonsa foobar.xml foobar.xsd
  Validation error on line 6 column 7 of
  file:/Users/cmsmcq/2007/schema/exx/foobar.xml:
    Cannot validate <bar>: no element declaration available
  Validation of source document failed

Note that none of the error messages uses the term 'invalid', but the
use of the term 'error' suggests that for all three of these
processors, the default invocation mode used from the sample
command-line application is 'strict wildcard validation'.

In sum, I think you put your finger on the root of the article's
shortcomings: it's assuming a simpler notion of 'validation' than is
plausible in the real world.

Michael
Received on Tuesday, 4 September 2007 17:06:51 UTC