Re: [w3c sml] A question on "schema validity" (for MSM) from C. M. Sperberg-McQueen on 2007-09-04 (public-sml@w3.org from September 2007)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Tue, 4 Sep 2007 11:06:55 -0600
To: "Wilson, Kirk D" <Kirk.Wilson@ca.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, <public-sml@w3.org>
Message-Id: <6E437C1B-09C2-4E9C-B331-324BB7EB28B2@acm.org>
On 3 Sep 2007, at 14:16 , Wilson, Kirk D wrote:

 > I recently came across an article in which the author was being
 > somewhat critical of XML Schema.  One of the arguments he adduced was
 > the following.  Consider the following schema:
 > ...
 > And the admittedly bogus document:
 > ...
 > According to this author, every XML Schema processor (he tried)
 > reports the document to be valid.  After your explanation of schema
 > validity, I doubt that this author placed a correct interpretation
 > on the result of his processors.

I'm inclined to agree with you.  Also, the result reported can depend
on how the processor is invoked.  XML Schema 1.0 and 1.1 describe
several different ways of starting validation, which would lead to
different results in this case.

Using the terms defined in the XML Schema 1.1 spec, these are
described below, under the subheading "Details".  As noted there, some
common processors use 'lax wildcard validation' as their default,
which results in [validity] = 'notKnown' on the document's root element.

 > I suspect that validity=”unknown” is actually reported in the
 > PSVI, and the processors didn’t quite report that fact.  But I’m not
 > clear exactly what the reason for validity being unknown is.  (I can
 > imagine several reasons why this might be so, but I’m not sure which
 > of my imaginings is the correct one.)

Well, in fact, depending on the question one asks, the element <bar/>
may have a result with a [validity] property of 'valid', 'invalid', or
'notKnown'.  Full details are given below, for the studious, but the
short version is:

   The element <bar/> is valid against xsd:anyType, xsd:string, and a
   number of other type definitions present in the schema.  If you ask
   a processor to validate the element against any of those types, it
   should report that the element is valid.

   The element <bar/> is invalid against xsd:decimal and several other
   types in the schema.  If you ask a processor "is this element valid
   against this type?" and mention one of those types, the processor
   should report that it's invalid.

   No element declarations are present in the schema described; if
   you ask "is this element valid against the relevant element
   declaration, if any?", the result will be [validity]='notKnown'.
   Since there is no element declaration to validate against, no
   processor can know whether the element is valid against 'the
   relevant' element declaration or not.


 > The author concludes “If an application was relying on the W3C XML
 > Schema validation to screen out incorrect input, it would be in
 > serious trouble.” I believe your point was that this
 > conclusion is unfair.  The conclusion might better read, “If an
 > application was relying on the Schema validation to screen out
 > “incorrect” [sic] input, the application should have a more
 > profound understanding of schema validity than this author
 > apparently has.  In particular, the application must have specific
 > knowledge of the results of the schema validation process.”

Yes.

One might add that an application planning to rely on XSDL validation
to screen its input needs to know with some precision what question
it wants the processor to answer ("is this document valid?" is not
sufficiently precise, and is no more answerable than "am I going to
regret trying to process this document?"), and needs to make sure that
that is the question it asks, by using the appropriate method of
invocation.
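
As a rough illustration of what "knowing the question" can mean in
application code, here is a minimal Python sketch.  The Outcome class
and the acceptable() function are hypothetical names invented for this
example; they stand in for whatever your processor exposes for the
[validity] and [validation attempted] properties of the validation
root, and are not any real processor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical stand-in for the PSVI properties of the validation root."""
    validity: str              # 'valid', 'invalid', or 'notKnown'
    validation_attempted: str  # 'full', 'partial', or 'none'


def acceptable(outcome: Outcome) -> bool:
    """Accept input only if it was fully assessed and found valid.

    'notKnown' is treated as a rejection: the absence of an error is
    not the same thing as a positive validation result.
    """
    return outcome.validity == "valid" and outcome.validation_attempted == "full"


# The result described in this message for lax wildcard validation of <bar/>.
lax_result = Outcome(validity="notKnown", validation_attempted="none")
print(acceptable(lax_result))
```

An application that only checks "did the processor raise an error?"
would let the lax result through; one that checks the properties
explicitly would not.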

[Those not interested in detailed analysis may stop reading here.]


Details

What validation result is returned depends in part on what question is
asked; different invocation patterns ask different questions.  The
XSDL spec identifies several different ways of invoking a validator;
processors may support any or all of them, and may support others as
well.

Note: the descriptions below talk about 'the validation root'; this is
the element or attribute at which validation starts.  It need not be
the root element of a document, although that's a convenient default,
and many validators do in fact start validation there by default.

I should also note that the labels 'type-driven validation' and so on
are supplied by XSDL 1.1; they are not present in 1.0, so you probably
won't see them in documentation for XSDL 1.0 processors.

type-driven validation

     At invocation time, the user or application identifies a type
     definition in the schema, and the validation root is validated
     against that type.

     The schema you describe contains all of the built-in types, plus
     the type {http://www.example.com}foo.  One of them needs to be
     specified, for this method of invocation to work.

     The element has no attributes, and no content.  When validated
     against the various types available in the schema, it should be
     valid in some cases.  This is true both for the special types
     (xs:anyType, xs:anySimpleType, and xs:anyAtomicType) and for those
     built-in simple types whose lexical spaces include the empty
     string.  This second group includes xs:string, xs:hexBinary,
     xs:base64Binary, xs:anyURI, xs:normalizedString, and
     (surprisingly, perhaps) the misnamed xs:token.

     The element would be invalid for all the other built-in types,
     because their lexical spaces don't include the empty string.  I
     won't list them all here.
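
The point about lexical spaces can be sketched in a few lines of
Python.  The patterns below are deliberately simplified approximations
that I am assuming for illustration (the normative definitions are in
the Datatypes part of the spec), and only a handful of types are
covered; accepts() is a hypothetical helper, not part of any
processor.

```python
import re

# Simplified lexical-space patterns for a few built-in types
# (approximations for illustration, not the normative definitions).
LEXICAL = {
    "xs:string": r"[\s\S]*",
    "xs:hexBinary": r"([0-9a-fA-F]{2})*",
    "xs:decimal": r"[+-]?(\d+(\.\d*)?|\.\d+)",
    "xs:integer": r"[+-]?\d+",
    "xs:boolean": r"true|false|1|0",
}


def accepts(type_name: str, text: str) -> bool:
    """Return True if text is in the (approximated) lexical space of type_name."""
    return re.fullmatch(LEXICAL[type_name], text) is not None


# <bar/> has no content, so the string being checked is the empty string.
for name in LEXICAL:
    print(name, "valid" if accepts(name, "") else "invalid")
```

With these patterns the empty string is accepted for xs:string and
xs:hexBinary and rejected for xs:decimal, xs:integer, and xs:boolean,
which matches the two groups described above.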


element-driven validation

     At invocation time, the user or application identifies an element
     declaration in the schema; the processor checks that the expanded
     name of the validation root matches the {name} and {target
     namespace} properties of that declaration, and then validates
     against the declaration in the usual way.

     This method of invocation cannot be used with the schema and
     instance you describe, since the schema contains no element
     declarations.


attribute-driven validation

     Analogous to the preceding, but using an attribute declaration,
     not an element declaration, and using an attribute not an element
     as validation root.

     This method of invocation cannot be used with the schema and
     instance you describe, since the schema contains no attribute
     declarations and the document instance contains no attribute
     instances.


lax wildcard validation

     This is not infrequently used as the default method of invocation,
     and I suspect this is the method actually used by the author of
     the article.

     The processor validates the validation root as if it had matched a
     lax wildcard in the governing type definition of its parent.  In
     practice, that means looking for a top-level element declaration
     (or a top-level attribute declaration, if the validation root is
     an attribute) whose target namespace and local name properties
     match the validation root's expanded name.  If a declaration is
     found, the validation root is validated against that declaration;
     otherwise, it's marked [validity]='notKnown'.

     In the example, that means the processor looks for an element
     declaration to match the unqualified name 'bar'.  There are no
     element declarations in the schema, so none is found.  The 'bar'
     element is laxly assessed (this is quick, because it has no
     children and no attributes to validate), and the PSVI shows the
     'bar' element as [validity]='notKnown' and [validation
     attempted]='none'.


strict wildcard validation

     Just like lax wildcard validation, except that the validation root
     is validated as if it had matched a strict wildcard, not a lax
     wildcard.

     The PSVI for elements and attributes which match strict wildcards
     is exactly the same as the PSVI for elements and attributes which
     match lax wildcards.  When real wildcards are concerned, the
     difference shows up in the validity of the parent.  When the two
     different methods of invocation are concerned, the difference
     shows up in the behavior of the invoker.  (If the invoker
     describes its invocation of the validator as lax or strict
     wildcard validation, then in the lax case it's signaling that it's
     all right if no declaration is found; in the strict case, it's
     signaling that it's NOT all right if no declaration is found.)
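
The two wildcard modes can be sketched as follows.  This is a toy
model under stated assumptions, not how any real processor is
implemented: ElementDecl, Outcome, find_declaration(), and
report_missing_declaration() are hypothetical names, the schema is
reduced to a list of top-level element declarations, and the case
where a declaration IS found is not modelled at all.

```python
from typing import List, NamedTuple, Optional


class ElementDecl(NamedTuple):
    target_namespace: Optional[str]  # None means "no namespace"
    name: str


class Outcome(NamedTuple):
    validity: str               # PSVI [validity]
    validation_attempted: str   # PSVI [validation attempted]
    acceptable: bool            # the invoker's verdict, not a PSVI property


def find_declaration(root_ns: Optional[str], root_name: str,
                     decls: List[ElementDecl]) -> Optional[ElementDecl]:
    """Look for a top-level declaration matching the root's expanded name."""
    for decl in decls:
        if decl.target_namespace == root_ns and decl.name == root_name:
            return decl
    return None


def report_missing_declaration(strict: bool) -> Outcome:
    """PSVI is the same in both modes; only the invoker's verdict differs."""
    return Outcome("notKnown", "none", acceptable=not strict)


# The schema discussed here has no element declarations at all.
top_level_decls: List[ElementDecl] = []
if find_declaration(None, "bar", top_level_decls) is None:
    print(report_missing_declaration(strict=False))
    print(report_missing_declaration(strict=True))
```

The only field that differs between the two printed results is the
invoker's own verdict, which is the point of the preceding paragraph.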


In practice, if I invoke Xerces C on the sample you describe, I get
the following results:

   $ ~/bin/runxercesc foobar.xml

   Error at file /Users/cmsmcq/2007/schema/exx/foobar.xml, line 4, char 3
     Message: Schema in foobar.xsd has a different target namespace
     from the one specified in the instance document
     http://www.example.com/myschema.

   Error at file /Users/cmsmcq/2007/schema/exx/foobar.xml, line 4, char 3
     Message: Unknown element 'bar'

When I invoke Xerces J, I get:

   $ ~/bin/runxercesj foobar.xml
   [Error] foobar.xml:4:3: cvc-elt.1: Cannot find the declaration of
     element 'bar'.

When I invoke Saxon SA, I get:

   $ ~/bin/saxonsa foobar.xml foobar.xsd
   Validation error on line 6 column 7 of file:/Users/cmsmcq/2007/schema/exx/foobar.xml:
     Cannot validate <bar>: no element declaration available
   Validation of source document failed

Note that none of the error messages uses the term 'invalid', but the
use of the term 'error' suggests that for all three of these
processors, the default invocation mode used from the sample
command-line application is 'strict wildcard validation'.

In sum, I think you put your finger on the root of the article's
shortcomings: it's assuming a simpler notion of 'validation' than is
plausible in the real world.

Michael
Received on Tuesday, 4 September 2007 17:06:51 UTC