Issue LC-177 (Tighten conformance rules?)

Murray -

In your review of Part I of the last-call draft of XML Schema, you 
commented among other things on the rules governing XML Schema 
validation and conformance.

>Sec. 6.1 Layer 1: Summary of the schema-validation core 
>
>Another instance of befuddlement. How can this be considered
>acceptable? (hilighting mine):
>
>The obligation of a schema-aware processor as far as the
>schema-validation core is concerned is to implement the definitions of
>schema-valid given below in Schema Validation of Documents (§7.2)
>. Neither the choice of element information item to be
>schema-validated, nor which of three means of initiating validation
>are used, is within the scope of this specification.

...

>Sec. 7.9 Missing Sub-components
>
>I've tried three or four times to write up something about this
>section. Because of my incomplete understanding of the rest of the
>spec it's difficult to confidently summarize, but my reaction in
>general is one of mild shock. I long for the days of 'draconian' error
>handling, and can only attempt to imagine a Web where §7.9 becomes the
>norm for XML processing.

These comments have been included in the XML Schema last-call issues
list [LCI] and assigned issue number LC-177 for tracking purposes.

The XML Schema WG has discussed issue LC-177 this week, and I have
been asked to reply to you, explaining the rationale for the rules as
they exist.  Our review has confirmed that the rules as they are
specified do reflect the consensus of the WG.

The rules, and our reasons for them, are as follows:

  A Within a document, the schemaLoc attribute can be used on any
    element to provide a suggestion for where to locate a (not 'the')
    schema for a particular namespace.  

    (Rationale: there may be any number of documents with a claim to
    be the normative definition of a namespace: prose documentation in
    various languages, formal specifications in DTD, XML Schema, RDF
    Schema, or other syntax, and so on.  There may be multiple
    formalizations of the same namespace -- HTML is a well known
    example.  Some believe that proper support for content negotiation
    in servers and clients would allow all of these resources to be
    retrievable from the URI which identifies the namespace, but
    content negotiation is currently implemented only imperfectly and
    incompletely by software and incompletely understood by the
    average user.  For these and other reasons, it is not possible --
    and in the view of some, not desirable -- to guarantee that when
    one dereferences a namespace name the result will be an XML Schema
    document.  It is therefore useful to have a safety valve for cases
    where the namespace name cannot be dereferenced, or does not yield
    an XML Schema document when dereferenced.)

  B The schemaLoc attribute is, formally, a *hint*, not an
    instruction.  It may be taken as a claim that a schema for the
    namespace in question may be found at the location indicated.  The
    schema validator is not required to take the hint.  The exact
    method by which a schema validator finds a schema is out of scope
    and system dependent.  We expect schema validators to use
    mechanisms like command-line options and arguments, menus,
    environment variables, and any other user-interface mechanism
    implementors think their users will find helpful.

    (Rationale: if I am receiving data from you, either I trust you or
    I validate the data.  If I don't trust your claim that the
    document is valid, how on earth can I be expected to trust your
    claim that the schema at a given URI is the one we agreed to
    validate against?  I can't be.  So I need to have the right to
    tell the schema processor, "I don't care what the other guy said
    is a good schema, the schema *I* trust for this namespace is right
    *here*."  Since the authoritative word must come from the user,
    not the document, and since we don't want to interfere with user
    interface design, it would be a huge mistake to prescribe a
    particular approach to allowing the user to say where to find
    schemas.  Obviously, a processor can provide a 'trust the
    schemaLoc' option which will work in many cases.)

  C The schemaLoc attribute also constitutes a claim that the relevant
    parts of the document conform to that schema for the namespace in
    question.

    (Rationale: there is a range of opinion about the degree to which
    claims about validity should be expressed, or expressible, in the
    document itself; the view expressed here is a compromise between a
    position which advocates that the document instance be interpreted
    as making somewhat stronger claims, and a position which advocates
    that all such claims be expressed outside the document itself and
    that the meaning of schemaLoc be limited to what is described
    above in item B.  

    The claim that a document is valid vis-a-vis a given schema
    document for a particular namespace is logically distinct from a
    request to validate the document, or from a request that the
    particular schema document be used to validate the elements from
    that namespace in the document: whether the document is validated,
    and if so which schema documents are used, may vary from
    circumstance to circumstance.)

  D The presence of a schemaLoc attribute does *not* constitute a
    request for validation.

    (Rationale: there are many situations in which a document should
    be read, possibly by a processor which understands how to validate
    it, but does not need to be, or SHOULD NOT be, validated.  A
    request for validation is a transaction between a user and a piece
    of software, or between two pieces of software.  It is not a
    declarative fact about a document.  It is best left to a user
    interface.)

  E If more than one schema location is suggested for a particular
    namespace, it is not an error, but no particular priority is
    assigned to any of them.

    (Rationale:  they are HINTS, right?)

  F A validation process may start at any element in the document and
    work down.

    (Rationale: Launching a validation process is taken to be a matter
    between a user and a piece of software, or between two pieces of
    software.  It may sometimes be important to validate the entire
    document; sometimes only certain parts of the document need to be
    validated. Since the presence of a schemaLoc attribute does not
    constitute a request for validation (and its absence cannot be
    taken as a binding request *not* to validate), the user is free to
    select any point as the starting point.  It may be expected that
    some schema validators will, by default, start at the top of the
    document.  But it is important that they are not REQUIRED to do
    so.)

  G A validation process may work in strict mode, lax mode, or skip
    mode.  In checking the schema-validity of the document, the
    processor must switch from mode to mode on the basis of the
    {process contents} property on the relevant schema component.

    (Rationale: For some applications, it's essential to check every
    element and every attribute, and to insist that they be declared,
    roughly as in a DTD.  This is strict mode.

    For some applications (black-box applications), it's essential to
    be able to specify that the schema applies only to some outer
    envelope, which contains well-formed XML as a payload, and that
    the payload does not need to conform to the schema and should be
    skipped entirely.  Think of defining an information retrieval
    protocol like Z39.50 as a set of XML messages going back and
    forth.  The envelope needs to conform to the schema, but the
    payload does not need to conform, and it would normally be a waste
    of cycles to try to validate the payload.  This is skip mode.

    For some applications (white-box applications), there may be a
    payload which need not be validated, and the elements in it need
    not be declared, but if elements are encountered for which
    declarations *are* available, they should be validated.  In a
    template in an XSL stylesheet, for example, I may not care about
    validating the elements in the target namespace.  (In fact, it is
    highly unlikely that I *can* validate them without writing a
    specialized schema for them: the target schema is unlikely to
    allow <xsl:value-of> elements in the right places.)  But if I see
    another XSL element inside a target element, I probably do want to
    validate it.  This is 'lax' mode (known informally as
    'opportunistic validation').

    So strict, skip, and lax are each necessary, because each
    describes a plausible approach to validation and to coexistence of
    schemas and namespaces.)

  H In checking schema validity, a validation process must be guided
    by the {process contents} property on the relevant schema
    components, but it NEED NOT restrict itself to checking
    schema-validity only.  For example, a processor may offer an
    option to check all elements strictly, even if the schema only
    requires lax processing.

    (Rationale: the schema may have been devised for skip-processing,
    but for my purposes I may insist on lax or strict processing.  My
    business partners may not care about the contents of the payload,
    but for my purposes I want to know that if the payload contains
    anything that claims to be a purchase order, then it jolly well
    conforms to my schema for purchase orders.)

  I If in the schema the relevant {process contents} property has the
    value 'strict' or 'lax' or 'skip', this may be interpreted as a
    declarative statement that documents which conform to this schema
    must have no errors when processed in the specified mode.  It
    follows that if a schema processor processes a black-box payload
    (declared with processContents='skip') in lax mode, and finds an
    error, the error in question is not a schema-validity error.  

    (Rationale: all schema processors should give the same results, as
    regards schema validity.  If the schema says something should be
    skip-conformant, you do have the right to check it in strict or
    lax mode, but you and your processor do not have the right to call
    failure to conform to the rules of strict or lax mode a schema
    validity error.  Put in other terms: you can define your *own*
    validation property, say [strict validity], and get your processor
    to compute it, but you can't produce a PSV Infoset that records
    strict validity in the [validity] property -- the XML Schema spec
    defines what that property means, and you can't change that.

    As long as the processor distinguishes between failure to conform
    with the restrictions laid out in the schema, and other failures,
    all is well.  You might also want a processor to check to make
    sure the document is in ASCII, not UTF-8 or UTF-16.  That's your
    right, and it's OK.  But the processor is not allowed to claim
    that a UTF-16 document is ill formed on that account.)
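
To make rules F through I a little more concrete, here is a minimal
sketch in Python.  It is a toy model, not an XML Schema processor:
the Element class, the DECLS table, the 'floor' parameter, and the
report keys are all hypothetical names invented for illustration, and
the only constraints checked are "element is declared" and "required
attribute is present".  It shows a processor switching modes on the
basis of the declared process-contents value, a user asking for
stricter checking than the schema requires, and the results of that
extra checking being kept apart from schema-validity errors.

```python
from dataclasses import dataclass, field

# Strictness ordering used to combine the schema's mode with the user's floor.
RANK = {"skip": 0, "lax": 1, "strict": 2}

@dataclass
class Element:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

# Hypothetical toy schema:
# element name -> (required attribute names, process-contents mode for children)
DECLS = {
    "envelope": (set(), "strict"),
    "payload": (set(), "skip"),
    "order": ({"id"}, "strict"),
}

def check(elem, schema_mode, effective_mode, floor, report):
    """Check elem; schema_mode is what the schema asks for, effective_mode
    is what the user asked for (never less strict than the schema)."""
    if effective_mode == "skip":
        return

    def record(message, caught_in):
        # Rule I: only failures the schema's own mode would catch are
        # schema-validity errors; anything else is an extra finding.
        bucket = "schema_errors" if schema_mode in caught_in else "extra_findings"
        report[bucket].append(message)

    decl = DECLS.get(elem.name)
    if decl is None:
        # Undeclared elements are errors in strict mode only.
        if effective_mode == "strict":
            record(f"undeclared element <{elem.name}>", {"strict"})
        child_schema, child_effective = schema_mode, effective_mode
    else:
        required, contents = decl
        for attr in sorted(required - set(elem.attrs)):
            record(f"<{elem.name}> lacks attribute '{attr}'", {"strict", "lax"})
        # Rule G: the declaration decides the mode for the children, but
        # nothing inside a schema-skipped region counts for the schema.
        child_schema = "skip" if schema_mode == "skip" else contents
        # Rule H: the user may insist on at least 'floor' everywhere.
        child_effective = max(contents, floor, key=RANK.get)

    for child in elem.children:
        check(child, child_schema, child_effective, floor, report)

def validate(start, floor="skip"):
    """Rule F: validation may begin at any element, not only the root."""
    report = {"schema_errors": [], "extra_findings": []}
    check(start, "strict", "strict", floor, report)
    return report

doc = Element("envelope", children=[
    Element("payload", children=[Element("order"), Element("junk")]),
])
print(validate(doc))                # payload skipped, nothing reported
print(validate(doc, floor="lax"))   # missing 'id' reported as an extra finding
```

In this sketch the document stays schema-valid however strictly the
user chooses to inspect the payload; the stricter checks only ever
add entries to the separate list.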


I believe that you were mostly surprised and unhappy over rules B and
F; I have included the others partly because I think they help make
the picture more complete, and partly because some of them are
becoming hobbyhorses of mine.  
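
Since rule B was one of the surprising ones, here is a minimal sketch
in Python of the lookup policy it allows.  The function name, its
parameters, and the example.org locations are hypothetical; the spec
deliberately prescribes no such interface, and a real processor might
gather the user's choices from command-line options, menus, or
environment variables instead of a dictionary.

```python
def candidate_schemas(namespace, document_hints, user_choices, trust_hints=False):
    """Return the schema locations a processor might try for a namespace.

    document_hints: namespace -> list of locations suggested by schemaLoc
    user_choices:   namespace -> the location the user trusts
    trust_hints:    the optional 'trust the schemaLoc' setting
    """
    # The authoritative word comes from the user, not the document.
    if namespace in user_choices:
        return [user_choices[namespace]]
    # Hints are used only if the user has opted in.  Rule E assigns no
    # priority among several hints; document order here is incidental.
    if trust_hints:
        return list(document_hints.get(namespace, []))
    # Otherwise the processor is free not to take the hint at all.
    return []

NS = "http://example.org/po"  # hypothetical namespace name
hints = {NS: ["http://example.org/a.xsd", "http://example.org/b.xsd"]}

print(candidate_schemas(NS, hints, {NS: "/local/po.xsd"}, trust_hints=True))
print(candidate_schemas(NS, hints, {}, trust_hints=True))
print(candidate_schemas(NS, hints, {}))
```

The first call returns only the user's own location even though
hints are trusted, which is the "the schema *I* trust is right
*here*" case from the rationale for rule B.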

I hope this description explains both why the rules are as they are,
and why the WG does not feel they should be changed in response to
your desire for stricter rules.  The strict behavior you wish can be
achieved: the user merely needs to specify that the entire document
must be validated using strict validation.  Requiring that all
documents be validated in their entirety, and in the same strict mode,
would replicate the shortcomings of DTDs for describing extensible
markup languages.

Please let me know whether this sufficiently addresses your concerns
about the conformance rules of XML Schema.

best regards,

Michael Sperberg-McQueen

-- 
****************************************************
* C. M. Sperberg-McQueen                           *
* Research Staff, World Wide Web Consortium        *
* Route 1, Box 380A, Espa&ntilde;ola NM 87532-9765 *
* (that's Espanola with an n-tilde)                *
* cmsmcq@acm.org, fax: +1 (505) 747-1424           *
****************************************************

Received on Friday, 30 June 2000 15:02:41 UTC