Re: determine root element in the xml from schema from C. M. Sperberg-McQueen on 2004-03-07 (xmlschema-dev@w3.org from March 2004)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: 07 Mar 2004 19:36:35 +0000
To: Mik Lernout <mik@futurestreet.org>
Cc: Michael Kay <mhk@mhk.me.uk>, 'Lingzhi Zhang' <lzhang@cse.ogi.edu>, 'dev xmlschema' <xmlschema-dev@w3.org>
Message-Id: <1078688194.2573.22.camel@localhost>

On Sun, 2004-03-07 at 12:10, Mik Lernout wrote:
> I do agree with Stephen/Lingzhi Zhang here: in a normal use case of 
> XMLSchema you will want to confine validation to only one valid 
> root-element. 

I agree that this is a normal use case.  It is perhaps
important to point out, however, that the opposite is
equally normal; historically there are certainly examples
of document type definitions or schemas designed to allow
multiple choices of root elements.  (The Text Encoding
Initiative, to name one concrete example, defines both 
a "TEI.2" element and a "teiCorpus.2" element.  The
XHTML Modularization specification defines numerous HTML
modules which are intended to be independently usable.)

In not identifying a single root element, a schema conforming
to W3C XML Schema resembles not so much a context-free document 
grammar as the set of vocabularies and production rules which 
make up part of such a grammar. For what it's worth, this was 
a conscious design choice on the part of the working group.  The
analogy with document type definitions seemed more relevant
than the analogy with context-free grammars defined as a tuple
of terminal vocabulary, non-terminal vocabulary, start symbol,
and set of production rules.

> The only reason to register multiple global elements would 
> be to be able to use them when importing/including the schema, when 
> referring to the element from within the schema, ... There is a big 
> security / application integrity aspect that is touched here as well: it 
> is pretty typical for applications to use XMLSchema for validation and 
> it would be very easy to bypass this validation completely by using a 
> root element that is also registered globally but not "intended" to be 
> used as a root element.

If it is important to start with a specific root, it should
be possible to invoke the schema processor in such a way as
to specify the element declaration which must match the
root element, as described in section 5.2 of part 1 of the
schema spec.  Security concerns are certainly one reason one
might wish to invoke the processor in such a way.

> Michael: I agree with you that a schema should be able to match more 
> than one instance document, but I do also believe that it should be 
> able to match only one "type" of instance document. If you have a look 
> at the "Purchase Order Schema" in the primer spec, do you think it is 
> the intention of the schema writer to be able to validate 
> "<comment>abc</comment>" as a valid instance of this schema? Or that the 
> application that will validate this will be constructed to be able to 
> cope with this instance?

Let us hope that the purchase-order application knows it's 
looking for a valid purchaseOrder element, not just any element 
valid against the schema.

Formally, I'm not sure the XML Schema spec defines the term
"X is a valid instance of schema Y"; to the extent that it is
the Working Group which is responsible for the sample purchase
order specification, I can say I don't think the WG has any
problem with a document with root element of 'comment' which is
valid against the schema.  It won't be a very interesting
document, but the schema can certainly be used to validate it.

Perhaps the tutorial should mention the fact, in order to
make people aware that they do need to check the type of the
root element.
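
As a rough illustration of such a check (a minimal sketch, not
taken from the spec or from any particular schema processor), an
application could inspect the root element's name itself before
accepting a document.  The function name root_matches and its
parameters are hypothetical, and the example uses only Python's
standard xml.etree.ElementTree parser; it does not perform schema
validation at all, so it would sit alongside the validator, not
replace it.

```python
import xml.etree.ElementTree as ET


def root_matches(xml_text, expected_local_name, expected_namespace=None):
    """Return True if the document's root element has the expected name.

    Hypothetical helper: it only compares names and does no schema
    validation of its own.
    """
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced tags as "{namespace-uri}local-name".
    if expected_namespace:
        expected_tag = "{%s}%s" % (expected_namespace, expected_local_name)
    else:
        expected_tag = expected_local_name
    return root.tag == expected_tag


# A purchase-order application would accept the first document
# and reject the second, whatever the schema says about 'comment'.
print(root_matches("<purchaseOrder><comment>abc</comment></purchaseOrder>",
                   "purchaseOrder"))   # prints True
print(root_matches("<comment>abc</comment>", "purchaseOrder"))  # prints False
```

A check of this kind covers only the element name; whether the
root is valid against the intended declaration is still the
schema processor's job.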

> Maybe I am completely off-base/confused here and this kind of 
> "unpredictable behaviour" is intended by the creators of the spec, but 
> then don't we have a serious communication problem in how the spec is 
> being read by the people who are writing XMLSchema validators and 
> applications? If this were the case, it would seem logical, for example, 
> to be able to mark, when validating, the root element you wish 
> to validate against, as in: validator.validate(po.xsd, 
> 'purchaseOrder'). Why isn't this the case?

I believe it is the case with at least some processors; to the
extent that it's not the case with others, I suspect that not
enough paying customers have made clear they want the capacity
to specify any of the options outlined in section 5.2:  where 
to start validation (doesn't need to be at the root), what element
declaration to start with, what complex type definition to start
with.

Just my two cents,

-C. M. Sperberg-McQueen
 World Wide Web Consortium / MIT CSAIL
Received on Sunday, 7 March 2004 21:38:24 UTC