Re: root element in schema from noah_mendelsohn@us.ibm.com on 2003-04-18 (xmlschema-dev@w3.org from April 2003)

From: <noah_mendelsohn@us.ibm.com>
Date: Fri, 18 Apr 2003 15:09:52 -0400
To: Erwin.Smout@ksz-bcss.fgov.be
Cc: xmlschema-dev@w3.org
Message-ID: <OFF5FCA8AB.A2C6A2F5-ON85256D0C.00510AA7@lotus.com>
Erwin Smout writes:

>> It is perfectly possible to refer to a BOOKLIST.XSD 
>> in a <BOOKLIST> root and refer to a BOOK.XSD in a <BOOK> 
>> root. With proper include-mechanisms in place, there 
>> is little extra effort involved in having these two 
>> different schemas, instead of only one that  allows 
>> different root-element-types. 

Thank you for your comments.  I understand you to be suggesting:  let each 
schema document declare exactly one root, which is to be honored if that 
schema document is referenced explicitly by a schemaLocation in the 
instance, but not if it is the target of an <xsd:include> from another 
schema.  That seems to me to be fragile in a number of dimensions.  First 
of all, there are many, many situations (such as the typical purchase 
order) in which you either can't get a schemaLocation into the instance, 
or in which you wouldn't trust it if it were there.  That's why it's a 
hint.  What do we do for all those instances that can't "name" a schema 
document? 

Furthermore, we've generally declined to have a schema document mean 
something different when it's included than when it's referenced in some 
other manner.   You can wind up with rather tricky scenarios in which the 
same schema document is referenced from multiple places (processor command 
line, schemaLocation in the instance, <xsd:include>).  If the rules for 
root depend on which of these ways you find it, then it becomes a 
constraint that all processors encounter these in the same order.  That 
makes it very hard to build streaming processors that work the same way as 
those that precompile schemas. 

Here's how I think I would design a mechanism to do what I think you want:

* I would add a new boolean property to elementDeclaration to be called 
"okAsDocumentRoot", which could be set to "true" on one or more global 
element declarations.

* I would add a new attribute to the XML form of an element declaration 
allowing <xsd:element name="n" okAsDocumentRoot="true">.  This would set 
the component property in the obvious manner.

* I would add a new mode of validation: 
- In full document mode, it would only be legal to start validation if the 
element decl that matched the root element had the boolean set to true
- To meet the need for incremental validation (see below), you would have 
an additional validation mode that would ignore the property and allow 
validation to proceed from any global element declaration.  In other 
words, do what we do today.
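
A minimal sketch of the two modes, in Python and purely for illustration. 
The GLOBAL_ELEMENTS table stands in for a compiled schema, and it and 
may_start_validation are hypothetical names, not part of any schema 
processor.  Only the decision about where validation may start is 
modelled; no actual schema validation is performed.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a compiled schema: each global element
# declaration mapped to its proposed okAsDocumentRoot property.
GLOBAL_ELEMENTS = {
    "purchaseOrder": True,
    "shippingAddress": False,
}


def may_start_validation(xml_text, full_document_mode):
    """Return True if validation may begin at the root of xml_text."""
    root = ET.fromstring(xml_text)
    if root.tag not in GLOBAL_ELEMENTS:
        # No global element declaration matches the root at all.
        return False
    if full_document_mode:
        # Full document mode: the matching declaration must be flagged.
        return GLOBAL_ELEMENTS[root.tag]
    # Incremental mode: ignore the flag, as processors do today.
    return True
```

In this sketch a lone <shippingAddress/> is refused in full document mode 
but accepted in the incremental mode.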

Is this worthwhile?  I'm not convinced, but I'm not strongly against it 
either.  It's a new property, a new attribute, and a new validation model. 
What it does is to allow you to mark in a schema document the elements 
that you intend to be a root and to have that checked.  Frankly, most of 
the applications I write know exactly what the root is to be:  if I'm a 
purchasing application, I know perfectly well that the root better be 
"purchaseOrder" and I check that very easily.  There may indeed be other 
examples where the above would be useful, and if there were a groundswell 
of support for it, I wouldn't be opposed.  As I say, we've heard this 
request only occasionally, and I'm not currently convinced it makes the 
80/20 cut we've tried for.
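
The kind of application-level root check described above might look like 
the following sketch (Python, illustrative only; is_purchase_order is a 
hypothetical helper name).  It is independent of any schema processor.

```python
import xml.etree.ElementTree as ET


def is_purchase_order(xml_text):
    """Return True if the document's root element is purchaseOrder."""
    return ET.fromstring(xml_text).tag == "purchaseOrder"
```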

Let me comment briefly on the partial validation question.  Here are a few 
use cases:  let's say you have a purchase order XML format, a fairly 
common example, and it includes a subelement named "shippingAddress".

<purchaseOrder>
        ....
        <shippingAddress>
                <street> ... </street>
                <city>...</city>
                <state>..</state>
                <zip>...</zip>
        </shippingAddress>
</purchaseOrder>

You are building a shipping application that prints the address labels for 
the items to be shipped.  It's important that some outer application 
(which may have done a schema validation on the PO or may have used some 
other means to make sure that its overall structure is sufficiently 
trustworthy) passes just the shipping address element to the shipping 
application.  That shipping application chooses to use schema validation 
on just the shippingAddress element.  That's what I mean by partial 
validation, and it is important for many such application decomposition 
scenarios.  Do I really need to separate the address into a different 
schema document?  There would be lots of them, and it seems to tie my 
processing model unnecessarily to the packaging of the documents.  If a 
book publisher's association wants to publish a vocabulary for describing 
books, authors, etc., I don't want them to have to think about the 
different fragments of book descriptions or catalog entries that I may 
wish to validate in my applications.  They should just publish a schema 
document to define their namespace and elements, and I should use the ones 
I need.  Not all applications of XML schema are document-oriented.
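
A rough sketch of that decomposition, in Python.  The sample address 
values are made up, and address_is_valid is a hypothetical hand-written 
check standing in for schema validation of the shippingAddress element 
(the Python standard library has no XML Schema validator).

```python
import xml.etree.ElementTree as ET

# Sample purchase order; the address values are invented for illustration.
PURCHASE_ORDER = """<purchaseOrder>
  <shippingAddress>
    <street>1 Main St</street>
    <city>Springfield</city>
    <state>MA</state>
    <zip>01101</zip>
  </shippingAddress>
</purchaseOrder>"""

EXPECTED_CHILDREN = ["street", "city", "state", "zip"]


def address_is_valid(address):
    """Stand-in for validating just the shippingAddress element."""
    child_names = [child.tag for child in address]
    return address.tag == "shippingAddress" and child_names == EXPECTED_CHILDREN


# The outer application hands over only the fragment it cares about.
address = ET.fromstring(PURCHASE_ORDER).find("shippingAddress")
```

The shipping application receives only the address element and never sees 
the enclosing purchaseOrder.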

Another very important scenario is taking that entire purchase order and 
wrapping it in a SOAP envelope (namespace decls skipped for brevity):

<soap:envelope>
        <soap:body>
                <po:purchaseOrder>
                        ...
                </po:purchaseOrder>
        </soap:body>
</soap:envelope>

Sometimes you want to validate the whole envelope including the purchase 
order.  Sometimes you don't validate the purchase order until it's been 
extracted and handed to some purchasing application.  So, sometimes 
purchaseOrder is the root, sometimes not.
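
To make the two cases concrete, here is a small Python sketch.  The 
namespace URIs are placeholders invented for the example, and the element 
names follow the spelling used in the message above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, assumed for this example only.
NS = {"soap": "urn:example:soap", "po": "urn:example:po"}

MESSAGE = """<soap:envelope xmlns:soap="urn:example:soap"
                            xmlns:po="urn:example:po">
  <soap:body>
    <po:purchaseOrder/>
  </soap:body>
</soap:envelope>"""

# Case 1: the envelope is the root of what gets validated.
envelope = ET.fromstring(MESSAGE)

# Case 2: the purchase order is extracted and handed on by itself,
# so it becomes the root from the purchasing application's viewpoint.
order = envelope.find("soap:body/po:purchaseOrder", NS)
```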

There are also editing scenarios in which an editor gathers the 
information for a document out of order.  While sooner or later the entire 
document may be validated or maybe not, it's very useful to be able to 
validate the fragments as they are gathered.  Similar scenarios come up in 
the design of languages like XML query, which assemble pieces of documents 
dynamically.  It's nice to be able to discuss the validity of those 
fragments in isolation, as well as in the context of an overall document. 

So I hope you can see that, while your scenarios involve a very strong 
notion of "document" and "root", not all do.  The question is whether to 
build a special mechanism to model that, and so far we've decided that 
it's reasonable on balance to leave such modeling outside of the language. 
Again, thank you for your comments, and I'm sure I speak for the Schema 
WG in saying that we take to heart your concerns that our current 
mechanisms don't exactly fit your needs.  Thank you.

Noah

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------

Erwin.Smout@ksz-bcss.fgov.be
Sent by: xmlschema-dev-request@w3.org
04/16/2003 07:30 AM

 
        To:     xmlschema-dev@w3.org
        cc: 
        Subject:        root element in schema



Hello,

Recently, I raised an issue here at work regarding global and root elements
in xml-schema.  Our xml-specialist did not have an answer immediately, but
later pointed me to a discussion about the subject :
http://lists.w3.org/Archives/Public/xmlschema-dev/2001Jun/0074.html.

I must say I didn't feel comfortable with some statements made there, and
thought I might add my point of view on the subject.



Mr. Mendelsohn states that someone might want to be able to have two
different elements as a root.  I really don't see how this could be a
necessity to anyone.  The root-element itself enables you to name the
schema that rules the xml-document.  It is perfectly possible to refer to a
BOOKLIST.XSD in a <BOOKLIST> root and refer to a BOOK.XSD in a <BOOK> root.
With proper include-mechanisms in place, there is little extra effort
involved in having these two different schemas, instead of only one that
allows different root-element-types.  So I can't really agree with him
there.  And I totally can't agree with what is said about "partial
validation".  This goes against everything xsd stands for.  I clearly
recall having read the guidelines saying that "a parser should stop passing
data from the moment it finds an error.  Furthermore, programs receiving an
error-message from a parser should consider all data they already parsed
from the document as non-existent".  This leads me to conclude that "valid
xml" (according to xsd) is (meant to be) an all-or-nothing proposition.
There is no such thing as "partially valid".  And the fact that some
programmer might want to do something like partial validation, is not a
good reason to "accept" this line of thinking.  Programmers have been
interpreting standards and guidelines in this fashion ("I will use what
comes to good use and ignore whatever I don't like") for as long as I
remember (unfortunately).  They have always been and will always stay the
main reason why so many efforts toward standardisation prove useless and
simply fail.

Think about it for a moment.  Two organisations (be it two companies, or a
company and the government, or two departments within a company, or
whatever ...) decide to exchange data about, let's say, "customers" in
xml-format.  They agree on a <customer> root-element which holds several
subordinate elements, <custnr> (mandatory), followed by either a
<legalperson> element, or a <naturalperson> element.  The <legalperson>
contains <name> and <legalform> elements, the <naturalperson> contains
<surname>, <firstname> and <initials> elements.  Now, in this example, if
one side sent an xml-form with only a <firstname>-element (and thus without
the customer number), then a validation process based on xsd would not mark
this form as "invalid", even though elements which were clearly intended
and declared to be mandatory (<custnr> e.g.), aren't there at all ?  Come
on guys, let's be serious for a moment.
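
The check the receiving party expects in this example can be sketched in 
Python as follows.  This is a hand-written illustration of the agreed 
structure, not XSD validation; customer_is_valid is a hypothetical name, 
and the contents of <legalperson> and <naturalperson> are not checked.

```python
import xml.etree.ElementTree as ET


def customer_is_valid(xml_text):
    """Accept only <customer> with <custnr>, then one kind of person."""
    root = ET.fromstring(xml_text)
    if root.tag != "customer":
        # A message whose root is anything else is rejected outright.
        return False
    children = [child.tag for child in root]
    return (
        len(children) == 2
        and children[0] == "custnr"
        and children[1] in ("legalperson", "naturalperson")
    )
```

Under this check a message consisting only of a <firstname> element is 
rejected, because its root is not <customer>.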

It would seem obvious to me that :
a) a receiving party cannot do anything with just the <firstname> element,
it will always need at least the customer number, before it is able to
perform whatever useful processing it could do with this message.
b) a receiving party would therefore expect its "validation process" to
mark this "<firstname>-only" message as "invalid", because it lacks
essential data.  Rightfully so.
c) If the receiving party cannot rely on xsd to do just that, then what
good is xsd anyway to anybody ?

I think this little example shows clearly enough that there is indeed a need
for being able to designate some element as being the root in xmlschema.



Now for how to achieve this ?  To do that, we need some information that
enables us to distinguish between an element that is "global", and which
element(s) is(are) actually present (or possibly present) in the xml
described by the schema.  In fact, these "global" elements apparently serve
the purpose of "declaring" the structure of some type of element, not
declaring the (possible) presence of such element in an xml-document.

Apparently, xsd now has two distinct meanings for the <element>-element :
1) as a declaration of a certain type that can be referred to later in the
schema.
2) as a declaration of the possible occurrence of such element in an
xml-document.

To my mind, this is flat out WRONG.  If two distinct sorts of information
are needed (here the "type-declaration" and the "xml-element-declaration"),
then they should have different names, or be recognisable as such in
whatever way is appropriate.  The xsd-syntax apparently does not allow
this.  There is no way to determine unambiguously what "meaning" has to be
assigned to an <element> in a schema.  I feel this is a major design error
in the xsd syntax, which should be removed as soon as possible.

Designers do have a way to avoid this problem (by using <simpleType> and
<complexType> for declarations, and using <element> for actual xml-element
description, assigning them type-information by "type=typeref"), but this
is no solution for someone writing a schema-validation process.  The
authors of schema validation processes cannot rely on the fact that every
schema-author will use this method.
Received on Friday, 18 April 2003 15:18:28 UTC