Re: LC124: Comment on V2S and [validity]=notKnown from Sandy Gao on 2005-07-22 (www-ws-desc@w3.org from July 2005)

From: Sandy Gao <sandygao@ca.ibm.com>
Date: Fri, 22 Jul 2005 19:36:24 -0400
To: Arthur Ryman <ryman@ca.ibm.com>
Cc: David Orchard <dorchard@bea.com>, Roberto Chinnici <Roberto.Chinnici@Sun.COM>, "Rogers, Tony" <Tony.Rogers@ca.com>, www-ws-desc@w3.org, www-ws-desc-request@w3.org
Message-ID: <OFE295F59A.02159E07-ON85257046.007E3345-85257046.0081ACF5@ca.ibm.com>
Well, the current schema spec doesn't say anything about pruning/ignoring 
things. If there's something unexpected, the parent is invalid. (Yes, the 
parent!)

Now in the context of versioning / schema evolution (hoping schema 1.1 has 
support for it), a few comments on Roberto's algorithm:

- It's very similar to one of the mechanisms proposed, where we insert 
implicit (skip?) wildcard everywhere. Combining this with weak (weakened) 
wildcard, you get exactly what's in the algorithm (ignore things not fit 
in the content model). There are a few others that are related and worth 
noting. For example, different approaches vary depending on how much you 
know about the unexpected element: it's from a namespace you don't know 
about? from a different namespace (that you may know)? from the same 
namespace? from the same namespace and matches one of the global elements? 
matches one of the elements in the current content model? The basic idea 
is that the more you know about this element, the less tolerant you should 
be with it. Many approaches would still treat the enclosed example as 
invalid, because it looks a lot more like an invalid document than a 
result of some schema evolution.

- A trivial fix

> <sequence>
>   <element ref="ad:name" minOccurs="1" maxOccurs="unbounded"/>
>   <element ref="nad:country" minOccurs="0"/>
> </sequence>

Turning this to a DFA:

state  final  "name"   "country"
0      N       1        x
1      Y       1        2
2      Y       x        x

> <ad:name>fred</ad:name>
> <nad:country>Australia</nad:country>
> <ad:name>bill</ad:name>

Following Roberto's algorithm, when we see <nad:country>, we are at state 
1, and the allowed set is {ad:name, nad:country}, which means this element 
is allowed. When we see the second <ad:name> ("bill"), we are at state 
2, and the allowed set is empty. Following the algorithm, this second 
<name> should be ignored, hence the document is treated as if

> <ad:name>fred</ad:name>
> <nad:country>Australia</nad:country>
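
The walk-through above can be sketched in a few lines of Python. This is
only an illustration under assumptions: the table is the DFA shown above,
the function name prune_unexpected is made up here, and the "ignore" step
is one reading of Roberto's rule (drop the element, stay in the same
state). It is not taken from any validator.

```python
# Transition table for the DFA above: state -> {element name -> next state}.
# A missing entry corresponds to "x" in the table (no transition).
DFA = {
    0: {"name": 1},
    1: {"name": 1, "country": 2},
    2: {},
}
FINAL_STATES = {1, 2}


def prune_unexpected(children, dfa=DFA, start=0):
    """Hypothetical helper: drop any child that has no transition from the
    current state, and stay in that state (the "ignore unexpected" rule)."""
    state = start
    kept = []
    for name, text in children:
        next_state = dfa[state].get(name)
        if next_state is None:
            continue  # unexpected in this state: discard, state unchanged
        kept.append((name, text))
        state = next_state
    return kept, state in FINAL_STATES


doc = [("name", "fred"), ("country", "Australia"), ("name", "bill")]
kept, accepted = prune_unexpected(doc)
print(kept, accepted)
```

With this table the second "name" is dropped because state 2 has no
outgoing transitions, so what is kept is "fred" followed by "Australia".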

- Current schema spec isn't in support of any validity=notKnown approach

Should have signalled this the last time I responded on this issue. My 
apologies...

Even though most people think about content validation as matching a state 
machine (which is what most/many validators do), what the spec actually 
says is that it's done in 2 steps:
a. content validation for the containing element: to match the sequence of 
element qnames (from the instance) with the complex type's particle. The 
result will be whether the containing element is valid in terms of its 
content. A side effect of this process is the association between sub 
elements and declarations that will be used to validate them.
b. use the declarations from the previous step to validate the sub 
elements.

Strictly following this, if you have an unexpected element in the content, 
step (a) would fail, the parent is flagged as invalid, and there is *NO* 
association between sub elements and declarations! (In schema terms, there 
are no *context-determined* declarations for the sub elements.)

So what *should* be produced in PSVI for the above example per schema 1.0 
rules is that the parent <shipTo> is marked invalid, while the 
sub-elements are validated using the global declarations (because there are 
no context-determined declarations, they are treated in the same way as 
for the root element). So [validity] for all the sub-elements will be 
"valid" as opposed to "invalid" or "notKnown".
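
A toy model of the two steps, to make the outcome concrete. Everything
here is an assumption for illustration: the names match_content and
assess are invented, both elements are assumed to have global
declarations, step (b) is reduced to "a declaration was found", and the
"notKnown" fallback for an undeclared element is a simplification of lax
assessment. A real validator would also check each child against its type.

```python
# DFA for the <sequence> of name+ followed by an optional country.
DFA = {0: {"name": 1}, 1: {"name": 1, "country": 2}, 2: {}}
FINAL_STATES = {1, 2}
GLOBAL_DECLS = {"name", "country"}  # assumed: both are declared globally


def match_content(names):
    """Step (a): all-or-nothing match of the child names against the
    content model. Returns one declaration per child, or None on failure."""
    state = 0
    decls = []
    for name in names:
        next_state = DFA[state].get(name)
        if next_state is None:
            return None  # unexpected element: no associations at all
        decls.append(("context", name))
        state = next_state
    return decls if state in FINAL_STATES else None


def assess(names):
    """Return (parent validity, list of child validities)."""
    decls = match_content(names)
    parent = "valid" if decls is not None else "invalid"
    if decls is None:
        # No context-determined declarations: fall back to the global
        # declarations, the same way a root element would be handled.
        decls = [("global", n) if n in GLOBAL_DECLS else None for n in names]
    # Step (b), simplified: a child counts as valid once it has a declaration.
    children = ["valid" if d is not None else "notKnown" for d in decls]
    return parent, children


print(assess(["name", "country", "name"]))
```

For the example document this reports the parent as invalid while each
child, assessed against its global declaration, comes out valid.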

Again, we are hoping to fix/improve this in schema 1.1. One possibility is 
to say that the association goes as far as the state machine matching ...

Thanks,
Sandy Gao
XML Parser Development, IBM Canada
(1-905) 413-3255
sandygao@ca.ibm.com




Arthur Ryman/Toronto/IBM
07/22/2005 02:35 PM

To
Roberto Chinnici <Roberto.Chinnici@Sun.COM>, Sandy Gao/Toronto/IBM@IBMCA
cc
David Orchard <dorchard@bea.com>, "Rogers, Tony" <Tony.Rogers@ca.com>, 
www-ws-desc@w3.org, www-ws-desc-request@w3.org
Subject
Re: LC124: Comment on V2S and [validity]=notKnown





Roberto,

That seems too good to be true. I'd need to plunge back into the schema 
spec for another day to be confident.

Sandy, what's your opinion? Can we prune out unexpected content this way?

Arthur Ryman,
Rational Desktop Tools Development

phone: +1-905-413-3077, TL 969-3077
assistant: +1-905-413-2411, TL 969-2411
fax: +1-905-413-4920, TL 969-4920
mobile: +1-416-939-5063, text: 4169395063@fido.ca
intranet: http://labweb.torolab.ibm.com/DRY6/



Roberto Chinnici <Roberto.Chinnici@Sun.COM> 
Sent by: www-ws-desc-request@w3.org
07/20/2005 09:19 PM

To
"Rogers, Tony" <Tony.Rogers@ca.com>
cc
David Orchard <dorchard@bea.com>, www-ws-desc@w3.org
Subject
Re: LC124: Comment on V2S and [validity]=notKnown







Rogers, Tony wrote:
> One of the "interesting" aspects of the problem that we must solve is 
> how we decide on the interpretation of ambiguous results.
> 
> For example, it will be legal to take your example:
> 
> 
> <type name="shipto">
> 
> <sequence>
> 
> <element ref="ad:name" minOccurs="1" maxOccurs="unbounded"/>
> 
> <element ref="nad:country" minOccurs="0"/>
> 
> </sequence>
> 
> </type>
> 
> 
> 
> (yes, I meant to change that to minOccurs)
> 
> 
> 
> and feed it data like:
> 
> 
> 
> <shipto>
> 
> <ad:name>fred</ad:name>
> 
> <nad:country>Australia</nad:country>
> 
> <ad:name>bill</ad:name>
> 
> </shipto>
> 
> 
> 
> which can legitimately be interpreted (after ignorance has been applied) as:
> 
> 
> 
> <shipto>
> 
> <ad:name>fred</ad:name>
> 
> <ad:name>bill</ad:name>
> 
> </shipto>
> 
> 
> 
> OR
> 
> 
> 
> <shipto>
> 
> <ad:name>fred</ad:name>
> 
> <nad:country>Australia</nad:country>
> 
> </shipto>
> 
> 
> 
> The latter is my expected interpretation (and may well be the easier to 
> program), but the former is legitimate (it takes the approach of 
> grabbing as many ad:name elements as it can, and it still satisfies the 
> schema).
> 
> 
> 
> What do other people think?

I tend to go with the first interpretation.

Here's how I'd define the "ignore unexpected" rule. This definition is
not phrased directly in terms of XML Schema, and I don't claim that it
would be trivial to do so, quite the contrary. Nevertheless, it seems
compatible with it; if anybody thinks otherwise, please point out where
I'm wrong.

That scourge of all schema authors, the UPA rule, was introduced to
make sure the schema was deterministic. I assume then that at any
given stage during the parsing of the contents of an element, the
set of start tags that can legally be encountered is determined
and each tag in that set is associated with exactly one transition
to a new state. (I believe we can safely ignore character content
for the purposes of our discussion.)

Note that the set above, or better the set of names of all start tags
that can be encountered at any given state, may be infinite due to
the presence of a wildcard. This doesn't cause any problems -- all
we need is that the characteristic function of this set be computable.
Off the top of my head, I don't think that substitution groups would
be an issue either, they just make the construction of the set more
complex, nor would xsi:schemaLocation.
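
To make the "computable characteristic function" point concrete, here is
a minimal sketch. It assumes names are (namespace URI, local name) pairs;
the Wildcard class, the make_expected helper and the urn: namespaces are
all hypothetical, and the wildcard model is much cruder than what XML
Schema actually defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wildcard:
    """Hypothetical stand-in for a namespace-constrained wildcard."""
    namespaces: frozenset = frozenset()  # namespace URIs the wildcard admits
    any_namespace: bool = False          # True models an "any namespace" wildcard


def make_expected(names, wildcards=()):
    """Build the characteristic function of the expected set for one state.

    The set itself may be infinite (because of wildcards), but membership
    can still be decided for any given name."""
    explicit = frozenset(names)

    def expected(qname):
        namespace, _local = qname
        if qname in explicit:
            return True
        return any(w.any_namespace or namespace in w.namespaces
                   for w in wildcards)

    return expected


expected = make_expected(
    {("urn:ad", "name")},
    [Wildcard(namespaces=frozenset({"urn:ext"}))],
)
print(expected(("urn:ad", "name")))
print(expected(("urn:ext", "anything")))
print(expected(("urn:other", "thing")))
```

The point of the closure is that the processor never enumerates the set;
it only asks whether the name in hand belongs to it.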

Now, the "ignore unexpected" rule is defined as saying that if at
a given state the processor encounters a start tag for an element
whose name is not in the set of expected start tags for that state,
the element is discarded. Subsequently, the processor keeps
operating in the same state it was in (where would it transition
to otherwise?), as if the discarded element had never been there.

Surely there are a few more tweaks that we need to do, like requiring
some special treatment for the root element of a document and
dealing with attributes, but I hope that the definition I proposed
is clear enough.

If we apply it to the example then, we obtain that

<shipto>
   <ad:name>fred</ad:name>
   <nad:country>Australia</nad:country>
   <ad:name>bill</ad:name>
</shipto>

will be treated as

<shipto>
   <ad:name>fred</ad:name>
   <ad:name>bill</ad:name>
</shipto>

Let's look at a slightly more interesting example.
Assume the following schema: (note the maxOccurs="2")

<type name="shipto">
   <sequence>
     <element ref="ad:name" minOccurs="1" maxOccurs="2"/>
     <element ref="nad:country" minOccurs="0"/>
   </sequence>
</type>

Then this document:

<shipto>
   <ad:name>fred</ad:name>
   <nad:country>Australia</nad:country>
   <ad:name>bill</ad:name>
   <nad:country>New Zealand</nad:country>
   <ad:name>jim</ad:name>
</shipto>

will be treated as:

<shipto>
   <ad:name>fred</ad:name>
   <ad:name>bill</ad:name>
</shipto>

Thanks,
Roberto
Received on Friday, 22 July 2005 23:36:50 UTC