- From: Sandy Gao <sandygao@ca.ibm.com>
- Date: Fri, 22 Jul 2005 19:36:24 -0400
- To: Arthur Ryman <ryman@ca.ibm.com>
- Cc: David Orchard <dorchard@bea.com>, Roberto Chinnici <Roberto.Chinnici@Sun.COM>, "Rogers, Tony" <Tony.Rogers@ca.com>, www-ws-desc@w3.org, www-ws-desc-request@w3.org
- Message-ID: <OFE295F59A.02159E07-ON85257046.007E3345-85257046.0081ACF5@ca.ibm.com>
Well, the current schema spec doesn't say anything about pruning/ignoring
things. If there's something unexpected, the parent is invalid. (Yes, the
parent!)
Now, in the context of versioning / schema evolution (hoping schema 1.1 has
support for it), a few comments on Roberto's algorithm:
- It's very similar to one of the mechanisms proposed, where we insert
implicit (skip?) wildcard everywhere. Combining this with weak (weakened)
wildcard, you get exactly what's in the algorithm (ignore things not fit
in the content model). There are a few others that are related and worth
noting. For example, different approaches vary depending on how much you
know about the unexpected element: it's from a namespace you don't know
about? from a different namespace (that you may know)? from the same
namespace? from the same namespace and matches one of the global elements?
matches one of the elements in the current content model? The basic idea
is that the more you know about this element, the less tolerant you should
be with it. Many approaches would still treat the enclosed example as
invalid, because it looks a lot more like an invalid document than a
result of some schema evolution.
- A trivial fix
> <sequence>
> <element ref="ad:name" minOccurs="1" maxOccurs="unbounded"/>
> <element ref="nad:country" minOccurs="0"/>
> </sequence>
Turning this into a DFA ("x" means no transition):
state  final  "name"  "country"
0      N      1       x
1      Y      1       2
2      Y      x       x
> <ad:name>fred</ad:name>
> <nad:country>Australia</nad:country>
> <ad:name>bill</ad:name>
Following Roberto's algorithm, when we see <nad:country>, we are at state
1, and the allowed set is {ad:name, nad:country}, which means this element
is allowed. When we see the second <ad:name> ("bill"), we are at state
2, and the allowed set is empty. Following the algorithm, this second
<name> should be ignored, hence the document is treated as if
> <ad:name>fred</ad:name>
> <nad:country>Australia</nad:country>
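The trace above can be reproduced with a small sketch. It assumes the DFA exactly as tabulated and a plain reading of the "ignore unexpected" rule (discard the element and stay in the same state); the function name and the data layout are illustrative only and are not taken from any schema processor.

```python
# Sketch of the "ignore unexpected" rule run over the DFA tabulated above.
# State numbers and transitions are copied from that table; everything else
# (names, data layout) is an assumption made for illustration.

TRANSITIONS = {
    0: {"ad:name": 1},
    1: {"ad:name": 1, "nad:country": 2},
    2: {},  # no element is allowed once nad:country has been seen
}
FINAL_STATES = {1, 2}


def prune_unexpected(children, start=0):
    """Return (kept, discarded, accepted) after dropping unexpected elements."""
    state = start
    kept, discarded = [], []
    for qname, text in children:
        allowed = TRANSITIONS[state]
        if qname in allowed:
            state = allowed[qname]
            kept.append((qname, text))
        else:
            # Not in the allowed set: drop it and stay in the same state.
            discarded.append((qname, text))
    return kept, discarded, state in FINAL_STATES


doc = [("ad:name", "fred"), ("nad:country", "Australia"), ("ad:name", "bill")]
kept, discarded, accepted = prune_unexpected(doc)
print(kept)       # the elements that survive pruning
print(discarded)  # the elements that were ignored
print(accepted)   # whether the DFA ended in a final state
```

With this input the sketch keeps "fred" and "Australia" and discards "bill", ending in final state 2.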
- Current schema spec isn't in support of any validity=notKnown approach
Should have signalled this the last time I responded on this issue. My
apologies...
Even though most people think about content validation as matching a state
machine (which is what most/many validators do), what the spec actually says is
that it's done in 2 steps:
a. content validation for the containing element: to match the sequence of
element qnames (from the instance) with the complex type's particle. The
result will be whether the containing element is valid in terms of its
content. A side effect of this process is the association between sub
elements and declarations that will be used to validate them.
b. use the declarations from the previous step to validate the sub
elements.
Strictly following this, if you have an unexpected element in the content,
step (a) would fail, the parent is flagged as invalid, and there is *NO*
association between sub elements and declarations! (In schema terms, there
are no *context-determined* declarations for the sub elements.)
So what *should* be produced in PSVI for the above example per schema 1.0
rules is that the parent <shipTo> is marked invalid, while the
sub-elements are validated using the global declarations (because there is
no context determined declarations, they are treated in the same way as
for the root element). So [validity] for all the sub-elements will be
"valid" as opposed to "invalid" or "notKnown".
Again, we are hoping to fix/improve this in schema 1.1. One possibility is
to say that the association goes as far as the state machine matching ...
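The two-step reading of schema 1.0 described above can be sketched roughly as follows. This is only an illustration: it reuses the DFA from the example, assumes that both ad:name and nad:country have global declarations, and assumes each child's own content is valid against whichever declaration is found. The names match_content, assess and GLOBAL_DECLS are hypothetical.

```python
# Rough sketch of the two-step assessment, not a real validator.
TRANSITIONS = {
    0: {"ad:name": 1},
    1: {"ad:name": 1, "nad:country": 2},
    2: {},
}
FINAL_STATES = {1, 2}
GLOBAL_DECLS = {"ad:name", "nad:country"}  # assumed global element declarations


def match_content(qnames):
    """Step (a): does the whole sequence of child qnames match the particle?"""
    state = 0
    for qname in qnames:
        if qname not in TRANSITIONS[state]:
            return False  # one unexpected element fails the whole match
        state = TRANSITIONS[state][qname]
    return state in FINAL_STATES


def assess(qnames):
    """Step (b): choose a declaration for each child and report validity."""
    content_ok = match_content(qnames)
    parent = "valid" if content_ok else "invalid"
    children = []
    for qname in qnames:
        if content_ok:
            source = "context-determined"
        elif qname in GLOBAL_DECLS:
            source = "global"  # fallback, as for a root element
        else:
            source = None      # no declaration found at all
        validity = "notKnown" if source is None else "valid"
        children.append((qname, source, validity))
    return parent, children


parent, children = assess(["ad:name", "nad:country", "ad:name"])
print(parent)
for child in children:
    print(child)
```

For the example document, the sketch reports the parent as invalid and every child as valid via its global declaration, which is the outcome described above.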
Thanks,
Sandy Gao
XML Parser Development, IBM Canada
(1-905) 413-3255
sandygao@ca.ibm.com
Arthur Ryman/Toronto/IBM
07/22/2005 02:35 PM
To
Roberto Chinnici <Roberto.Chinnici@Sun.COM>, Sandy Gao/Toronto/IBM@IBMCA
cc
David Orchard <dorchard@bea.com>, "Rogers, Tony" <Tony.Rogers@ca.com>,
www-ws-desc@w3.org, www-ws-desc-request@w3.org
Subject
Re: LC124: Comment on V2S and [validity]=notKnown
Roberto,
That seems too good to be true. I'd need to plunge back into the schema
spec for another day to be confident.
Sandy, what's your opinion? Can we prune out unexpected content this way?
Arthur Ryman,
Rational Desktop Tools Development
phone: +1-905-413-3077, TL 969-3077
assistant: +1-905-413-2411, TL 969-2411
fax: +1-905-413-4920, TL 969-4920
mobile: +1-416-939-5063, text: 4169395063@fido.ca
intranet: http://labweb.torolab.ibm.com/DRY6/
Roberto Chinnici <Roberto.Chinnici@Sun.COM>
Sent by: www-ws-desc-request@w3.org
07/20/2005 09:19 PM
To
"Rogers, Tony" <Tony.Rogers@ca.com>
cc
David Orchard <dorchard@bea.com>, www-ws-desc@w3.org
Subject
Re: LC124: Comment on V2S and [validity]=notKnown
Rogers, Tony wrote:
> One of the "interesting" aspects of the problem that we must solve is
> how we decide on the interpretation of ambiguous results.
>
> For example, it will be legal to take your example:
>
>
> <type name="shipto">
>
> <sequence>
>
> <element ref="ad:name" minOccurs="1" maxOccurs="unbounded"/>
>
> <element ref="nad:country" minOccurs="0"/>
>
> </sequence>
>
> </type>
>
>
>
> (yes, I meant to change that to minOccurs)
>
>
>
> and feed it data like:
>
>
>
> <shipto>
>
> <ad:name>fred</ad:name>
>
> <nad:country>Australia</nad:country>
>
> <ad:name>bill</ad:name>
>
> </shipto>
>
>
>
> which can legitimately be interpreted (after ignorance has been applied)
> as:
>
>
>
> <shipto>
>
> <ad:name>fred</ad:name>
>
> <ad:name>bill</ad:name>
>
> </shipto>
>
>
>
> OR
>
>
>
> <shipto>
>
> <ad:name>fred</ad:name>
>
> <nad:country>Australia</nad:country>
>
> </shipto>
>
>
>
> The latter is my expected interpretation (and may well be the easier to
> program), but the former is legitimate (it takes the approach of
> grabbing as many ad:name elements as it can, and it still satisfies the
> schema).
>
>
>
> What do other people think?
I tend to go with the first interpretation.
Here's how I'd define the "ignore unexpected" rule. This definition is
not phrased directly in terms of XML Schema, and I don't claim that it
would be trivial to do so, quite the contrary. Nevertheless, it seems
compatible with it; if anybody thinks otherwise, please point out where
I'm wrong.
That scourge of all schema authors, the UPA rule, was introduced to
make sure the schema was deterministic. I assume then that at any
given stage during the parsing of the contents of an element, the
set of start tags that can legally be encountered is determined
and each tag in that set is associated with exactly one transition
to a new state. (I believe we can safely ignore character content
for the purposes of our discussion.)
Note that the set above, or better the set of names of all start tags
that can be encountered at any given state, may be infinite due to
the presence of a wildcard. This doesn't cause any problems -- all
we need is that the characteristic function of this set be computable.
Off the top of my head, I don't think that substitution groups would
be an issue either, they just make the construction of the set more
complex, nor would xsi:schemaLocation.
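To illustrate the point about an infinite set with a computable characteristic function, here is a minimal sketch. It represents the allowed set at a state as a membership predicate rather than as an enumerated list. The wildcard handling is loosely modeled on a "##other"-style wildcard and is an assumption, as are the names make_allowed and is_allowed and the example namespace URIs.

```python
from typing import Callable, FrozenSet, Optional, Tuple

QName = Tuple[str, str]  # (namespace URI, local name)


def make_allowed(
    explicit: FrozenSet[QName],
    other_than_ns: Optional[str] = None,
) -> Callable[[QName], bool]:
    """Build the characteristic function of the allowed set for one state.

    `explicit` lists the element names that are declared directly.
    `other_than_ns`, if given, adds a wildcard that accepts any qualified
    name whose namespace differs from that URI (an assumed simplification).
    """

    def is_allowed(qname: QName) -> bool:
        if qname in explicit:
            return True
        if other_than_ns is not None:
            ns, _local = qname
            return ns != "" and ns != other_than_ns
        return False

    return is_allowed


AD = "urn:example:ad"  # hypothetical namespace URI
is_allowed = make_allowed(frozenset({(AD, "name")}), other_than_ns=AD)

print(is_allowed((AD, "name")))                     # True: explicitly declared
print(is_allowed(("urn:example:ext", "anything")))  # True: matched by the wildcard
print(is_allowed((AD, "country")))                  # False: same namespace, not declared
```

The set accepted by is_allowed is infinite, yet membership is decided in constant time, which is all the rule needs.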
Now, the "ignore unexpected" rule is defined as saying that if at
a given state the processor encounters a start tag for an element
whose name is not in the set of expected start tags for that state,
the element is discarded. Subsequently, the processor keeps
operating in the same state it was in (where would it transition
to otherwise?), as if the discarded element had never been there.
Surely there are a few more tweaks that we need to do, like requiring
some special treatment for the root element of a document and
dealing with attributes, but I hope that the definition I proposed
is clear enough.
If we apply it to the example then, we obtain that
<shipto>
<ad:name>fred</ad:name>
<nad:country>Australia</nad:country>
<ad:name>bill</ad:name>
</shipto>
will be treated as
<shipto>
<ad:name>fred</ad:name>
<ad:name>bill</ad:name>
</shipto>
Let's look at a slightly more interesting example.
Assume the following schema: (note the maxOccurs="2")
<type name="shipto">
<sequence>
<element ref="ad:name" minOccurs="1" maxOccurs="2"/>
<element ref="nad:country" minOccurs="0"/>
</sequence>
</type>
Then this document:
<shipto>
<ad:name>fred</ad:name>
<nad:country>Australia</nad:country>
<ad:name>bill</ad:name>
<nad:country>New Zealand</nad:country>
<ad:name>jim</ad:name>
</shipto>
will be treated as:
<shipto>
<ad:name>fred</ad:name>
<ad:name>bill</ad:name>
</shipto>
Thanks,
Roberto
Received on Friday, 22 July 2005 23:36:50 UTC