Re: UPA — (Why) Is There a Difference Between Those Two?

On Jun 23, 2011, at 2:25 PM, Denis Zawada wrote:

> Hello, 
> 
> I wanted to make sure that I understand this rule correctly. 
> 
> Foo defined in the following way will result in an error in both
> MSXML and libxml2:
> 
>  <xs:element name="non-valid-foo">
>    <xs:complexType>
>      <xs:sequence>
>        <xs:sequence>
>          <xs:element name="bar" minOccurs="2" maxOccurs="5"/>
>          <xs:element name="xyz" minOccurs="0"/>
>        </xs:sequence>
>        <xs:sequence>
>          <!-- bar is ambiguous -->
>          <xs:element name="bar" minOccurs="2" maxOccurs="5"/>
>          <xs:element name="xyz" minOccurs="0"/>
>        </xs:sequence>
>      </xs:sequence>
>    </xs:complexType>
>  </xs:element>
> 
> However, both parsers have no problem with foo defined in the following way:
> 
>  <xs:element name="valid-foo">
>    <xs:complexType>
>      <xs:sequence minOccurs="2" maxOccurs="2">
>        <xs:element name="bar" minOccurs="2" maxOccurs="5"/>
>        <xs:element name="xyz" minOccurs="0"/>
>      </xs:sequence>
>    </xs:complexType>
>  </xs:element>

Both processors are correctly enforcing the UPA constraint.

> 
> I understand that XML Schema only mentions that *two* adjacent particles can 
> overlap:
> 
>> A content model will violate the unique attribution constraint if it 
>> contains *two* particles which ·overlap· and which either (…)

I'm not sure I understand what you're saying here.  (Specifically, I
don't understand the implications of your word "only", and I note
that the passage you quote from the spec says nothing about
adjacency of the particles.)

In the case of the type of non-valid-foo, the two particles which
compete are the two particles which define the two element types
named 'bar'.
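
For concreteness, consider a (hypothetical) instance like the following.
After two bar elements have been matched, a processor reading the third
bar cannot tell, without lookahead, whether it is a third occurrence of
the first 'bar' particle or (since xyz is optional) a first occurrence
of the second:

  <non-valid-foo>
    <bar/>
    <bar/>
    <bar/>  <!-- third occurrence of the first 'bar' particle,
                 or first occurrence of the second? -->
    <bar/>
  </non-valid-foo>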

> 
> On the other hand, I could easily imagine that it would be possible to convert 
> the 1st form into the 2nd one in a preprocessing step during compilation of 
> schema. 

Agreed.  It's clearly possible to have tools that will translate some 
content models which violate the UPA constraint into models that
do not violate it.  It is known from work by Anne Brüggemann-Klein 
and Derick Wood, however, that not all regular languages on elements 
have content models which obey the UPA.  I have the impression that
tools for translating content models into UPA-compliant content models
are not widely available, perhaps because it's impossible to guarantee
that they will always succeed.
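
(The standard counter-example, if I remember it correctly, is the
language (a|b)* a (a|b) -- sequences of a and b elements in which the
next-to-last element is an a.  It is regular, but no UPA-compliant
content model accepts it; a direct rendering in XSD, sketched below,
violates the UPA because the required 'a' competes with the 'a' inside
the repeated choice:

  <xs:sequence>
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="a"/>
      <xs:element name="b"/>
    </xs:choice>
    <xs:element name="a"/>  <!-- competes with the 'a' above -->
    <xs:choice>
      <xs:element name="a"/>
      <xs:element name="b"/>
    </xs:choice>
  </xs:sequence>

No amount of rewriting removes the competition, because the language
itself is not deterministic in the required sense.)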

> 
> Is my understanding of this principle correct? I.e. if particles are implicit,
> is certain ambiguity allowed? Why does the first example validate differently
> from the second one?

I'm not sure what you mean by 'implicit' here.

The first example violates the UPA constraint because it contains two
element particles (represented in the XML by the first and third 
xs:element elements) which compete with each other.  The second
example does not violate the UPA constraint because it does not 
contain two particles which compete.  (It contains no wildcard particles,
and it contains only two element particles; they match disjoint sets
of elements*, and do not compete.)

* They match disjoint sets of elements, that is, unless their substitution
groups match overlapping sets of names.
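
To illustrate the caveat with a (hypothetical) fragment: if 'member' is
declared substitutable for 'head', then in the content model below an
element named 'member' could be attributed to either particle, and the
UPA is violated even though the two particles bear different names:

  <xs:element name="head"/>
  <xs:element name="member" substitutionGroup="head"/>

  <xs:sequence>
    <xs:element ref="head" minOccurs="0"/>  <!-- matches head or member -->
    <xs:element ref="member"/>
  </xs:sequence>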

This is true despite the fact that the third 'bar' element in a sequence
exposes a non-determinism in the content model:  it could increment
either the inner or the outer counter.  Brüggemann-Klein and Wood
follow earlier work in distinguishing 'weak' determinism and 'strong'
determinism (or 1-non-ambiguity).  In SGML and in XML DTDs, there is
no practical difference between the two (although SGML explicitly
specifies that it is the inner counter which is incremented, not the
outer counter); in XSD the difference becomes relevant with the
introduction of integer-valued occurrence indicators, but I don't believe
there is any record of the Working Group making a conscious choice
between enforcing weak determinism and strong determinism, or even
being aware of the difference.

I lean toward the belief that we fell into the choice of weak
determinism because of an accident of the language used to express the
constraint.  When the WG became aware of the issue, my recollection is
that there was consensus (or something very close to consensus) on the
view that it would have been better to require strong determinism in
content models, or to prescribe a rule analogous to SGML's rule to
force the inner counters to be incremented first.  But a significant
part of the WG felt that it was too late to fix the error, because any
change would create backward-compatibility issues.  So the WG as a
whole did not have consensus in favor of any change.
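
To make the counter non-determinism concrete: in a (hypothetical)
instance of valid-foo like the one below, each bar is unambiguously
attributed to the single 'bar' particle, but at the third bar a
processor may either increment the inner counter (a third bar within
the first iteration of the sequence) or increment the outer counter
(the first bar of the second iteration):

  <valid-foo>
    <bar/>
    <bar/>
    <bar/>  <!-- inner counter (bar's maxOccurs="5") or
                 outer counter (the sequence's maxOccurs="2")? -->
    <bar/>
    <bar/>
  </valid-foo>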

So at one level the answer to the question "Why is there a difference 
between the two?" is "because the rules require a deterministic 
choice of particle, not a fully deterministic automaton".

At another level, the answer is "because the WG failed to do its
homework fully and did not recognize that it was faced with a
design choice in this area".

I hope this helps.



-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************
