RE: Ambiguous content models -- allowed or disallowed by XSDL? from Lehmann, Steen on 2002-06-13 (xmlschema-dev@w3.org from June 2002)

From: Lehmann, Steen <slehmann@silverstream.com>
Date: Thu, 13 Jun 2002 12:36:18 +0200
To: "'Jeni Tennison'" <jeni@jenitennison.com>
Cc: xmlschema-dev@w3.org
Message-ID: <FD8A740186525243807FBCDB7E739D0585F3A4@EXBE1.silverstream.be>

>   <xs:sequence minOccurs="2" maxOccurs="2">
>     <xs:element name="a" minOccurs="1" maxOccurs="2" />
>     <xs:element name="b" minOccurs="0" />
>   </xs:sequence>
> 
> which fulfils the unique attribution constraint since there is only
> one particle for each of the two elements a and b, but is ambiguous
> because if you have:
> 
>   <a /><a /><b />
> 
> then you don't know whether you've got to the end of the content model
> (the first a comes from the first occurrence of the sequence, the
> remainder from the second occurrence of that sequence) or if you're
> still within the first occurrence of the sequence.
> 
> For my education, could you explain (or point me to something that
> explains) how parsers manage to accept the sequence a, a, b without
> backtracking?

A parser I've created translates each complexType into a non-deterministic
graph representation (an "NFA"), which is then turned into a deterministic
version (a "DFA") by a standard, well-known algorithm. I suspect this may be
overkill because of the UPA, but was kind of fun to do (I don't get out much
:).

The NFA created from the innards of the <sequence> above would have five
states with transitions between them, each transition would be labelled a, b
or (empty), meaning that when the parser is in a state it will change to
another state if there is a transition between the two states matching the
current symbol it is trying to parse (or if there is a transition for
(empty) between the two states). It may look like this, numbering the states
and labelling empty transitions E. Computation starts in state 0 and is
successful if it ends in state 4.

0 --a--> 1
1 --a--> 2
1 --b--> 3
1 --E--> 3
2 --b--> 3
2 --E--> 4
3 --b--> 3
3 --E--> 4

To also include the containing <sequence> simply repeat the whole graph
twice, copying every transition out of state 0 to state 4, resulting in
states 5 to 8 (copies of 1 to 4)

Now, to turn this NFA into a corresponding DFA we "collapse" the empty
transitions from each state and copy transitions from states we can reach by
following empty transitions back to the original state (known as evaluating
lambda closure). The resulting DFA has no empty transitions, so at each
state we know immediately if a parsed symbol is legal (there is a transition
somewhere for that symbol) without a need for backtracking. 

Here it is:
0 --a--> 1
1 --a--> 3
1 --b--> 2
2 --a--> 4
2 --b--> 2
3 --a--> 4
3 --b--> 2
4 --a--> 6
4 --b--> 5
5 --b--> 5
6 --b--> 5

The parsed input is correct if evaluation terminates in state 5 or 6. Hope
this was educational. I'm sure there are more efficient ways of doing it,
but the theorietical basis for this approach is well-understood and fairly
easy to apply.

Cheers,

-- Steen

/**
 * Steen Lehmann - <mailto:slehmann@silverstream.com>
 * Senior Software Engineer (R&D), SilverStream Software 
 * <http://www.silverstream.com>
 */

Received on Thursday, 13 June 2002 06:36:52 UTC