Re: Looking for help. Implementing a XML validation engine in JavaScript. from C. M. Sperberg-McQueen on 2010-08-11 (xmlschema-dev@w3.org from August 2010)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Tue, 10 Aug 2010 19:40:20 -0600
To: Casey Jordan <casey.jordan@jorsek.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, xmlschema-dev@w3.org
Message-Id: <7E8B5DD9-3AF4-4445-BE17-39C8501C1F0C@blackmesatech.com>

On 10 Aug 2010, at 11:16 , Casey Jordan wrote:
>
>
> Here is the main problem, the greedy algorithm I developed does not  
> work for patterns like this:
>
> <sequence>
>     <element ref="e1"/>
>     <choice minOccurs="2">
>         <element ref="e2" maxOccurs="2"/>
>         <element ref="e3"/>
>     </choice>
>     <element ref="e4"/>
> </sequence>
>
> This can be seen as a regular expression e1( e2{1,2} | e3 ){2,2} e4.  
> My greedy algorithm cannot validate this correctly.
>
> Assuming input like <e1/><e2/><e2/><e4/>
>
> My algorithm will match all the way up to the <e4/> in the first  
> pass through the choice: ( e2{1,2} | e3 ){2,2} because it does not  
> realize that both <e2/> elements should be treated individually as  
> matches for the choice, and not a single match for the choice.  
> Validation then fails because it is expecting one more of <e2/> or  
> <e3/>. There is not an easy way to build this into my current  
> algorithm.

There is a good, solid discussion and solution of this issue
in the paper "Towards efficient implementation of XML schema
content models", by Pekka Kilpeläinen and Rauno Tuhkanen,
in the proceedings of Document Engineering 2004. See
http://portal.acm.org/citation.cfm?doid=1030397.1030441
for more bibliographic information and a link to the full
text (for subscribers to the ACM digital library).

It may be worth while to note that the problem you have
encountered results from a blunder on the part of the XSD working
group, which failed to notice, when introducing numeric ranges
for occurrence indicators, that the difference between
weak determinism and strong determinism was now important
and visible in the language, and thus failed to reach any
considered decision on whether to require weak determinism
or strong determinism.  As the rule known as the 'Unique Particle
Attribution' constraint shows, the WG ended up with weak
determinism, but this was not a conscious choice. I wish the
WG had had the guts to fix this design error in 1.1, but we
did not.

Instead, each implementation has found a different way of
fudging the issue, with the result that schemas which exhibit
the problem (i.e. content models which are weakly deterministic
but not strongly deterministic) are unambiguously legal and have a
well defined interpretation, but relatively few implementations
actually implement what the spec says, and no two of them are
likely to implement exactly the same thing.  The schema
author thus gets the best of both worlds:  different implementations
will do different things with the schema, but since the schema
is legal, it's unlikely that the schema author will be
alerted to the difficulty.

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************

Received on Wednesday, 11 August 2010 01:39:08 UTC