- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Tue, 10 Aug 2010 19:40:20 -0600
- To: Casey Jordan <casey.jordan@jorsek.com>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, xmlschema-dev@w3.org
On 10 Aug 2010, at 11:16 , Casey Jordan wrote:

> Here is the main problem: the greedy algorithm I developed does not
> work for patterns like this:
>
>   <sequence>
>     <element ref="e1"/>
>     <choice minOccurs="2">
>       <element ref="e2" maxOccurs="2"/>
>       <element ref="e3"/>
>     </choice>
>     <element ref="e4"/>
>   </sequence>
>
> This can be seen as a regular expression e1( e2{1,2} | e3 ){2,2} e4.
> My greedy algorithm cannot validate this correctly.
>
> Assuming input like <e1/><e2/><e2/><e4/>
>
> My algorithm will match all the way up to the <e4/> in the first
> pass through the choice ( e2{1,2} | e3 ){2,2}, because it does not
> realize that the two <e2/> elements should be treated as two
> separate matches for the choice, not as a single match. Validation
> then fails because it is expecting one more <e2/> or <e3/>. There is
> not an easy way to build this into my current algorithm.

There is a good, solid discussion and solution of this issue in the
paper "Towards efficient implementation of XML schema content models"
by Pekka Kilpeläinen and Rauno Tuhkanen, in the proceedings of
Document Engineering 2004. See
http://portal.acm.org/citation.cfm?doid=1030397.1030441 for more
bibliographic information and a link to the full text (for
subscribers to the ACM digital library).

It may be worthwhile to note that the problem you have encountered
results from a blunder on the part of the XSD working group. When
numeric ranges for occurrence indicators were introduced, the WG
failed to notice that the difference between weak determinism and
strong determinism had become important and visible in the language,
and so it never reached a considered decision on which of the two to
require. As the rule known as the 'Unique Particle Attribution'
constraint shows, the WG ended up with weak determinism, but this was
not a conscious choice.

I wish the WG had had the guts to fix this design error in 1.1, but
we did not. Instead, each implementation has found a different way of
fudging the issue. The result is that schemas which exhibit the
problem (i.e. content models which are weakly deterministic but not
strongly deterministic) are unambiguously legal and have a
well-defined interpretation, yet relatively few implementations
actually implement what the spec says, and no two of them are likely
to implement exactly the same thing. The schema author thus gets the
best of both worlds: different implementations will do different
things with the schema, but since the schema is legal, it is unlikely
that the schema author will ever be alerted to the difficulty.

--
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************
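To make the failure mode concrete: the model e1 ( e2{1,2} | e3 ){2,2} e4
satisfies the Unique Particle Attribution constraint (each <e2/> can only
be attributed to the one e2 particle), but after matching one <e2/> a
matcher cannot tell, without lookahead, whether the next <e2/> continues
the inner e2{1,2} or begins the second iteration of the choice; that is
the decision strong determinism would make unambiguous. The minimal
Python sketch below (function names are invented for illustration, and it
is hard-coded to this one content model, not a general validator)
contrasts a greedy one-pass matcher, which reproduces the reported
failure on <e1/><e2/><e2/><e4/>, with a matcher that backtracks over the
possible splits and accepts the same input.

# Toy illustration of the content model  e1 ( e2{1,2} | e3 ){2,2} e4.
# Hypothetical names; hard-coded to this one model, not a general validator.

def match_choice_greedy(events, i):
    """One iteration of the choice, eating as many e2s as the upper bound allows."""
    if i < len(events) and events[i] == "e2":
        n = 0
        while i < len(events) and events[i] == "e2" and n < 2:  # greedy e2{1,2}
            i += 1
            n += 1
        return i
    if i < len(events) and events[i] == "e3":                   # the e3 alternative
        return i + 1
    return None                                                 # no alternative matched

def validate_greedy(events):
    """Greedy, no backtracking: the failure mode described in the quoted message."""
    if not events or events[0] != "e1":
        return False
    i = 1
    for _ in range(2):            # the choice must be matched exactly twice
        i = match_choice_greedy(events, i)
        if i is None:
            return False          # on <e1/><e2/><e2/><e4/> the second pass sees only e4
    return events[i:] == ["e4"]

def validate_backtracking(events):
    """Try every legal split of the input; accept if any split works."""
    def choice(i, remaining):
        if remaining == 0:
            return events[i:] == ["e4"]      # only the trailing e4 may remain
        for n in (1, 2):                     # e2{1,2}: try one e2, then two
            if events[i:i + n] == ["e2"] * n and choice(i + n, remaining - 1):
                return True
        if i < len(events) and events[i] == "e3":   # or a single e3
            return choice(i + 1, remaining - 1)
        return False
    return bool(events) and events[0] == "e1" and choice(1, 2)

doc = ["e1", "e2", "e2", "e4"]
print("greedy:      ", validate_greedy(doc))        # False: both e2s consumed in pass one
print("backtracking:", validate_backtracking(doc))  # True: one e2 per choice iteration

Naive backtracking is exponential in the worst case; the counter-based
approach described in the Kilpeläinen and Tuhkanen paper cited above is
the place to look for an efficient treatment of numeric occurrence
ranges.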
Received on Wednesday, 11 August 2010 01:39:08 UTC