Re: Ambiguous content models -- allowed or disallowed by XSDL? from Henry S. Thompson on 2002-06-13 (xmlschema-dev@w3.org from June 2002)

From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
Date: 13 Jun 2002 09:44:07 +0100
To: Ian Stokes-Rees <ijs@decisionsoft.com>
Cc: xmlschema-dev@w3.org
Message-ID: <f5b1ybb7elk.fsf@cogsci.ed.ac.uk>

I'd be happy to look at specific cases where there's debate in your
group and/or disagreement among validators -- I'm not prepared to go
on a fishing expedition through all your schema docs.  I'm sure others 
on the list will help as well.

As for ambiguity, it's crucial to distinguish between unique
attribution and ambiguity.  Neither SGML DTDs, nor XML DTDs, nor W3C
XML Schema, have ever ruled out ambiguous content models as such.
They've all required some version of unique attribution.

Ambiguity means more than one possible analysis of the input is
possible wrt the content model.  Unique attribution requires online
(i.e. without lookahead) unequivocal determination of which particle
to try.  So for UPA to be violated, there _must_ be two particles
which could allow some element, i.e. two element decls with the same
name, or one element decl and an overlapping wildcard, or two
overlapping wildcards.  A content model without any of those _can't_
violate UPA.  So although UPA rules out many many ambiguous models, it 
doesn't rule them all out.  First of all, all the ones allowed by SGML 
and XML are allowed, i.e. (a+)*, (a*)+, etc.  Second of all, the ones
created by the advent of numeric exponents are allowed, i.e.

  <sequence minOccurs='2' maxOccurs='3'>
   <element name='a' minOccurs='2' maxOccurs='3'/>
  </sequence>

which gives two analyses for a sequence of six <a>s.

Summary: Ambiguity and unique attribution are different -- the *ML
family have never ruled out the former, always required the latter.

I hope it's clear _why_ this is reasonable: unique attribution matters
-- without it, parsers have to handle arbitrary parallelism and/or
backtracking, and content model subsumption checking would be much
harder (exponentially harder).  Ambiguity doesn't matter -- it's easy
to build a parser that accepts all and only valid content, without
wasting effort, and since there's no way to exploit which analysis was
used, the fact that both are not available doesn't matter.

Hope this helps,

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2002, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/
 [mail really from me _always_ has this .sig -- mail without it is forged spam]

Received on Thursday, 13 June 2002 04:44:11 UTC