W3C home > Mailing lists > Public > xml-editor@w3.org > January to March 2001

Re: should all XML parsers reject non-deterministic content models?

From: Daniel Veillard <Daniel.Veillard@imag.fr>
Date: Sun, 14 Jan 2001 12:39:02 +0100
To: James Clark <jjc@jclark.com>
Cc: Daniel.Veillard@imag.fr, "TAKAHASHI Hideo(BSD-13G)" <hideo-t@bisd.hitachi.co.jp>, xml-editor@w3.org, xml-dev@lists.xml.org
Message-ID: <20010114123902.C26487@imag.fr>
On Sun, Jan 14, 2001 at 05:42:16PM +0700, James Clark wrote:
> Daniel Veillard wrote:
>  
> > On Sun, Jan 14, 2001 at 04:42:55PM +0900, TAKAHASHI Hideo(BSD-13G) wrote:
> > > Hello.
> > >
> > > I understand that the XML 1.0 spec prohibits non-deterministic (or,
> > > ambiguous) content models (for compatibility, to be precise).
> > 
> >   Note also that this is stated in a non-normative appendix.
> 
> It is also stated normatively in the body of the spec
> (http://www.w3.org/TR/REC-xml#sec-element-content): "For compatibility,
> it is an error if an element in the document can match more than one
> occurrence of an element type in the content model."

  Oops, right, I missed this sentence, sorry.

> >   In practice this is a very good rule because it allows to simplify
> > the validation of a content model a lot.
> 
> Complicating things for the user to make things simpler for the parser
> writer seems in general a bad trade-off to me.  Also this decreases
> interoperability (as you've observed). The only justification is
> compatibility with SGML.

  I'm more convinced by the agument that some content models may
not be expressable in an 1-unambiguous languages. But I'm still
a bit worried that either way interoperability will be a concern.
In this case compatibility with SGML generates risk of interoperability
problems between XML tools, "results are undefined" doesn't sound good
(http://www.w3.org/TR/REC-xml#dt-error).

> In fact, there is a very simple algorithm available that handles
> non-determinism just fine (it doesn't require you to construct a NFA and
> then do the subset construction).  See
> http://www.flightlab.com/~joe/sgml/validate.html (TREX uses a variation
> on this).

  Thanks for the pointer !
  
> The thesis proves the opposite: that there are some regular expressions
> that do not denote 1-unambiguous languages (see p52). She gives the
> example of
> 
>   (a|b)*,a,(a|b)

  Hum, right, this makes sense and is also stated in Appendix E.

    thanks a lot,

Daniel

-- 
Daniel Veillard      | Red Hat Network http://redhat.com/products/network/
daniel@veillard.com  | libxml Gnome XML toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
Received on Sunday, 14 January 2001 06:39:06 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Monday, 7 December 2009 10:59:31 GMT