Re: should all XML parsers reject non-deterministic content models?

Daniel Veillard wrote:
 
> On Sun, Jan 14, 2001 at 04:42:55PM +0900, TAKAHASHI Hideo(BSD-13G) wrote:
> > Hello.
> >
> > I understand that the XML 1.0 spec prohibits non-deterministic (or,
> > ambiguous) content models (for compatibility, to be precise).
> 
>   Note also that this is stated in a non-normative appendix.

It is also stated normatively in the body of the spec
(http://www.w3.org/TR/REC-xml#sec-element-content): "For compatibility,
it is an error if an element in the document can match more than one
occurrence of an element type in the content model."

> > Are all xml 1.0 compliant xml processing software required to reject
> > DTDs with such content models?
> 
>   Since it is stated as non-normatively only I don't think this is the
> case in theory.

It states normatively that it is an error. However, when the XML Rec
says merely that something is an error, a conforming XML processor is
not required to report it. If it said "it is a fatal error", then a
conforming XML processor would be required to report it.

>   In prectice this can be a problem. I recently faced a problem with
> a DtD developped at the IETF which was clearly non-determinist. This
> also means that this introduce new classes of XML parser among the
> validating ones:
>    - those who detect and report non-determinist content model
>    - those who validate (correctly) or not using non-determinist
>      content model

Yes.

> > Ambiguous content models doesn't cause any problems when you construct a
> > DFA via an NFA.  I have heard that there is a way to construct DFAs
> > directly from regexps without making an NFA, but that method can't
> > handle non-deterministic regular expressions.  If you choose that method
> > to construct your DFA, you will surely benefit from the rule in XML 1.0
> > . But if you choose not, detecting non-deterministic content models
> > become an extra job.
> 
>   I tried to read the Brüggemann-Klein thesis listed in reference and
> found it a bit frightening, though very informative. The beginning
> of the Part I on Document Grammar for example makes clear that SGML
> view of unambiguity of the content model is really a 1 token lookahead
> determinism.
>   In practice this is a very good rule because it allows to simplify
> the validation of a content model a lot.

Complicating things for the user to make things simpler for the parser
writer seems in general a bad trade-off to me.  Also this decreases
interoperability (as you've observed). The only justification is
compatibility with SGML.

In fact, there is a very simple algorithm available that handles
non-determinism just fine (it doesn't require you to construct a NFA and
then do the subset construction).  See
http://www.flightlab.com/~joe/sgml/validate.html (TREX uses a variation
on this).

> Problem is that grammars
> need to be rewritten to conform to it (the thesis proves it's always
> possible at lest).

The thesis proves the opposite: that there are some regular expressions
that do not denote 1-unambiguous languages (see p52). She gives the
example of

  (a|b)*,a,(a|b)

James

Received on Sunday, 14 January 2001 05:44:19 UTC