Re: Regular expressions and content models (was: Re: questions about element declarations)

David G. Durand <dgd@cs.bu.edu> wrote:

>    Note that if XML follows the lead of Hytime, and says that any content
> model producing the same sequence of elements is good, we will have removed
> one of the perennial thorns in the side of DTD designers, and those who
> have to train them. Conversion from XML "any REGEXP goes" content models to
> SGML content models is possible (by the theorem proving the equivalence of
> DFAs and NDFAs and REs).

There are regular languages for which no unambiguous
SGML content model exists; (a, (b,a)*, b?)?, for instance.
(Look for Anne Brueggeman-Klein's papers cited in the SGML
Bibliography for more details...)  So this won't work.


> 2.) As you note, SGML requires that the parser match syntactic occurrences
> in the content model with instances in the document. Any model like (a?, a)
> is a problem because we don't know _which_ "a" we've seen. construed as a
> regular expression this is a non-problem because the set of strings defined
> is the same as than defined by (a, a?) or (a | (a,a)). Equivalence of
> regular expressions is defined in terms of the sequence of the elements
> they describe.

This would be a non-problem for SGML as well were it
not for OMITTAG.  The rules for deciding when an element
is "contextually required" make it necessary for the parser
to know which content token is satisfied by input symbols as
soon as they have been read.

> The only sensible way I know to implement this correctly is
> to implement content models as NDFAs and then ensure that there is no place
> where the non-determinism is used.

Brueggeman-Klein also tackles this problem; detecting
ambiguity turns out to be quite difficult!

>    Now, I think there are two problems with the way SGML does things:
>    a) It's a shame that we've invented "almost regular" expressions, and
> done it in a way that _prevents_ application of existing tools and basic
> theory.

Actually, since SGML content models define a strict subset
of the regular languages, any technique for processing r.e.s
can also be used to process content models (once you address the
problems of AND groups and OMITTAG of course; but the ambiguity
restriction doesn't make parsing any harder).

>    It's also worth noting that people use naive parsers of ambiguous
> regular expressions in text-editors every day, and it's not a problem. So
> the efficiencies that motivate the technical argument for eliminating
> ambiguity may not be very important anyhow.


> RE delenda est.

--Joe English