Content models from David G. Durand on 1996-10-02 (w3c-sgml-wg@w3.org from October 1996)

From: David G. Durand <dgd@cs.bu.edu>
Date: Wed, 2 Oct 1996 13:44:08 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <v02130508ae784ff0ef55@[128.148.157.46]>
   Main points of this note:
   1. I was wrong earlier, SGML content models are strictly less-expressive
than full regular expressions

   2. Since the differences are a proven cause of confusion, we should just
use regular expressions (the standard literature offers cookbook solutions
for easier and slower as well as harder but faster faster implementations.
Content-model verification is unlikely to be the bottleneck in a parser
anyway.

   Because of 1, 2 implies that some XML content models will be stronger
than their translations to SGML content models.

   As Michael reminded me in a private note, SGML content models are _not_
equivalent to DFAs. They are strictly less-powerful. He also gave a
counterexample:
   ((a,b)*, a?)

This has a deterministic parsing automaton, but no non-ambiguous RE (in the
sense of SGML).

   As Michael reminded me, Anne Brueggemann-Klein's papers on SGML content
models describe the relevant theory and prove some useful results. On
re-reading rather shan skimming the papers, I felt better because I finally
had a clear definition of ambiguity, and an algorithm to test it, but it
reinforced my earlier comment:

>If you haven't guessed, I'd rather eliminate the ambiguity rules and be
>able to use a rich and well-understood theory instead of having to help
>invent a new theory that only allows things we might rather not allow
>anyway.

   The Brueggemann-Klein papers (I have tech reports, but they may have
made it to a journal by now) and the key literature they cite are
relatively obscure, compared to the stnadard RE algorithms.  It's
particularly bad that the results are not in the standard books like the
Dragon book. If there's one book that everyone  (even undergrads) already
knows they have to have if they're going to write a parser, that is it.

Michael added:
>Doing the instance-to-model match would make perfect sense if you
>expected application actions to be defined on 'the third X token of
>the content model'.  Maybe some members of WG8 expected that to
>happen.  But I've never seen software that allowed such specifications,
>and I have a hard time imagining users who would want it, either.

   Since people _don't_ do it now, that's an argument that we needn't worry
about what they might do, if they did have it.

   -- David

RE delenda est.

--------------------------------------------+--------------------------
David Durand                  dgd@cs.bu.edu | david@dynamicDiagrams.com
Boston University Computer Science          | Dynamic Diagrams
http://www.cs.bu.edu/students/grads/dgd/    | http://dynamicDiagrams.com/
Received on Wednesday, 2 October 1996 13:40:17 UTC