DTD equivalence and expressive power

In its meeting on 2 October, the ERB reached consensus on the following
issues relating to the equivalence of document instances and of DTDs in
XML and SGML.  The brief statement of the points of consensus is
followed by some discussion and examples.

1 For any XML DTD XD, it will be possible to generate, without human
intervention, an SGML DTD SD, such that
   (a) SD will accept all the document instances accepted by XD, and
   (b) SD will produce the same ESIS for them (modulo any exceptions
       required by the XML handling of white space and record
       boundaries).

2 If possible, XML will be defined in such a way that for any XML DTD
XD, a corresponding SGML DTD SD can be generated, without human
intervention, such that in addition to 1(a) and 1(b),
   (a) SD will accept *only* documents which are ESIS-equivalent to
       some document instance accepted by XD, and
   (b) if SD is translated back into XML, producing a third DTD XD',
       then XD and XD' will accept an ESIS-equivalent set of
       documents (i.e. for each document accepted by XD, there is a
       document accepted by XD' which has the same ESIS, and for
       each document accepted by XD', there is a document accepted by
       XD which has the same ESIS).



As may be seen, point 2 puts a slightly heavier burden on XML than point
1, requiring in item (a) that if XML DTDs are translated into SGML, the
resulting DTD enforces all the constraints of the original XML DTD, and
in item (b) that XML DTDs preserve their expressive power and accept
equivalent languages even after round-trip conversion into and from full
SGML.  It's not clear to everyone that this heavier burden can always be
met, so point 2 is expressed as a goal, not a hard requirement.

Another way of expressing point 2(a) is that XML will not have greater
expressive power than Full SGML.  This means, for example, that point
2(a) forbids XML to accept arbitrary regular expressions as content
models, since some regular expressions cannot be translated into SGML
content models.  A hypothetical XML DTD with

  <!ELEMENT x     - - ((a,b)*, a?) >
  <!ELEMENT (a,b) - O EMPTY        >

could in theory be translated into SGML as

  <!ELEMENT x     - - (a,b?)* >
  <!ELEMENT (a,b) - O EMPTY   >

which would fulfil the requirements of point 1, since any document
satisfying the first declaration also satisfies the second.  A document
containing <x><a><a></x>, however, would satisfy the SGML DTD without
satisfying the XML DTD.  Rule 2(a) says XML can't allow that to happen.
This has the effect that XML and SGML tools can both preserve the
validity of XML documents, assuming that they validate the documents
at all.
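
The example can be checked mechanically.  The following is a minimal
sketch in Python, not part of the ERB's statement: it assumes that a
content model can be stood in for by an ordinary regular expression
over the one-letter names 'a' and 'b', and the names XML_MODEL,
SGML_MODEL, accepts, sequences and compare are all invented for the
illustration.  It enumerates short child sequences and reports those
accepted by only one of the two models.

```python
import re
from itertools import product

# The two content models from the example, rewritten as Python regular
# expressions (an assumption made for this sketch only).
XML_MODEL = re.compile(r"(?:ab)*a?")   # ((a,b)*, a?)
SGML_MODEL = re.compile(r"(?:ab?)*")   # (a,b?)*


def accepts(model, children):
    """True if the whole child sequence matches the content model."""
    return model.fullmatch(children) is not None


def sequences(max_len):
    """Yield every sequence of 'a' and 'b' children up to max_len long."""
    for n in range(max_len + 1):
        for combo in product("ab", repeat=n):
            yield "".join(combo)


def compare(max_len=6):
    """Return (only_xml, only_sgml): sequences accepted by just one model."""
    only_xml, only_sgml = [], []
    for s in sequences(max_len):
        in_xml = accepts(XML_MODEL, s)
        in_sgml = accepts(SGML_MODEL, s)
        if in_xml and not in_sgml:
            only_xml.append(s)
        elif in_sgml and not in_xml:
            only_sgml.append(s)
    return only_xml, only_sgml


if __name__ == "__main__":
    only_xml, only_sgml = compare()
    print("accepted by the XML model only: ", only_xml)
    print("accepted by the SGML model only:", only_sgml)
```

An empty first list is consistent with point 1(a) for sequences up to
the chosen length; any entry in the second list, such as the sequence
a a from the example, is a counterexample to point 2(a).  A bounded
enumeration of this kind can exhibit counterexamples but cannot prove
their absence.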

The view of the ERB is, in short, that suggestions for increasing the
expressive power of SGML DTDs -- of which we have several -- will need
to go into the WG8 revision work, not into XML.

It should probably be noted that the ERB did not discuss, and so has not
reached consensus on, whether XML DTDs may be *less* expressive than
SGML DTDs, i.e. whether rule 2 should also hold if one switched the
names SGML and XML around in it.  That is, the jury is still out on
whether eliminating constructs like EMPTY, inclusion and exclusion
exceptions, etc., is ruled out in principle or not.


Further discussion for the hard-core set theorists ...

If we consider that each DTD defines (or 'generates') a language, then
the set of DTDs possible using some notation generates a set of
languages.  Call it the LG-set (for 'languages generated set') of the
notation.  Formally, rule 2(a) requires that the LG-set of XML be a
subset of SGML's LG-set.

Adding a rule 3 which replaces 'SGML' with 'XML' and vice versa would
require that SGML's LG-set be a subset of XML's, i.e. (taken together
with rule 2(a)) that the two LG-sets be identical (using, as always,
ESIS-equivalence modulo RE/RS differences as the test of identity of
instances).
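
For readers who prefer symbols, the two conditions can be written out
as follows.  This is only a sketch: the notation LG(N) for the LG-set
of a notation N and L(D) for the language of a DTD D is introduced
here for convenience and is not the ERB's, and each L(D) is assumed to
be taken up to ESIS-equivalence, ignoring RE/RS differences.

```latex
% Assumed notation, introduced for this sketch only:
%   L(D)  = set of document instances accepted by DTD D, taken up to
%           ESIS-equivalence (ignoring RE/RS differences)
%   LG(N) = set of languages generated by the DTDs of notation N
\[
  \mathrm{LG}(N) = \{\, L(D) : D \text{ is a DTD written in } N \,\}
\]
% Rule 2(a): XML is no more expressive than SGML.
\[
  \mathrm{LG}(\mathrm{XML}) \subseteq \mathrm{LG}(\mathrm{SGML})
\]
% Hypothetical rule 3: SGML is no more expressive than XML.
\[
  \mathrm{LG}(\mathrm{SGML}) \subseteq \mathrm{LG}(\mathrm{XML})
\]
% Both inclusions together give equality of the two LG-sets.
\[
  \mathrm{LG}(\mathrm{XML}) \subseteq \mathrm{LG}(\mathrm{SGML})
  \;\wedge\;
  \mathrm{LG}(\mathrm{SGML}) \subseteq \mathrm{LG}(\mathrm{XML})
  \;\Longrightarrow\;
  \mathrm{LG}(\mathrm{XML}) = \mathrm{LG}(\mathrm{SGML})
\]
```

Only the first inclusion is part of the consensus reported here; the
second is the open question on which the ERB has not taken a position.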