equivalent power in SGML and XML
Eve Maler's note with questions about the equivalence of SGML and XML
raises serious questions, but before I try to answer them I want to
pose even more questions. I apologize for the length of this posting,
but I'm just working out these ideas and I don't understand them
well enough to summarize them crisply.
We say that XML and SGML should have 'essentially the same expressive
power', and my own attempt to answer Eve's questions depends on
clarifying what that means.
First, let's distinguish equivalence among document instances from
equivalence among DTDs.
I Instance equivalence
Instances can be:
- byte-equivalent, if each octet of each entity in the document is
the same
- character-equivalent, if each 'character' in each entity is the same,
but the encoding of the character may differ: a document in
ASCII and the same document after translation into EBCDIC is a
commonplace and old-fashioned example; the same document in
ISO 8859-1 and its equivalent in UCS-2 or UTF-8 is a similar and
more modern example
- external-entity-equivalent, if each character in each *external*
entity is the same (after resolving references to internal entities
and character references)
- ESIS-equivalent, if their document elements are E-equivalent, where
elements are E-equivalent if and only if:
* they have the same generic identifier
* they have the same set of attributes
* corresponding attributes have the same values (after
expansion of entity references)
* they have the same number of children, and all pairs of
corresponding children are E-equivalent
and character data chunks are E-equivalent if they consist of
the same sequence of 'characters' (which does not mean they are
in the same encoding). This may differ slightly in detail from
the definition of ESIS in the Corrigendum, but not intentionally.
- EE-ESIS-equivalent, if they are ESIS-equivalent and the beginnings
and endings of external entities occur at the same places in the
ESIS. (This isn't well phrased; I hope it's reasonably clear
what I mean: ESIS-equivalent, plus the same division into external
entities.)
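The recursive definition of E-equivalence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Element` class and the `e_equivalent` function are hypothetical names, attribute values are assumed to have had their entity references expanded already, and adjacent character data is assumed to have been merged into a single chunk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """Hypothetical ESIS-like node: generic identifier, attributes, children."""
    gi: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def e_equivalent(a, b):
    """Return True if two nodes are E-equivalent in the sense sketched above."""
    # Character data chunks: same sequence of characters, whatever encoding
    # they were originally stored in.
    if isinstance(a, str) or isinstance(b, str):
        return isinstance(a, str) and isinstance(b, str) and a == b
    # Same generic identifier.
    if a.gi != b.gi:
        return False
    # Same set of attributes, with the same (already expanded) values.
    if a.attrs != b.attrs:
        return False
    # Same number of children, and corresponding children E-equivalent.
    if len(a.children) != len(b.children):
        return False
    return all(e_equivalent(x, y) for x, y in zip(a.children, b.children))


left = Element("p", {}, ["Hello, world!"])
right = Element("p", {}, ["Hello, world!"])
print(e_equivalent(left, right))
```

Two documents would then be ESIS-equivalent when `e_equivalent` holds for their document elements.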
I need some examples:
1 <p>Hello, world!</p> is byte-equivalent to <p>Hello, world!</p>
and character-equivalent to the same text in EBCDIC (or in ASCII, I
should say, because as I write this, it *is* in EBCDIC).
2 If we have <!ENTITY greet 'Hello'> and <!ENTITY someone 'world'>
then <p>Hello, world!</p> is ESIS-equivalent, but not character
equivalent, to <p >&greet;, &someone;!</>
3 A document divided into one file per chapter is ESIS-equivalent, but
not EE-ESIS-equivalent, with the same document in a single file (i.e.
with all references to external entities resolved).
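Examples 1 and 2 can be tried out with a short Python sketch. Some assumptions are mine, not part of the examples: `cp037` stands in for "EBCDIC" (it is only one of several EBCDIC code pages), the entity declarations are placed in an internal DTD subset, the end tag is written out in full because an XML parser is used, and `summary` is a hypothetical helper that reduces a parsed document to the pieces ESIS equivalence compares.

```python
import xml.etree.ElementTree as ET

text = "<p>Hello, world!</p>"

# Example 1: same characters, different octets.
ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # assumption: cp037 as the EBCDIC code page

byte_equivalent = ascii_bytes == ebcdic_bytes
character_equivalent = ascii_bytes.decode("ascii") == ebcdic_bytes.decode("cp037")

# Example 2: different characters in the source, same parsed result.
plain = "<p>Hello, world!</p>"
with_entities = (
    "<!DOCTYPE p [\n"
    "  <!ENTITY greet 'Hello'>\n"
    "  <!ENTITY someone 'world'>\n"
    "]>\n"
    "<p>&greet;, &someone;!</p>"
)


def summary(source):
    """Hypothetical helper: the parts of a one-element document that ESIS sees."""
    root = ET.fromstring(source)
    return (root.tag, dict(root.attrib), root.text, len(root))


print(byte_equivalent, character_equivalent)
print(plain == with_entities, summary(plain) == summary(with_entities))
```

The first pair of values shows character equivalence without byte equivalence; the second shows ESIS equivalence without character equivalence.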
If two instances have the same DTD and are byte- or
character-equivalent, they are guaranteed ESIS-equivalent, but not vice
versa. If they are external-entity equivalent and have the same DTD,
they are ESIS-equivalent; I can't think of a way to preserve entity
equivalence as defined here without preserving ESIS equivalence.
II DTD equivalence
Now for DTD equivalence. Normally we think of this in terms of the set
of effective declarations, after resolution of entity references and
marked sections, etc. Here, though, I think we should focus on the
languages defined by two DTDs. We need to clarify what we mean by
'equivalent expressive power'.
In general (the computer scientists among us may correct me freely) I
think grammars are equivalent if they define the same language, i.e. if
they accept the same strings as members of the language and reject the
same non-members. One formalism F has the same power as another
formalism G if for each grammar written using F, an equivalent grammar
may be written using G, and vice versa. F is more powerful than G
if the set of languages definable by F's grammars is a proper superset
of the set of languages definable by G's grammars.
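The notion of two grammars accepting and rejecting the same strings can be illustrated with a bounded check. In this sketch regular expressions stand in for grammars, and `language` and `same_up_to` are hypothetical helper names. Comparing all strings up to a fixed length can show that two grammars differ, but it cannot prove that they are equivalent.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """Strings over `alphabet`, at most `max_len` symbols long, accepted by `pattern`."""
    compiled = re.compile(pattern)
    accepted = set()
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            candidate = "".join(symbols)
            if compiled.fullmatch(candidate):
                accepted.add(candidate)
    return accepted


def same_up_to(p, q, alphabet, max_len):
    """True if `p` and `q` accept exactly the same strings up to `max_len` symbols."""
    return language(p, alphabet, max_len) == language(q, alphabet, max_len)


# Two differently written grammars for "zero or more repetitions of ab".
print(same_up_to("(ab)*", "(a(ba)*b)?", "ab", 6))
# "(ab)+" rejects the empty string, so it defines a different language.
print(same_up_to("(ab)*", "(ab)+", "ab", 6))
```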
Now, if XML's DTDs are to be 'as powerful' as SGML's, they have to
be able to define the same set of languages: for every SGML DTD there
must be an equivalent XML DTD accepting and rejecting exactly the same
set of valid and invalid document instances. This we cannot do
if we are set on doing without CDATA and RCDATA elements (or even
DATATAG and RANK!).
On the other hand, we can say that for our purposes XML DTDs and SGML
DTDs have 'essentially' the same expressive power if for any SGML
DTD, there is an XML DTD such that:
- for every document instance accepted by the SGML DTD, the XML
DTD accepts a document which is 'equivalent'
- for every document instance accepted by the XML DTD, the SGML
DTD accepts a document which is 'equivalent'.
We can choose, I think, to require that the documents be ESIS- or
EE-ESIS-equivalent, or apply some other test, depending on how strict we
want to be. We could also insist on byte-equivalence or
character-equivalence, but if we do so we can't make any real changes to
the syntax.
Dropping DATATAG and RANK and SHORTREF doesn't bother people, I think,
because we're all used to the notion of normalizing SGML documents, and the
normalized forms of documents share ESIS with their originals, but
not their bytes or characters.
If we drop CDATA and RCDATA elements, we can meet the same standard:
any CDATA element in an accepted SGML instance can be replaced, in XML,
with the same thing, but with < and & escaped using any of the various
possibilities discussed earlier. If we drop the & connector from the
content model syntax, as I suspect some people would like to do, then we
haven't changed the expressive power of the language because (in theory)
it is always possible to replace a content model using & with an
equivalent one not using &. It may be ugly and very very long -- and
for these reasons no one is likely to do it -- but it's *possible*.
Dropping EMPTY, on the other hand, would seriously affect the expressive
power of XML DTDs, since we could not guarantee the second
DTD-equivalence clause above.
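The remark about the & connector can be made concrete with a small sketch. It assumes the simplest case only: every member of the & group is a distinct, required element name. Optional or repeatable members, or nested groups, would need more care. The function name `expand_and_group` is hypothetical. The output is factored so that each alternative begins with a different name, which keeps the rewritten model unambiguous.

```python
def expand_and_group(names):
    """Rewrite an SGML (a & b & ...) group using only the ',' and '|' connectors."""
    names = list(names)
    if not names:
        raise ValueError("an & group needs at least one member")
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, first in enumerate(names):
        # Pick one member to come first, then expand the remaining members.
        rest = names[:i] + names[i + 1:]
        alternatives.append("(" + first + ", " + expand_and_group(rest) + ")")
    return "(" + " | ".join(alternatives) + ")"


print(expand_and_group(["a", "b"]))
print(expand_and_group(["a", "b", "c"]))
```

The number of orderings covered grows as the factorial of the number of members, which is the "ugly and very very long" part: a group of six members already stands for 720 distinct sequences.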
III What is needed?
I think we should shoot for the following:
- any existing SGML document can be translated into an EE-ESIS-equivalent
XML document (i.e. same ESIS, same gross entity structure, but
there may be more, or fewer, internal entities and references)
- any existing SGML DTD can be translated into an XML DTD which
recognizes a set of documents EE-ESIS-equivalent to the set recognized
by the original DTD.
- a document translated from SGML into XML into SGML should be
EE-ESIS-equivalent to the original; same for one that goes X-S-X.
I don't think we should shoot for more than EE-ESIS equivalence after
round-trip transmission. Bytes will change. References to internal
entities will be resolved, or introduced. That means there may be
additional entity declarations in the DTD, and thus we can't even
guarantee that the document will parse with the same DTD. I think we
can guarantee that it will parse with the same set of effective element,
attlist, and notation declarations.
-C. M. Sperberg-McQueen