[Prev][Next][Index][Thread]

equivalent power in SGML and XML



Eve Maler's note with questions about the equivalence of SGML and XML
raises serious questions, but before I try to answer them I want to
pose even more questions.  I apologize for the length of this posting,
but I'm just working out these ideas and I don't understand them
well enough to summarize them crisply.

We say that XML and SGML should have 'essentially the same expressive
power', and my own attempt to answer Eve's questions depends on
clarifying what that means.

First, let's distinguish equivalence among document instances from
equivalence among DTDs.

I Instance equivalence

Instances can be:
  - byte-equivalent, if each octet of each entity in the document is
    the same
  - character-equivalent, if each 'character' in each entity is the same,
    but the encoding of the character may differ:  a document in
    Ascii and the same document after translation into Ebcdic is a
    common-place and old-fashioned example; the same document in
    ISO 8859-1 and its equivalent in UCS-2 or UTF-8 is a similar and
    forward-looking example.
  - external-entity-equivalent, if each character in each *external*
    entity is the same (after resolving references to internal entities
    and character references)
  - ESIS-equivalent, if their document elements are E-equivalent, where
    elements are E-equivalent if and only if:
       * they have the same generic identifier
       * they have the same set of attributes
       * corresponding attributes have the same values (after
         expansion of entity references)
       * they have the same number of children, and all pairs of
         corresponding children are E-equivalent
    and character data chunks are E-equivalent if they consist of
    the same sequence of 'characters' (which does not mean they are
    in the same encoding).  This may differ slightly in detail from
    the definition of ESIS in the Corrigendum, but not intentionally.
  - EE-ESIS-equivalent, if they are ESIS equivalent and the beginnings
    and endings of external entities occur at the same places in the
    ESIS.  (This isn't well phrased; I hope it's reasonably clear
    what I mean:  ESIS equivalent, plus division into external entities
    is preserved.)

I need some examples:

1 <p>Hello, world!</p> is byte equivalent to <p>Hello, world!</p>
and to the equivalent in EBCDIC (or in ASCII, I should say, because
as I write this, it *is* in EBCDIC.

2 If we have <!ENTITY greet 'Hello'> and <!ENTITY someone 'world'>
then <p>Hello, world!</p> is ESIS-equivalent, but not character
equivalent, to <p  >&greet;, &someone;!</>

3 A document divided into one file per chapter is ESIS-equivalent, but
not EE-ESIS-equivalent, with the same document in a single file (i.e.
with all references to external entities resolved).

If two instances have the same DTD and are byte- or
character-equivalent, they are guaranteed ESIS-equivalent, but not vice
versa.  If they are external-entity equivalent and have the same DTD,
they are ESIS-equivalent; I can't think of a way to preserve entity
equivalence as defined here without preserving ESIS equivalence.


II DTD equivalence

Now for DTD equivalence.  Normally we think of this in terms of the set
of effective declarations, after resolution of entity references and
marked sections, etc.  Here, though, I think we should focus on the
languages defined by two DTDs.  We need to clarify what we mean by
'equivalent expressive power'.

In general (the computer scientists among us may correct me freely) I
think grammars are equivalent if they define the same languages, i.e. if
they accept the same strings as members of the language and reject the
same non-members.  One formalism F has the same power as another
formalism G if for each grammar written using F, an equivalent grammar
may be written using G, and vice versa.  F is more powerful than G
if the set of languages definable by F's grammars is a proper superset
of the set of languages definable by G's grammars.

Now, if XML's DTDs are to be 'as powerful' as SGML's, they have to
be able to define the same set of languages:  for every SGML DTD there
must be an equivalent XML DTD accepting and rejecting exactly the same
set of valid and invalid document instances.  This, we cannot do,
if we are set on doing without CDATA and RCDATA elements.  (Or even
DATATAG and RANK!)

On the other hand, we can say that for our purposes XML DTDs and SGML
DTDs have 'essentially' the same expressive power if for any SGML
DTD, there is an XML DTD such that:

  - for every document instance accepted by the SGML DTD, the XML
    DTD accepts a document which is 'equivalent'
  - for every document instance accepted by the XML DTD, the SGML
    DTD accepts a document which is 'equivalent'.

We can choose, I think, to require that the documents be ESIS- or
EE-ESIS-equivalent, or apply some other test, depending on how strict we
want to be.  We could also insist on byte-equivalence or
character-equivalence, but if we do so we can't make any real changes to
DTDs.

Dropping DATACHAR and RANK and SHORTREF don't bother people, I think,
because we're all used the notion of normalizing SGML documents, and the
normalized forms of documents share ESIS with their originals, but
aren't byte-equivalent.

If we drop CDATA and RCDATA elements, we can meet the same standard:
any CDATA element in an accepted SGML instance can be replaced, in XML,
with the same thing, but with < and & escaped using any of the various
possibilities discussed earlier.  If we drop the & connector from the
content model syntax, as I suspect some people would like to do, then we
haven't changed the expressive power of the language because (in theory)
it is always possible to replace a content model using & with an
equivalent one not using &.  It may be ugly and very very long -- and
for these reasons no one is likely to do it -- but it's *possible*.

Dropping EMPTY, on the other hand, would seriously affect the expressive
power of XML DTDs, since we could not guarantee the second
DTD-equivalence clause above.


III What is needed?

I think we should shoot for the following:

 - any existing SGML document can be translated into an EE-ESIS
equivalent document (i.e. same ESIS, same gross entity structure, but
there may be more, or fewer, internal entities and references)
 - any existing SGML DTD can be translated into an XML DTD which
recognizes a set of documents EE-ESIS-equivalent to the original DTD.
 - a document translated from SGML into XML into SGML should be
EE-ESIS-equivalent to the original; same for one that goes X-S-X.

I don't think we should shoot for more than EE-ESIS equivalence after
round-trip transmission.  Bytes will change.  References to internal
entities will be resolved, or introduced.  That means there may be
additional entity declarations in the DTD, and thus we can't even
guarantee that the document will parse with the same DTD.  I think we
can guarantee that it will parse with the same set of effective element,
attlist, and notation declarations.




-C. M. Sperberg-McQueen


Follow-Ups: