Re: equivalent power in SGML and XML

[Thanks, Charles, for correcting me on the ESIS/EMPTY issue...]

Michael -- compared to my poor attempt, your mail seemed crystal clear
to me!  You've explained why EMPTY isn't just another random kind of 
declared content.

I could easily agree to cut out *multiple* ways of expressing the same
markup language without getting down to *zero* ways, which meets our 
principle of having only one way to do things.  If we agreed to go
by this rule, mixed content would have to stay in...  Of course, we
could still decide that some things have to go, but I think the 
criteria should be much more stringent.  (Hmm, could RANK be simulated
for all practical purposes by GIs that have the number in them?)


At 06:46 PM 9/16/96 CDT, Michael Sperberg-McQueen wrote:
>Eve Maler's note with questions about the equivalence of SGML and XML
>raises serious questions, but before I try to answer them I want to
>pose even more questions.  I apologize for the length of this posting,
>but I'm just working out these ideas and I don't understand them
>well enough to summarize them crisply.
>We say that XML and SGML should have 'essentially the same expressive
>power', and my own attempt to answer Eve's questions depends on
>clarifying what that means.
>First, let's distinguish equivalence among document instances from
>equivalence among DTDs.
>I Instance equivalence
>Instances can be:
>  - byte-equivalent, if each octet of each entity in the document is
>    the same
>  - character-equivalent, if each 'character' in each entity is the same,
>    but the encoding of the character may differ:  a document in
>    Ascii and the same document after translation into Ebcdic is a
>    common-place and old-fashioned example; the same document in
>    ISO 8859-1 and its equivalent in UCS-2 or UTF-8 is a similar and
>    forward-looking example.
>  - external-entity-equivalent, if each character in each *external*
>    entity is the same (after resolving references to internal entities
>    and character references)
>  - ESIS-equivalent, if their document elements are E-equivalent, where
>    elements are E-equivalent if and only if:
>       * they have the same generic identifier
>       * they have the same set of attributes
>       * corresponding attributes have the same values (after
>         expansion of entity references)
>       * they have the same number of children, and all pairs of
>         corresponding children are E-equivalent
>    and character data chunks are E-equivalent if they consist of
>    the same sequence of 'characters' (which does not mean they are
>    in the same encoding).  This may differ slightly in detail from
>    the definition of ESIS in the Corrigendum, but not intentionally.
>  - EE-ESIS-equivalent, if they are ESIS equivalent and the beginnings
>    and endings of external entities occur at the same places in the
>    ESIS.  (This isn't well phrased; I hope it's reasonably clear
>    what I mean:  ESIS equivalent, plus division into external entities
>    is preserved.)
>I need some examples:
>1 <p>Hello, world!</p> is byte equivalent to <p>Hello, world!</p>
>and to the equivalent in EBCDIC (or in ASCII, I should say, because
>as I write this, it *is* in EBCDIC.
>2 If we have <!ENTITY greet 'Hello'> and <!ENTITY someone 'world'>
>then <p>Hello, world!</p> is ESIS-equivalent, but not character
>equivalent, to <p  >&greet;, &someone;!</>
>3 A document divided into one file per chapter is ESIS-equivalent, but
>not EE-ESIS-equivalent, with the same document in a single file (i.e.
>with all references to external entities resolved).
>If two instances have the same DTD and are byte- or
>character-equivalent, they are guaranteed ESIS-equivalent, but not vice
>versa.  If they are external-entity equivalent and have the same DTD,
>they are ESIS-equivalent; I can't think of a way to preserve entity
>equivalence as defined here without preserving ESIS equivalence.
>II DTD equivalence
>Now for DTD equivalence.  Normally we think of this in terms of the set
>of effective declarations, after resolution of entity references and
>marked sections, etc.  Here, though, I think we should focus on the
>languages defined by two DTDs.  We need to clarify what we mean by
>'equivalent expressive power'.
>In general (the computer scientists among us may correct me freely) I
>think grammars are equivalent if they define the same languages, i.e. if
>they accept the same strings as members of the language and reject the
>same non-members.  One formalism F has the same power as another
>formalism G if for each grammar written using F, an equivalent grammar
>may be written using G, and vice versa.  F is more powerful than G
>if the set of languages definable by F's grammars is a proper superset
>of the set of languages definable by G's grammars.
>Now, if XML's DTDs are to be 'as powerful' as SGML's, they have to
>be able to define the same set of languages:  for every SGML DTD there
>must be an equivalent XML DTD accepting and rejecting exactly the same
>set of valid and invalid document instances.  This, we cannot do,
>if we are set on doing without CDATA and RCDATA elements.  (Or even
>On the other hand, we can say that for our purposes XML DTDs and SGML
>DTDs have 'essentially' the same expressive power if for any SGML
>DTD, there is an XML DTD such that:
>  - for every document instance accepted by the SGML DTD, the XML
>    DTD accepts a document which is 'equivalent'
>  - for every document instance accepted by the XML DTD, the SGML
>    DTD accepts a document which is 'equivalent'.
>We can choose, I think, to require that the documents be ESIS- or
>EE-ESIS-equivalent, or apply some other test, depending on how strict we
>want to be.  We could also insist on byte-equivalence or
>character-equivalence, but if we do so we can't make any real changes to
>Dropping DATACHAR and RANK and SHORTREF don't bother people, I think,
>because we're all used the notion of normalizing SGML documents, and the
>normalized forms of documents share ESIS with their originals, but
>aren't byte-equivalent.
>If we drop CDATA and RCDATA elements, we can meet the same standard:
>any CDATA element in an accepted SGML instance can be replaced, in XML,
>with the same thing, but with < and & escaped using any of the various
>possibilities discussed earlier.  If we drop the & connector from the
>content model syntax, as I suspect some people would like to do, then we
>haven't changed the expressive power of the language because (in theory)
>it is always possible to replace a content model using & with an
>equivalent one not using &.  It may be ugly and very very long -- and
>for these reasons no one is likely to do it -- but it's *possible*.
>Dropping EMPTY, on the other hand, would seriously affect the expressive
>power of XML DTDs, since we could not guarantee the second
>DTD-equivalence clause above.
>III What is needed?
>I think we should shoot for the following:
> - any existing SGML document can be translated into an EE-ESIS
>equivalent document (i.e. same ESIS, same gross entity structure, but
>there may be more, or fewer, internal entities and references)
> - any existing SGML DTD can be translated into an XML DTD which
>recognizes a set of documents EE-ESIS-equivalent to the original DTD.
> - a document translated from SGML into XML into SGML should be
>EE-ESIS-equivalent to the original; same for one that goes X-S-X.
>I don't think we should shoot for more than EE-ESIS equivalence after
>round-trip transmission.  Bytes will change.  References to internal
>entities will be resolved, or introduced.  That means there may be
>additional entity declarations in the DTD, and thus we can't even
>guarantee that the document will parse with the same DTD.  I think we
>can guarantee that it will parse with the same set of effective element,
>attlist, and notation declarations.
>-C. M. Sperberg-McQueen