
more on equivalences and round-trip integrity



Thinking about the problem of document and DTD equivalence on the train
going home last night, I realized my long posting of last night didn't
deal with a couple of important issues.

I Further Types of Instance Equivalence

To discuss them usefully, I need to propose names for two more types of
document equivalence.  Document instances are

  - isomorphic, if they can be made ESIS-equivalent by consistently
    renaming generic identifiers and attributes (and possibly by global
    changes on tokens in attribute values ...) in one document, while
    retaining all the distinctions made in that document -- i.e. two
    elements have the same GI / two attributes have the same name /
    two attribute values are the same after the renaming if and only
    if they were the same beforehand.  I think this is the same as
    saying two documents are isomorphic if there is a one-to-one mapping
    between their sets of GIs and attribute names, and we can use
    that mapping to make the documents ESIS-equivalent.
  - unifiable (or structurally equivalent), if they can be made
    ESIS-equivalent by a consistent renaming in one or both documents
    that does not necessarily preserve all original distinctions.
    This boils down to saying the documents have the same number of
    elements, corresponding elements have the same number of children,
    and there is a mapping between their name spaces that has one:one,
    one:many, many:one, but no many:many relations.
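
The two definitions above can be checked mechanically.  What follows is
only a rough sketch in Python, under assumptions that are not in the
definitions themselves: a document is reduced to nested (gi, children)
tuples, attributes and character data are ignored, and "no many:many
relations" is read as "no pair of names in which both names also pair
with some other name".  The function names are made up for illustration.

```python
from collections import Counter

def pair_names(a, b, pairs):
    """Walk two (gi, children) trees in parallel, collecting (GI, GI) pairs.

    Returns False as soon as corresponding elements differ in child count.
    """
    (gi_a, kids_a), (gi_b, kids_b) = a, b
    if len(kids_a) != len(kids_b):
        return False
    pairs.add((gi_a, gi_b))
    return all(pair_names(x, y, pairs) for x, y in zip(kids_a, kids_b))

def isomorphic(a, b):
    """Same shape, and the GI pairing is one-to-one in both directions."""
    pairs = set()
    if not pair_names(a, b, pairs):
        return False
    left = {x for x, _ in pairs}
    right = {y for _, y in pairs}
    return len(left) == len(pairs) == len(right)

def unifiable(a, b):
    """Same shape, and no pair of names is many:many (assumed reading)."""
    pairs = set()
    if not pair_names(a, b, pairs):
        return False
    fan_out = Counter(x for x, _ in pairs)  # how many names each left GI meets
    fan_in = Counter(y for _, y in pairs)   # how many names each right GI meets
    return not any(fan_out[x] > 1 and fan_in[y] > 1 for x, y in pairs)
```

With these, ("text", [("p", []), ("p", [])]) and
("body", [("para", []), ("para", [])]) come out isomorphic, while
replacing the second para with a note element leaves the pair unifiable
but no longer isomorphic.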

If two unifiable documents A and B can be made ESIS-equivalent by
renaming things just in B, then A *subsumes* B; roughly, if A subsumes B
then they are structurally similar and B may have more information
because it may make more distinctions.  A really useful definition of
subsumption will probably allow B to have some attributes that A doesn't
have, and may need subtler rules about attribute values, but I think we
can postpone that for a while.

If each document subsumes the other, they are isomorphic.  If they are
isomorphic and no renaming is necessary at all, they are
ESIS-equivalent.
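
As a rough illustration of subsumption, here is a sketch in Python.  It
assumes documents are reduced to nested (gi, children) tuples, with
attributes and character data left out, and the function name is
invented for the purpose.  A subsumes B when a single renaming of B's
GIs, possibly many:one, turns B into A.

```python
def subsumes(a, b, renaming=None):
    """True if renaming GIs in b alone (many:one allowed) turns b into a."""
    renaming = {} if renaming is None else renaming
    (gi_a, kids_a), (gi_b, kids_b) = a, b
    if len(kids_a) != len(kids_b):
        return False
    # Record the first target seen for gi_b; every later sighting of
    # gi_b must be renamed to that same target.
    if renaming.setdefault(gi_b, gi_a) != gi_a:
        return False
    return all(subsumes(x, y, renaming) for x, y in zip(kids_a, kids_b))

coarse = ("list", [("item", []), ("item", [])])
fine = ("list", [("first", []), ("rest", [])])
```

Here subsumes(coarse, fine) holds, since first and rest can both be
renamed to item, but subsumes(fine, coarse) does not, since item would
have to become two different names.  Under this sketch, two documents
are isomorphic exactly when the test succeeds in both directions.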

Isomorphic and unifiable document pairs will become important if we wish
to consider changes to the DTD language like
  - elimination of inclusion exceptions (this would have an effect
    on our RE/RS discussions, but killing inclusions to solve RE is
    cracking a walnut with a steam engine)
  - elimination of exclusion exceptions
  - allowing XML DTDs to accept arbitrary regular expressions, not just
    deterministic ones
  - requiring external entities to be synchronous with the element
    structure (i.e. requiring start-tag and end-tag for an element to
    occur in the same entity, or at least requiring both to be either
    in the document entity or in the same external entity)

If we want to hold fast to the two-way-validation test for DTD
equivalence I proposed last night (for each SGML DTD there is an
equivalent XML DTD that accepts a set of documents such that each member
of the SGML document set maps to an equivalent document in XML and vice
versa, using some measure of document-instance equivalence), then
limiting the use of exceptions, or requiring synchronous entity
boundaries, would require weakening the equivalence test for document
instances.


II Inclusion and Exclusion Exceptions

Approach A:  two-way validation and unifiable documents

For any SGML DTD that has inclusion or exclusion exceptions, we can
construct (in a potentially laborious calculation) a DTD without
exceptions such that each document accepted by the original DTD subsumes
some document accepted by the modified DTD.  Some of the original GIs
might need to be mapped into two or more GIs in the subsumed document.

For example:

  in the TEI, an inclusion exception on TEXT allows a linebreak element
  LB to occur anywhere within the TEXT element.  Since P (paragraph) can
  occur both inside TEXT and outside it (in the TEI header), this means
  an exception-less DTD will need two types of P (or more, if there are
  other possible locations of P which have different sets of effective
  exceptions:  the number of possible effective exception states will be
  somewhere between 2**N and N!, if N elements have exceptions in
  their declarations; it's 2**N if no GI occurs both as an inclusion
  and an exclusion, I think).

  If we call them P (in the text) and Ph (in the header), then we can
  create a DTD without exceptions which will accept precisely those
  documents which can be mapped back into the original TEI DTD by
  renaming all Ph elements as P.
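
  The two renamings in this example might be sketched in Python as
  follows.  This is an illustration only: documents are assumed to be
  nested (gi, children) tuples, the GIs TEI and HEADER are simplified
  stand-ins for the real TEI names, and the function names are
  hypothetical.

```python
def rename(doc, choose):
    """Rebuild a (gi, children) tree; choose(gi, ancestors) picks each new GI."""
    def walk(node, ancestors):
        gi, kids = node
        # ancestors carries the original GIs from the root down to the parent
        return (choose(gi, ancestors), [walk(k, ancestors + (gi,)) for k in kids])
    return walk(doc, ())

def to_exceptionless(doc):
    """SGML to XML direction: P inside the header becomes Ph."""
    return rename(doc, lambda gi, anc: "Ph" if gi == "P" and "HEADER" in anc else gi)

def to_original(doc):
    """XML to SGML direction: every Ph is renamed back to P."""
    return rename(doc, lambda gi, anc: "P" if gi == "Ph" else gi)

doc = ("TEI", [("HEADER", [("P", [])]),
               ("TEXT", [("P", [("LB", [])])])])
```

  Applying to_exceptionless and then to_original to doc gives doc back,
  which is the round trip the mapping is meant to preserve.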

Translation back into SGML could map the new GIs back into the original
GI, of course, if we could guarantee that the XML-SGML translator knew
about the original SGML-XML translation.  A normal process of reading
the XML document as an SGML document would generate a different DTD from
the original.

Approach B:  two-way validation with ESIS-equivalent documents, by
eliminating exceptions in the SGML

In practice, of course, the advantage of normalizing an SGML document
into XML and having it be valid SGML using the same DTD as the original
is so great that many of us, myself included, will be tempted to remove
exceptions from our SGML DTDs in order to have our SGML and XML DTDs
cover sets of EE-ESIS-equivalent documents, rather than just sets of
unifiable documents.  (This would also allow us to avoid having to
explain subsumption to a lot of people, which I admit is a modest gain.)

This would presumably gladden the hearts of monastic SGML-ers.

Approach C:  ESIS-equivalent documents, over- and under-generation

Given an SGML DTD with exceptions (call it X), it should also be
possible (should, meaning I haven't worked it all out yet) to generate
two DTDs (W and Y) which respectively underspecify and overspecify the
language accepted by X, meaning:
  - every document accepted by X is accepted by W but not necessarily
    vice versa (W is 'wider'?) -- W underspecifies, and overgenerates,
    the language
  - every document accepted by Y is accepted by X but not necessarily
    vice versa -- Y overspecifies, and undergenerates, the language

So translating an SGML document into XML using DTD W will ensure that
the XML user has all the freedom offered by the original DTD X, with
exceptions, while translating it using DTD Y (if possible) will
guarantee that if I edit it with a validating XML editor, it will still
be valid according to X when I give it back to you.
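
To make the relation between W, X, and Y concrete, here is a toy sketch
in Python built on the TEI example above.  It is not a construction of
W and Y from X, only one guess at what such a pair might look like for
a single element: the three "DTDs" are reduced to predicates on the
content of P in a given context, and all names are hypothetical.

```python
from itertools import product

def accepts_x(context, content):
    # X: LB is an inclusion on TEXT, so P may contain LB only inside TEXT
    return context == "TEXT" or "LB" not in content

def accepts_w(context, content):
    # W: a single context-free model for P that allows LB everywhere
    return True

def accepts_y(context, content):
    # Y: a single context-free model for P that never allows LB
    return "LB" not in content

# Every content sequence of up to two items, in both contexts
samples = [(ctx, seq)
           for ctx in ("TEXT", "HEADER")
           for n in range(3)
           for seq in product(("#PCDATA", "LB"), repeat=n)]

def included(narrow, wide):
    """True if every sample accepted by narrow is also accepted by wide."""
    return all(wide(*s) for s in samples if narrow(*s))
```

On these samples, included(accepts_y, accepts_x) and
included(accepts_x, accepts_w) both hold, while the reverse inclusions
fail: W admits LB inside a header P, and Y rejects LB inside a text P.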

I don't know whether being able to generate W and Y will help address
concerns like Paul Grosso's or not.

The key point is that if we wish to have the *option* of removing
exceptions from the DTD notation of XML, we cannot insist on the EE-ESIS
equivalence test in conjunction with the two-way-validation test I
described last night.  One or the other has got to be weakened; I'd
prefer that we weaken the instance-equivalence test, but only in certain
well-defined cases (such as DTDs with exceptions or instances with
asynchronous entities), and not weaken the two-way-validation test.


III Asynchronous entities

I think I've heard it proposed that all elements begin and end in the
same entity, rather than allowing entities the freedom they currently
have to be asynchronous.  This sometimes takes the form of requiring all
elements to begin and end in the document entity itself (i.e.
forbidding references to external entities, except perhaps by way of
attribute values).

The motive, I think, is to make it easier to parse and validate entities
separated from their references.
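
A check for the synchronous-entity rule might look something like the
following Python sketch.  It assumes fully normalized markup (no
omitted tags, no empty elements, no '>' inside attribute values), and
the function name is invented; it simply asks whether every element
opened in an entity's replacement text is also closed there.

```python
import re

# Matches a start-tag or end-tag and captures the slash and the GI
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")

def is_synchronous(entity_text):
    """True if start-tags and end-tags balance within this entity alone."""
    open_elements = []
    for closing, gi in TAG.findall(entity_text):
        if not closing:
            open_elements.append(gi)
        elif not open_elements or open_elements.pop() != gi:
            return False  # end-tag for an element opened in another entity
    return not open_elements  # leftovers would have to close elsewhere
```

So is_synchronous("<p>one</p><p>two</p>") holds, while
is_synchronous("</div><div>") does not, since that entity closes one
element and opens another across its boundary.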

I don't want to take a firm side on this now, since I don't know all the
arguments on both sides; from what I know, I think it's a useful
simplification if it doesn't wreak too much havoc for other users.  For
me personally it's not an issue since I don't use asynchronous entities
anyway; they confuse me.  I believe Author/Editor users won't have a
problem either, since I think A/E enforces this rule already.  I don't
know about other editors.  Perhaps the vendors will say a word?

If we choose to make XML require synchronous entities, then we won't be
able to sustain a guarantee of EE-ESIS-equivalence for documents which
currently have asynchronous entities.  That would seem to push us back
to guaranteeing ESIS-equivalence in all cases, and EE-ESIS-equivalence
only for some documents.

IV What's needed

I think we should allow ourselves to fall short of the goal of EE-ESIS
equivalence in some cases, if we think the restrictions involved are not
too burdensome on users and help implementation.  In the specific case
of content-model exceptions, I think eliminating them is enough of a
gain in simplicity that the cost (more complex SGML-XML translations, or
loss of expressive power in DTDs) is worth it.

-C. M. Sperberg-McQueen