- From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- Date: Tue, 17 Sep 96 11:36:26 CDT
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Thinking about the problem of document and DTD equivalence on the train going home last night, I realized my long posting of last night didn't deal with a couple of important issues.

I Further Types of Instance Equivalence

To discuss them usefully, I need to propose names for two more types of document equivalence. Document instances are

- isomorphic, if they can be made ESIS-equivalent by consistently renaming generic identifiers and attributes (and possibly by global changes on tokens in attribute values ...) in one document, while retaining all the distinctions made in that document -- i.e. two elements have the same GI / two attributes have the same name / two attribute values are the same after the renaming if and only if they were the same beforehand. I think this is the same as saying two documents are isomorphic if there is a one-to-one mapping between their sets of GIs and attribute names, and we can use that mapping to make the documents ESIS-equivalent.

- unifiable (or structurally equivalent), if they can be made ESIS-equivalent by a consistent renaming in one or both documents that does not necessarily preserve all original distinctions. This boils down to saying the documents have the same number of elements, corresponding elements have the same number of children, and there is a mapping between their name spaces that has one:one, one:many, many:one, but no many:many relations.

If two unifiable documents A and B can be made ESIS-equivalent by renaming things just in B, then A *subsumes* B; roughly, if A subsumes B then they are structurally similar and B may have more information because it may make more distinctions. A really useful definition of subsumption of attribute values will probably allow B to have some attributes that A doesn't have, and may have subtler rules about attribute values, but I think we can postpone that for a while. If each document subsumes the other, they are isomorphic.
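
The definitions above can be made concrete with a small sketch. It is only an illustration under loud simplifying assumptions: documents are reduced to bare element trees (no attributes, no character data), and the names `Element`, `name_pairs`, `is_function` and `classify` are invented here, not taken from any SGML tool. It pairs up the GIs of corresponding elements and then asks in which direction that pairing is a function.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """A bare element: a generic identifier plus child elements (hypothetical model)."""
    gi: str
    children: tuple = ()


def name_pairs(a, b):
    """Collect (GI in A, GI in B) for corresponding elements; None if the shapes differ."""
    if len(a.children) != len(b.children):
        return None
    pairs = {(a.gi, b.gi)}
    for child_a, child_b in zip(a.children, b.children):
        sub = name_pairs(child_a, child_b)
        if sub is None:
            return None
        pairs |= sub
    return pairs


def is_function(pairs):
    """True if every left-hand name is paired with exactly one right-hand name."""
    seen = {}
    for left, right in pairs:
        if seen.setdefault(left, right) != right:
            return False
    return True


def classify(a, b):
    """Classify two trees by the kind of renaming that would make them match."""
    pairs = name_pairs(a, b)
    if pairs is None:
        return "different shape"
    if all(x == y for x, y in pairs):
        return "ESIS-equivalent"  # no renaming needed at all
    a_to_b = is_function(pairs)                          # renaming in A alone suffices
    b_to_a = is_function({(y, x) for x, y in pairs})     # renaming in B alone suffices
    if a_to_b and b_to_a:
        return "isomorphic"
    if b_to_a:
        return "A subsumes B"
    if a_to_b:
        return "B subsumes A"
    return "same shape only"


# Document B distinguishes header paragraphs (ph) from text paragraphs (p); A does not.
doc_a = Element("tei", (Element("header", (Element("p"),)), Element("text", (Element("p"),))))
doc_b = Element("tei", (Element("header", (Element("ph"),)), Element("text", (Element("p"),))))
print(classify(doc_a, doc_b))  # A subsumes B
```

The sketch deliberately stops short of deciding unifiability in the "no many:many" sense; the last case only reports that the two trees have the same shape.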
If they are isomorphic and no renaming is necessary at all, they are ESIS-equivalent.

Isomorphic and unifiable document pairs will become important if we wish to consider changes to the DTD language like

- elimination of inclusion exceptions (this would have an effect on our RE/RS discussions, but killing inclusions to solve RE is cracking a walnut with a steam engine)
- elimination of exclusion exceptions
- allowing XML DTDs to accept arbitrary regular expressions, not just deterministic ones
- requiring external entities to be synchronous with the element structure (i.e. requiring start-tag and end-tag for an element to occur in the same entity, or at least requiring both to be either in the document entity or in the same external entity)

If we want to hold fast to the two-way-validation test for DTD equivalence I proposed last night (for each SGML DTD there is an equivalent XML DTD that accepts a set of documents such that each member of the SGML document set maps to an equivalent document in XML and vice versa, using some measure of document-instance equivalence), then limiting the use of exceptions, or requiring synchronous entity boundaries, would require weakening the equivalence test for document instances.

II Inclusion and Exclusion Exceptions

Approach A: two-way validation and unifiable documents

For any SGML DTD that has inclusion or exclusion exceptions, we can construct (in a potentially laborious calculation) a DTD without exceptions such that each document accepted by the original DTD subsumes some document accepted by the modified DTD. Some of the original GIs might need to be mapped into two or more GIs in the subsumed document.

For example: in the TEI, an inclusion exception on TEXT allows a linebreak element LB to occur anywhere within the TEXT element.
Since P (paragraph) can occur both inside TEXT and outside it (in the TEI header), this means an exception-less DTD will need two types of P (or more, if there are other possible locations of P which have different sets of effective exceptions: the number of possible effective exception states will be somewhere between 2 ** N and N!, if N elements have exceptions in their declarations; it's 2 ** N if no GI occurs both as an inclusion and an exclusion, I think). If we call them P (in the text) and Ph (in the header), then we can create a DTD without exceptions which will accept precisely those documents which can be mapped back into the original TEI DTD by renaming all Ph elements as P.

Translation back into SGML could map the new GIs back into the original GI, of course, if we could guarantee that the XML-SGML translator knew about the original SGML-XML translation. A normal process of reading the XML document as an SGML document would generate a different DTD from the original.

Approach B: two-way validation with ESIS-equivalent documents, by eliminating exceptions in the SGML

In practice, of course, the advantage of normalizing an SGML document into XML and having it be valid SGML using the same DTD as the original is so great that many people, myself included, will be tempted to remove exceptions from our SGML DTDs in order to have our SGML and XML DTDs cover sets of EE-ESIS-equivalent documents, rather than just sets of unifiable documents. (This would also allow us to avoid having to explain subsumption to a lot of people, which I admit is a modest gain.) This would presumably gladden the hearts of monastic SGML-ers.
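
To return briefly to Approach A: the reason P has to split into P and Ph can be sketched by walking a DTD and recording, for each GI, the sets of inclusions in effect wherever it can occur. What follows is a toy illustration, not a real DTD processor. It assumes content models are reduced to plain lists of permitted children, handles inclusions only (exclusions are ignored), and uses invented names (`effective_states`, `content`, `inclusions`) and a drastically cut-down stand-in for the TEI.

```python
def effective_states(root, content, inclusions):
    """Map each GI to the distinct inherited-inclusion sets it can occur under.

    content:    GI -> iterable of child GIs allowed by its (simplified) content model
    inclusions: GI -> iterable of GIs named in its inclusion exception
    """
    states = {}
    stack = [(root, frozenset())]
    while stack:
        gi, inherited = stack.pop()
        if inherited in states.setdefault(gi, set()):
            continue  # this (GI, state) combination was already explored
        states[gi].add(inherited)
        # Inclusions declared on this element apply to everything inside it.
        active = inherited | frozenset(inclusions.get(gi, ()))
        for child in set(content.get(gi, ())) | active:
            stack.append((child, active))
    return states


# Toy stand-in for the TEI case: LB is included anywhere inside TEXT.
content = {"tei": ["header", "text"], "header": ["p"], "text": ["p"], "p": [], "lb": []}
inclusions = {"text": ["lb"]}

states = effective_states("tei", content, inclusions)
# GIs that occur under more than one state need one variant per state (P and Ph).
needs_split = sorted(gi for gi, seen in states.items() if len(seen) > 1)
print(needs_split)  # ['p']
```

Each distinct state for a GI corresponds to one renamed variant in the exception-less DTD, which is where the growth in the number of element types comes from.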

Approach C: ESIS-equivalent documents, over- and under-generation

Given an SGML DTD with exceptions (call it X), it should also be possible (should, meaning I haven't worked it all out yet) to generate two DTDs (W and Y) which underspecify or overspecify the language accepted by X, meaning:

- every document accepted by X is accepted by W but not necessarily vice versa (W is 'wider'?) -- W underspecifies, and overgenerates, the language
- every document accepted by Y is accepted by X but not necessarily vice versa -- Y overspecifies, and undergenerates, the language

So translating an SGML document into XML using DTD W will ensure that the XML user has all the freedom offered by the original DTD X, with exceptions, while translating it into Y (if possible) will guarantee that if I edit it with a validating XML editor, it will still be valid according to X when I give it back to you.

I don't know whether being able to generate W and Y will help address concerns like Paul Grosso's or not.

The key point is that if we wish to have the *option* of removing exceptions from the DTD notation of XML, we cannot insist on the EE-ESIS equivalence test in conjunction with the two-way-validation test I described last night. One or the other has got to be weakened; I'd prefer that we weaken the instance-equivalence test, but only in certain well-defined cases (such as DTDs with exceptions or instances with asynchronous entities), and not weaken the two-way-validation test.

III Asynchronous entities

I think I've heard it proposed that all elements begin and end in the same entity, rather than allowing entities the freedom they currently have to be asynchronous. This sometimes takes the form of requiring all elements to begin and end in the document entity itself (i.e. forbidding references to external entities, except perhaps by way of attribute values). The motive, I think, is to make it easier to parse and validate entities separated from their references.
I don't want to take a firm side on this now, since I don't know all the arguments on both sides; from what I know, I think it's a useful simplification if it doesn't wreak too much havoc for other users. For me personally it's not an issue since I don't use asynchronous entities anyway; they confuse me. I believe Author/Editor users won't have a problem either, since I think A/E enforces this rule already. I don't know about other editors. Perhaps the vendors will say a word?

If we choose to make XML require synchronous entities, then we won't be able to sustain a guarantee of EE-ESIS-equivalence for documents which currently have asynchronous entities. That would seem to push us back to ESIS-equivalence in all cases, and EE-ESIS equivalence for some documents.

IV What's needed

I think we should allow ourselves to fall short of the goal of EE-ESIS equivalence in some cases, if we think they are not too burdensome on users and help implementation. In the specific case of content-model exceptions, I think eliminating them is enough of a gain in simplicity that the cost (more complex SGML-XML translations, or loss of expressive power in DTDs) is worth it.

-C. M. Sperberg-McQueen
Received on Tuesday, 17 September 1996 14:54:58 UTC