- From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- Date: Wed, 09 Oct 96 16:34:31 CDT
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
The discussions of entities and conditional inclusion seem to me to suggest we may need clarification of some issues, raised by questions A8 and A17.

We might disagree over these because of different views on whether entities, or conditional inclusion, are needed or desirable for XML as a mechanism for publication on wide-area networks. We might disagree because of different views on whether they are essential for document management in production work on any reasonably large body of documents. Or we might disagree because of different views on whether XML's task is solely to support network distribution and publication of documents, or to support, as far as possible, production work in managing those documents.

There doesn't seem to be a lot of need for discussion on the first point. I think HTML demonstrates that neither external text entities, nor conditional inclusion in the DTD, is essential to wide deployment and acceptance of a system for network distribution of documents. Some may think HTML's lack of each of these is a weak point, and it would be better to have them, but we can't reasonably claim that 'No one will accept a document markup language that doesn't have external entities', any more than we could claim 'No one will accept a markup language that requires lots of angle brackets.' (I used to hear that a lot. I don't, so much, anymore. Hmm.)

The second point may be more controversial. It seems so obvious to me that conditional inclusion and external text entities are essential for acceptable document management that I have a hard time making a serious argument; I tend to collapse into sputtering incoherence. But in case anyone really thinks they're not needed, I'll try.

On conditional inclusion, I'll just note that there seem to be quite a few public DTDs which have found it necessary to have more than one flavor, and which use conditional inclusion of declarations to accomplish that feat.
In some cases, there are just two flavors; in the case of the TEI, it's something like a few hundred thousand flavors (not counting variations caused by suppressing individual elements; if you count those, there are probably 2**400 or so flavors). For production work, it seems better if we can generate the required flavors from a single copy of the DTD, rather than keeping a copy of each flavor. This simplifies updates, too. The TEI may be unusual in its extreme variety, but even HTML has multiple flavors controlled by marked sections.

On external entities:

(1) External entities allow me to divide a document up into convenient chunks for editing, for exchange with others working on the same project, for check-in and check-out, version control, etc. Any division into files does this.

(2) Keeping entity syntax allows me to express, in a standard way, how the different entities of which a document is composed fit together. The 'cat them all together then call the parser' technique does NOT allow me to do this: I lose the ability to document, in a standard way, how the entity structure of the document works. (It also forces me to split my root entity asynchronously in a way I normally avoid.)

(3) In 8879, entities referred to are parsed in context, and the structure of the document as a whole is validated by the parser. The 'just refer to them from an empty element' technique makes all my external entities opaque to validation: even if they are valid when viewed in isolation, this technique offers no guarantee that the elements in the external entities are valid at the point of reference. If I want to validate the external entities at all, this technique also limits me to single-element external entities, which is not necessarily exactly what I want. Validation using a document grammar is the jewel of SGML. I don't want to be forced to do without it.
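
To make the 'parsed in context' part of point (3) concrete, here is a minimal sketch in Python using the standard library's expat bindings. The sample document, the `stream:status` system identifier, and the `parse_with_streams` helper are all invented for illustration. Note that expat is a non-validating parser, so the sketch shows only that the parser, rather than the application, splices the entity's content in at the point of reference; it does not show validation against a document grammar.

```python
import xml.parsers.expat

# Hypothetical document: the root entity declares one external entity and
# references it at the point where its content belongs.
DOC = """<!DOCTYPE report [
  <!ENTITY status SYSTEM "stream:status">
]>
<report><title>Status</title>&status;</report>"""


def parse_with_streams(doc, producers):
    """Parse doc, resolving each external entity via producers[system_id]().

    `producers` is an assumed mapping from system identifier to a callable
    that returns the entity's replacement text as a string.
    """
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver character data in whole runs

    def resolve(context, base, system_id, public_id):
        # Obtain the entity's text only when the reference is encountered.
        text = producers[system_id]()
        # The child parser inherits the parent's handlers, so the entity's
        # content is reported in context, inside <report>.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(text, True)
        return 1  # nonzero tells expat the entity was handled

    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.ExternalEntityRefHandler = resolve
    parser.Parse(doc, True)
    return events
```

Calling `parse_with_streams(DOC, {"stream:status": lambda: "<para>ok</para>"})` reports the `para` element between the end of `title` and the end of `report`, exactly as if its text had been typed at the point of reference.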
Reference to external entities via attribute values (which amounts to the reinvention of the GML Starter Set's INCLUDE tag, twenty-odd years later) also shifts responsibility for the include-and-handle-here behavior from the parser for the language to the application.

In general, whenever there is some constraint which must be honored by every application, or some behavior which every application must perform, there's a good case for making the constraint expressible in the language itself, and making the basic processor responsible for the behavior. In databases, constraints on the data should be expressed in the schema and enforced by the DBMS. Leaving them to be enforced by every application programmer who touches the database for read or write is, in general, not a good idea. In SGML we have a well understood way of saying "That thing over there is part of this document; it goes HERE", using references to external entities. It's appropriate that it be in the markup language, not just in the applications' style sheet languages, since processing a document generally requires that we know what is and what is not part of the document. Moving that knowledge outside of the markup language is NOT a step forward in the history of document representation.

(4) External entities can be (at least, I think they can) not just files but any data stream. External entities, that is, make it possible to build what Dave Sklar calls 'spontaneously combusting' documents: documents whose external entities are data streams created on demand, at parse time, and thus guaranteed up to date. Take away external entities, and how are we to do that?

Even if everyone agrees that these constructs / features are (a) required for serious production work and (b) not required for net-based distribution, we may still disagree over whether they belong in XML. XML could be SGML-for-clients: nothing there that isn't essential to allow a client to parse it.
Or it could be a more serious language, a flavor of SGML stripped down enough to allow easier implementation and make it feasible to implement in a client, but strong enough that a lot of serious work can be done in it, so that most of us could use it, most of the time, and publication on the net would not ALWAYS involve a serious down-translation and loss of information.

On the whole, I'd rather have XML be a strong, useful language, not limited to use in network publishing. That's what I think Goal 2 is for. If most of us, most of the time, need more than XML will provide and will have to do our daily production work in Full SGML, then what will XML have bought us? A slightly better publication medium than HTML 2.0 or HTML 3.2? We don't need a group this high-powered to do that: the mountains give birth, and bring forth a mouse?

In short: I think both conditional inclusion of DTD fragments and normal SGML-style support for external text entities are essential, and belong in XML. If we want to specify that servers should expand references to external entities before serving to clients, that's OK by me, but we may want to look for other ways to specify whether inclusions should be done on the server side or the client side. Either way, the syntax for references to external text entities (and their declarations) needs to be in XML, unless we are content with a niche language when we could have a stronger one.

-C. M. Sperberg-McQueen
Received on Wednesday, 9 October 1996 18:34:55 UTC