- From: James Clark <jjc@jclark.com>
- Date: Mon, 07 Oct 1996 15:10:17 +0000
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
I am only going to address entities in the document instance in this note. Consideration of other kinds of entity will need to wait until the syntax for prologs has been decided upon.

I think different kinds of entities need to be considered separately.

1. External entities

1a. External text entities

These are the most problematic kind of entity. I believe these very substantially complicate implementation.

(i) With these you can no longer do useful processing with simple 5-line Perl scripts.

(ii) They radically increase the complexity of interfacing a parser to an application. Without external text entities, you can just give a parser a file or a string and get back a tree or a sequence of events. With external text entities, the parser has to be able to call back into the application to get the contents of declared entities. In particular I think external text entities would make it much harder to interface XML parsers to typical Web browsers.

(iii) The parser no longer reads from a single input stream: it has to have a stack of input streams and be able to switch between them.

(iv) I think it's hard to write SGML editors that allow good control over entity structure at the same time as element structure.

(v) External text entities make it much harder to do transformations on XML documents. Without external text entities, transformations can work as a filter that takes a single file as input and writes a single file as output, and so transformations can be easily pipelined together. If transformations must work with and preserve external text entities, this sort of approach is no longer possible.

What is the functionality that external text entities are providing?

- The most important is a mechanism for reuse. But I would suggest that it's better to handle reuse by making a self-contained, independently parseable XML document of whatever it is you want to reuse, and having an element whose semantics are to include that document in context.
Of course, this requires some way to express that semantic and ways to handle hyperlinking between independent documents; but these are problems we are going to have to solve anyway.

- External text entities also provide a mechanism for subdividing a document into separate storage objects for convenience in, for example, version control or text editing. I think there are much more powerful approaches to this sort of problem. For example, if I keep the document in some sort of OO database, I can dynamically make an entity out of whatever fragment of the element structure I want. I think including support for external text entities makes it much harder to support these more powerful approaches.

1b. External data entities

These are much less troublesome than text entities: a parser just has to pass the external identifier and notation to the application. However, I think the Web community is so used to the simplicity and convenience of being able to stick a URL directly in an attribute value that they are not going to buy into the idea that they should instead

- make up an entity name;
- include a separate entity declaration;
- include a separate notation declaration;
- create an SGML Open catalog.

Using external data entities also has the substantial disadvantage that each document typically ends up requiring its own DTD subset that declares the external data entities it needs. I think this will make DTD-less parsing much harder.

The only extra information that using external data entities provides is the notation, but this is marginally useful in an HTTP world where that information is in the MIME header. Information in data attributes can just as well be put as extra attributes on the element that includes the URL attribute value.

2. Internal entities

2a. Internal CDATA entities

I think these are useful especially for allowing the use of readable entity names instead of numeric character references.
Since they can contain neither elements nor nested entity references, they complicate implementation very little. They also don't crop up in the ESIS, so they don't complicate the data model. On the whole I would be in favour of keeping these, maybe with the restriction that they must contain exactly one character.

2b. Internal SDATA entities

Given that we can have Unicode character references, and given that Unicode has a private use zone, and given internal CDATA entities, I don't think these offer any really useful functionality. They also complicate the interface: with internal SDATA entities, you can't treat attributes simply as strings (unless your application maps them onto characters).

2c. Internal PI entities

Probably insufficiently useful to justify the complexity.

2d. Internal text entities

These have some of the implementation problems that external text entities have. I would be against including them (but not as strongly as I am against external text entities).

In summary, the more I think about it, the more I am convinced that XML will fail if it includes external text entities. I would recommend having exactly one kind of entity, namely internal CDATA entities, which is, in fact, exactly what HTML does.

James
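
The point about internal CDATA entities (2a) can be illustrated with a minimal sketch. It is not part of the original message: the entity table, the `expand` function, and the simplified reference syntax are all hypothetical, chosen only to show that when replacement text is literal and never rescanned, expansion is a single pass over one string with no input stack and no callback into the application.

```python
import re

# Hypothetical entity table (an assumption for illustration): each internal
# CDATA entity maps a readable name to literal replacement text.
ENTITIES = {"eacute": "\u00e9", "copy": "\u00a9", "amp": "&", "lt": "<"}

# Simplified reference syntax: &name; or a decimal character reference &#N;
_REF = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9.\-]*);")


def expand(text, entities=ENTITIES):
    """Expand entity and decimal character references in a single pass."""

    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            # Numeric character reference: convert the code point directly.
            return chr(int(ref[1:]))
        if ref not in entities:
            raise ValueError("undeclared entity: " + ref)
        # The replacement text is inserted as-is; re.sub does not rescan
        # it, so entities cannot nest and no stack of inputs is needed.
        return entities[ref]

    return _REF.sub(replace, text)


if __name__ == "__main__":
    print(expand("caf&eacute; &copy; 1996"))
```

Because the replacement text is never reparsed, `expand("&amp;lt;")` yields the four characters `&lt;` rather than `<`. External text entities, by contrast, would require the parser to fetch and parse further input in the middle of the scan.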
Received on Monday, 7 October 1996 10:16:08 UTC