Re: A17: keep or drop entities? from James Clark on 1996-10-07 (w3c-sgml-wg@w3.org from October 1996)

From: James Clark <jjc@jclark.com>
Date: Mon, 07 Oct 1996 15:10:17 +0000
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <2.2.32.19961007151017.0092958c@jclark.iserver.com>
I am only going to address entities in the document instance in this note.
Consideration of other kinds of entity will need to wait until the syntax
for prologs has been decided upon.

I think different kinds of entities need to be considered separately.

1. External entities

1a. External text entities

These are the most problematic kind of entity.  I believe these very
substantially complicate implementation.

(i) With these you can no longer do useful processing with simple 5-line
Perl scripts.

(ii) They radically increase the complexity of interfacing a parser to an
application.  Without external text entities, you can just give a parser a
file or a string and get back a tree or a sequence of events.  With external
text entities, the parser has to be able to call back into the application
to get the contents of declared entities.  In particular I think external
text entities would make it much harder to interface XML parsers to typical
Web browsers.

(iii) The parser no longer reads from a single input stream: it has to have
a stack of input streams and be able to switch between them.

(iv) I think it's hard to write SGML editors that allow good control over
entity structure at the same time as element structure.

(v) External text entities make it much harder to do transformations on XML
documents.  Without external text entities, a transformation can work as a
filter that takes a single file as input and writes a single file as output,
and so transformations can be easily pipelined together.  If transformations
must work with and preserve external text entities, this sort of
approach is no longer possible.
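
To make points (ii) and (iii) concrete, here is a minimal sketch in Python of
what a parser has to do once text entities can be referenced from the
instance.  It is a toy, not a real parser: it only understands "&name;"
references, and the names expand() and resolve_entity are made up for
illustration.  The resolve_entity callback stands in for the call back into
the application, and the stack list is the stack of input streams.

```python
import io


def expand(text, resolve_entity):
    """Expand &name; references, fetching replacement text via a callback."""
    out = []
    # Stack of (entity name, input stream); the document itself is at the bottom.
    stack = [("#document", io.StringIO(text))]
    while stack:
        _, stream = stack[-1]
        ch = stream.read(1)
        if ch == "":
            # Current entity is exhausted: resume reading the one that referenced it.
            stack.pop()
        elif ch == "&":
            name = ""
            while True:
                ch = stream.read(1)
                if ch == "":
                    raise ValueError("unterminated entity reference")
                if ch == ";":
                    break
                name += ch
            # An entity must not (directly or indirectly) reference itself.
            if any(name == open_name for open_name, _ in stack):
                raise ValueError(f"recursive entity reference: {name}")
            # Call back into the application and switch to the new input stream.
            stack.append((name, io.StringIO(resolve_entity(name))))
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical entity store standing in for the application.
entities = {"chap1": "Hello &who;!", "who": "world"}
result = expand("A: &chap1; Z", entities.__getitem__)
```

Without such entities the whole function collapses to reading one stream
from start to finish, with no callback and no stack.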

What is the functionality that external text entities are providing?

- The most important is a mechanism for reuse.  But I would suggest that
it's better to handle reuse by making a self-contained, independently
parseable XML document of whatever it is you want to reuse, and having an
element whose semantics are to include that document in context.  Of course,
this requires some way to express that semantic and ways to handle
hyperlinking between independent documents; but these are problems we are
going to have to solve anyway.

- External text entities also provide a mechanism for subdividing a document
into separate storage objects for convenience in, for example, version
control or text editing.  I think there are much more powerful approaches to
this sort of problem.  For example, if I keep the document in some sort of
OO database, I can dynamically make an entity out of whatever fragment of
the element structure I want.  I think including support for external text
entities makes it much harder to support these more powerful approaches.
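
The element-based reuse suggested above could look roughly like the
following Python sketch.  The element name "include", its "href" attribute,
and the load callback are all assumptions made for illustration; the message
does not propose any particular syntax.  The point is only that each reused
piece is parsed as a complete document in its own right and then spliced into
the tree.

```python
import xml.etree.ElementTree as ET


def resolve_includes(root, load):
    """Replace each <include href="..."/> child with the root of the named document."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "include":
                # The target is parsed independently, as a self-contained document.
                included = ET.fromstring(load(child.get("href")))
                resolve_includes(included, load)
                included.tail = child.tail  # keep whitespace after the element
                parent[index] = included
    return root


# Hypothetical document store standing in for files or URLs.
documents = {"ch1.xml": "<chapter><title>One</title></chapter>"}
book = ET.fromstring('<book><include href="ch1.xml"/></book>')
resolve_includes(book, documents.__getitem__)
```

Because the inclusion happens on the parsed tree, the parser itself still
reads a single input stream per document.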

1b. External data entities

These are much less troublesome than text entities: a parser just has to
pass the external identifier and notation to the application. However, I
think the Web community is so used to the simplicity and convenience of
being able to stick a URL directly in an attribute value that they are not
going to buy into the idea that they should instead

- make up an entity name;

- include a separate entity declaration;

- include a separate notation declaration;

- create an SGML Open catalog.

Using external data entities also has the substantial disadvantage that each
document typically ends up requiring its own DTD subset that declares the
external data entities it needs.  I think this will make DTD-less parsing
much harder.

The only extra information that using external data entities provides is the
notation, but this is marginally useful in an HTTP world where that
information is in the MIME header.  Information in data attributes can just
as well be put as extra attributes on the element that includes the URL
attribute value.

2. Internal entities

2a. Internal CDATA entities

I think these are useful especially for allowing the use of readable entity
names instead of numeric character references.  Since they can contain
neither elements nor nested entity references, they complicate
implementation very little.  They also don't crop up in the ESIS, so they
don't complicate the data model.  On the whole I would be in favour of
keeping these, maybe with the restriction that they must contain exactly one
character.
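
As a rough illustration of how little machinery this needs, here is a Python
sketch that expands single-character named entities and decimal numeric
character references in one pass.  The table contents and the function name
are made up for the example.  Since the replacement text is never rescanned,
there is no nesting to track.

```python
import re

# Hypothetical table: each entity name maps to exactly one character.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "eacute": "\u00e9", "nbsp": "\u00a0"}

# Matches &name; and decimal numeric references such as &#233;
REFERENCE = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def expand_references(text, entities=ENTITIES):
    """Replace each reference with its character; the result is not rescanned."""
    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            return chr(int(ref[1:]))
        return entities[ref]  # an undeclared name raises KeyError
    return REFERENCE.sub(replace, text)


expanded = expand_references("caf&eacute; &#233; a &lt; b")
```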

2b. Internal SDATA entities

Given that we can have Unicode character references, and given that Unicode
has a private use zone, and given internal CDATA entities, I don't think
these offer any really useful functionality.  They also complicate the
interface: with internal SDATA entities, you can't treat attributes simply as
strings (unless your application maps them onto characters).

2c. Internal PI entities

Probably insufficiently useful to justify the complexity.

2d. Internal text entities

These have some of the implementation problems that external text entities
have.  I would be against including them (but not as strongly as I am
against external text entities).

In summary, the more I think about it, the more I am convinced that XML will
fail if it includes external text entities.  I would recommend having
exactly one kind of entity, namely internal CDATA entities, which is, in
fact, exactly what HTML does.

James
Received on Monday, 7 October 1996 10:16:08 UTC