A8 and A17: entities, conditional inclusion, what's XML for?

The discussions of entities and conditional inclusion seem to me to
suggest we may need clarification of some issues, raised by questions A8
and A17.

We might disagree over these because of different views on whether
entities, or conditional inclusion, are needed or desirable for XML as a
mechanism for publication on wide-area networks.

We might disagree because of different views on whether they are
essential for document management in production work on any reasonably
large body of documents.

Or we might disagree because of different views on whether XML's task is
solely to support network distribution and publication of documents, or
to support, as far as possible, production work in managing those
documents.


There doesn't seem to be a lot of need for discussion on the first
point.  I think HTML demonstrates that neither external text entities,
nor conditional inclusion in the DTD, is essential to wide deployment
and acceptance of a system for network distribution of documents.  Some
may think HTML's lack of each of these is a weak point, and it would be
better to have them, but we can't reasonably claim that 'No one will
accept a document markup language that doesn't have external entities',
any more than we could claim 'No one will accept a markup language that
requires lots of angle brackets.'  (I used to hear that a lot.  I don't,
so much, anymore.  Hmm.)


The second point may be more controversial.  It seems so obvious to me
that conditional inclusion and external text entities are essential for
acceptable document management that I have a hard time making a serious
argument; I tend to collapse into sputtering incoherence.  But in case
anyone really thinks they're not needed, I'll try.

On conditional inclusion, I'll just note that there seem to be quite a
few public DTDs which have found it necessary to have more than one
flavor, and which use conditional inclusion of declarations to
accomplish that feat.  In some cases, there are just two flavors; in the
case of the TEI, it's something like a few hundred thousand flavors (not
counting variations caused by suppressing individual elements; if you
count those, there are probably 2**400 or so flavors).  For production
work, it seems better if we can generate the required flavors from a
single copy of the DTD, rather than keeping a copy of each flavor. This
simplifies updates, too.  The TEI may be unusual in its extreme variety,
but even HTML has multiple flavors controlled by marked sections.
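
To make the single-copy idea concrete, here is a toy sketch in Python.
It is not an SGML parser: it only handles non-nested marked sections
whose keyword is a parameter-entity reference, and the names `flavor`,
`MASTER`, `verse` and `drama` are hypothetical, chosen for illustration.

```python
import re

# Matches a non-nested marked section whose keyword is a parameter-entity
# reference, e.g. <![%verse;[ ... ]]>
SECTION = re.compile(r"<!\[\s*%([\w.-]+);\s*\[(.*?)\]\]>", re.DOTALL)


def flavor(master, switches):
    """Keep sections whose switch is INCLUDE; drop all the others."""
    def pick(match):
        keep = switches.get(match.group(1), "IGNORE") == "INCLUDE"
        return match.group(2) if keep else ""
    return SECTION.sub(pick, master)


# One master copy of a (hypothetical) DTD with two optional parts.
MASTER = """<!ELEMENT text (#PCDATA)>
<![%verse;[<!ELEMENT line (#PCDATA)>]]>
<![%drama;[<!ELEMENT speech (#PCDATA)>]]>"""

verse_only = flavor(MASTER, {"verse": "INCLUDE"})
both = flavor(MASTER, {"verse": "INCLUDE", "drama": "INCLUDE"})
```

Each flavor is derived from MASTER on request, so a fix made to the
master copy reaches every flavor without further editing.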

On external entities:

(1) External entities allow me to divide a document up into convenient
chunks for editing, for exchange with others working on the same
project, for check-in and check-out, version control, etc.  Any division
into files does this.

(2) Keeping entity syntax allows me to express, in a standard way, how
the different entities of which a document is composed fit together.
The 'cat them all together then call the parser' technique does NOT
allow me to do this:  I lose the ability to document, in a standard way,
how the entity structure of the document works.  (It also forces me to
split my root entity asynchronously in a way I normally avoid.)

(3) In 8879, entities referred to are parsed in context, and the
structure of the document as a whole is validated by the parser.  The
'just refer to them from an empty element' technique makes all my
external entities opaque to validation:  even if they are valid when
viewed in isolation, this technique offers no guarantee that the
elements in the external entities are valid at the point of reference.
If I want to validate the external entities at all, this technique also
limits me to single-element external entities, which is not necessarily
exactly what I want.  Validation using a document grammar is the jewel
of SGML.  I don't want to be forced to do without it.

Reference to external entities via attribute values (which amounts to
the reinvention of the GML Starter Set's INCLUDE tag, twenty-odd years
later) also shifts responsibility for the include-and-handle-here
behavior from the parser for the language to the application.  In
general, whenever there is some constraint which must be honored by
every application, or some behavior which every application must
perform, there's a good case for making the constraint expressible in
the language itself, and making the basic processor responsible for the
behavior.  In databases, constraints on the data should be expressed in
the schema and enforced by the DBMS.  Leaving them to be enforced by
every application programmer who touches the database for read or write
is, in general, not a good idea.  In SGML we have a well understood way
of saying "That thing over there is part of this document; it goes
HERE", using references to external entities.  It's appropriate that it
be in the markup language, not just in the applications' style sheet
languages, since processing a document generally requires that we know
what is and what is not part of the document.  Moving that knowledge
outside of the markup language is NOT a step forward in the history of
document representation.

(4) External entities can be (at least, I think they can) not just files
but any data stream.  External entities, that is, make it possible to
build what Dave Sklar calls 'spontaneously combusting' documents:
documents whose external entities are data streams created on demand, at
parse time, and thus guaranteed up to date.  Take away external
entities, and how are we to do that?
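
As a rough illustration of point (4), and of the parse-in-context
behavior in point (3), here is a minimal sketch using the expat binding
in Python's standard library.  The entity name `status`, the system
identifier `status-stream`, and the helper `parse_with_streams` are
assumptions made up for the example; expat checks well-formedness only,
so this shows where the entity's content lands, not validation against
a DTD.

```python
import itertools
import xml.parsers.expat


def parse_with_streams(root_text, streams):
    """Parse root_text; fetch each external entity by calling
    streams[system_id]() at the moment the reference is met."""
    names, text = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.CharacterDataHandler = text.append

    def fetch(context, base, system_id, public_id):
        # The sub-parser inherits the handlers set above, so the entity's
        # elements are reported in place, inside the referring element.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(streams[system_id](), True)
        return 1  # nonzero tells expat the entity was handled

    parser.ExternalEntityRefHandler = fetch
    parser.Parse(root_text, True)
    return names, "".join(text)


ROOT = """<!DOCTYPE report [
  <!ENTITY status SYSTEM "status-stream">
]>
<report><title>Build</title>&status;</report>"""

# The "stream" is produced fresh on every call, i.e. at parse time.
counter = itertools.count(1)
streams = {"status-stream": lambda: "<p>run %d</p>" % next(counter)}

first = parse_with_streams(ROOT, streams)
second = parse_with_streams(ROOT, streams)
```

The same root entity yields different content on each parse, because
the referenced entity is generated only when the parser asks for it.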


Even if everyone agrees that these constructs / features are (a)
required for serious production work and (b) not required for
net-based distribution, we may still disagree over whether they
belong in XML.  XML could be SGML-for-clients:  nothing there that
isn't essential to allow a client to parse it.  Or it could be
a more serious language, a flavor of SGML stripped down enough to
allow easier implementation and make it feasible to implement in a
client, but strong enough that a lot of serious work can be done in it,
so that most of us could use it, most of the time, and publication
on the net would not ALWAYS involve a serious down-translation and loss
of information.

On the whole, I'd rather have XML be a strong, useful language, not
limited to use in network publishing.  That's what I think Goal 2 is
for.  If most of us, most of the time, need more than XML will provide
and will have to do our daily production work in Full SGML, then what
will XML have bought us?  A slightly better publication medium than HTML
2.0 or HTML 3.2?  We don't need a group this high-powered to do that:
the mountains give birth, and bring forth a mouse?

In short:  I think both conditional inclusion of DTD fragments and
normal SGML-style support for external text entities are essential, and
belong in XML.  If we want to specify that servers should expand
references to external entities before serving to clients, that's OK by
me, but we may want to look for other ways to specify whether inclusions
should be done on the server side or the client side.  Either way, the
syntax for references to external text entities (and their declarations)
needs to be in XML, unless we are content with a niche language when we
could have a stronger one.

-C. M. Sperberg-McQueen

Received on Wednesday, 9 October 1996 18:34:55 UTC