Re: XML, HTML, SGML, life, the universe, and everything

On Fri, 08 Nov 1996 12:17:15 -0800, Tim Bray <tbray@textuality.com> wrote:

>1. What is XML For?
>Jon has pointed out that by W3C statute, the primary goal of the ERB
>and WG work is to enable the delivery of SGML over the Web.  Which is
>true.  However, I and several others on both the ERB and the WG 
>(including I think Charles Goldfarb) have an entirely un-hidden agenda,
>namely to construct a lightweight, lean, mean, easy-to-learn on-ramp
>for SGML.  

Not just an on-ramp, in my case, but a genuine subset. That means that a valid
XML 1.0 document must be a conforming ISO 8879:86 document.

>3. XML and existing SGML tools
>Existing SGML tools can, if they're compliant, read XML today modulo
>*only* overlapping enumerated attribute values and perhaps some 
>mild inconsistencies as to which RE's are where.  

The former makes XML *not* conforming SGML. The latter could avoid being
nonconforming if the spec were written carefully (see below).

> XML is a subset of SGML in spirit and in fact, and
>XML is designed so that the adjustments required in existing SGML tools
>are trivial.

The above statement is inconsistent on its face. If XML is a subset of SGML "in
fact", why should existing "compliant" SGML tools require any adjustments? One
can argue (but not here, please) whether a DTD-less markup language is a subset
of SGML in spirit, but to be one in fact means that conforming XML documents
must conform to 8879. At present, they don't.

However, XML can be made conforming (without changing its present RE handling)
as follows:

1. The spec must carefully distinguish the "conforming subset" SGML syntax from
the other things that XML defines, but which have nothing to do with generic
SGML parsing. So far, those other things are:

a. Character encoding rules for storage objects.

b. A mandatory unvarying SGML declaration. It declares the XML concrete syntax,
feature set, etc.

c. A mandatory unvarying "XML component" of every DTD. It consists of some
entity declarations, a short reference mapping declaration, some element type
declarations for empty element types, and a data content notation for white
space handling (discussed below). It is considered to occur at the start of the
internal subset, meaning that nothing in it can be preempted.

d. Some attribute definitions for use in DTDs (notably those having to do with
white space handling).

e. Prescribed semantic (i.e., not parsing) behavior of XML processors, such as
browsers and editors. This would include rules for processor behavior when an
XML document is non-conforming (not valid), but possibly well-formed, such as in
the absence of a DTD.

2. XML's white space handling should be defined as a data content notation
(XMLWS), applicable to the document element. (Suitable shortref mappings can
prevent normal SGML RE handling from taking place.) The behavior of the notation
can be dependent on attribute values and/or element types, so existing XML white
space behavior and/or other useful behaviors could be supported. (It is
possible, incidentally, that full SGML applications might want to use this

3. XML 1.0 would have to drop enumerated attribute values. As has been pointed
out, the uniqueness constraint is unjustifiable in the absence of shorttag
minimization. Declare them as CDATA and include a comment constraining the
values (perhaps expressed as a regexp). All that is lost is parser-level
validation. We've exhibited great restraint in more important areas to maintain
SGML conformance; it would be unfortunate to abandon conformance just for this