Re: SGML and XML from Michael Sperberg-McQueen on 1996-12-02 (w3c-sgml-wg@w3.org from December 1996)

From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
Date: Mon, 02 Dec 96 15:16:29 CST
To: "David G. Durand" <dgd@cs.bu.edu>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <199612022259.RAA04845@www10.w3.org>
On Mon, 2 Dec 1996 16:02:35 -0500 David G. Durand said:
>First, we have a compriomise on this issue that is far from perfect,
>but took us at least a month of intense debate to reach: why are we
>re-opening it? Let's leave not-so-well-enough alone! I'm wistful for
>a better (different) solution, but given that everyone (including me)
>seems to find the compromise livable, shouldn't we leave it in place?

I think some people are interested in re-opening it because Charles
seems to have found a way to get RE Delenda Est behavior out of SGML
processors.  At least, it's the behavior I thought was intended for
REDE:  record ends are always passed through in mixed content, and
all white space is eaten in element content.  Since many people gave
up on REDE solely for compatibility reasons, Charles's new proposal
seems to open the possibility of getting something materially better
than the compromise we have now.

>At 11:04 AM 12/1/96, Michael Sperberg-McQueen wrote:
>>I think I agree that simplifying RE/RS handling would be good, but
>>in the interests of clarity let me ask:  Do you mean we have to make
>>the DTD irrelevant to white-space handling
>>
>>    * because that was decided at some point, or
>   Yes.

When was that?  I don't remember that decision, and I'd like to
refresh my memory on the wording.

Actually, I rather suspect that Lee was just failing to distinguish
his own opinions from logical necessities and from decisions made in
the design to date, and since he has scolded others on that account,
I thought I'd ask to make sure.

>>    * as a direct consequence of some other decision, or
>   Yes, as well: if this is not the case, then applications will produce
>different results depending on whether a DTD is present or not.

Yes, they will.  How is this incompatible with any decision we've made?

I don't remember a decision saying they can't produce different results,
only a series of strenuous attempts to minimize the set of differences.
It didn't go to zero, and won't while XML has attribute defaults and
entities.  Since I'd rather have attribute defaults and entities than
lose them, this seems like a good deal to me.

The goal is to let some useful applications skip the DTD, if they can.
Right now, very few useful applications can skip the DTD, because most
useful applications are guided in their actions by element boundaries,
and these cannot be reliably detected without knowing which elements
have declared content.  In XML, there are no CDATA or RCDATA elements,
and EMPTY elements are detectable without a DTD.  The only remaining
types of information for which a DTD is essential are:

  - content models (for validation)
  - attribute types and default values (for recognition of IDREFs
    etc. and for full provision of all attributes which apply to each
    element instance)
  - entity declarations (for resolution of entity references)
  - element vs. mixed content (for elimination of white space in
    element content)

I can think of useful applications which require this information; I
can also think of useful applications which don't.  Removing the
last item from the list doesn't add or shift any applications I can
think of.

>>    * because that would be a cleaner design?
>   I dunno, none of the proposals we've considered leads to a "clean
>design" other than making all whitespace significant (a policy that has
>been vetoed for compatibility reasons).

And brought up again for further discussion precisely because, as
Charles explained, we may find a compatible way to do it - at least
for mixed content.

>>I agree that it would be desirable to make rules which allow
>>processors to skip the DTD and still get a wholly correct parse
>>tree, er, grove, but I don't think that's possible, since a
>>processor cannot resolve entity references or provide correct
>>default attribute values without reading the DTD.
>
>We already determined that resolving entity references is not
>required: the XML data structure cannot completely hide entity
>references from applications in that way that SGML parsers do.
>Default attributes at least have fixed values, so they must offer
>relatively little to an application that doesn't care about DTDs.

I don't say these things are earth-shattering; I do think they
constitute differences between the parse tree produced by processors
which read the DTD and those which don't.  The parse trees aren't the
same and tweaking the rules for white-space handling in element content
won't make them the same.  So I don't see why we should depart from SGML
by saying that white space should be preserved in element content.  If
that were the only difference between DTD-less parsing and DTD-aware
parsing, we might have something to discuss, but it's not and we don't.

>>I also believe that if XML makes no distinction between element
>>content and mixed content, it will be impossible to build correct
>>XML processors on top of SGML parsers.
>We did determine, I think, that some differences would exist between
>the SGML and XML "data streams" in the case of RE near markup. This
>was deemed acceptable compared to the horrendously complex SGML
>status quo.

Not all REs near markup; only REs in pathological positions, and
they won't appear or disappear, they'll only be moved from their
actual position by SGML parsers, and left alone by XML processors.

>> �discussion of validation, production of correct ESIS, and
>> simple parsing with almost-correct ESIS�

>This is far too complex: two kinds of parsing are enough.

Which two do you choose?

I suggested we distinguish the three concepts of validatiion,
generation of correct ESIS, and minimal parsing because we seem to
be using all three of these distinct concepts while pretending to
ourselves that there are only two concepts involved.  If one of
these three is unnecessary, by all means tell us which it is so we
can do without it.

Michael
Received on Monday, 2 December 1996 17:58:39 UTC