Re: SGML and XML

First, we have a compromise on this issue that is far from perfect, but
took us at least a month of intense debate to reach: why are we re-opening
it? Let's leave not-so-well-enough alone! I'm wistful for a better
(different) solution, but given that everyone (including me) seems to find
the compromise livable, shouldn't we leave it in place?

At 11:04 AM 12/1/96, Michael Sperberg-McQueen wrote:
>I think I agree that simplifying RE/RS handling would be good, but
>in the interests of clarity let me ask:  Do you mean we have to make
>the DTD irrelevant to white-space handling
>
>    * because that was decided at some point, or
   Yes.
>    * as a direct consequence of some other decision, or
   Yes, as well: if this is not the case, then applications will produce
different results depending on whether a DTD is present or not.

>    * because that would be a cleaner design?
   I dunno, none of the proposals we've considered leads to a "clean
design" other than making all whitespace significant (a policy that has
been vetoed for compatibility reasons).

>I agree that it would be desirable to make rules which allow
>processors to skip the DTD and still get a wholly correct parse
>tree, er, grove, but I don't think that's possible, since a
>processor cannot resolve entity references or provide correct
>default attribute values without reading the DTD.

We already determined that resolving entity references is not required: the
XML data structure cannot completely hide entity references from
applications in the way that SGML parsers do. Default attributes at least
have fixed values, so they must offer relatively little to an application
that doesn't care about DTDs.

>I also believe that if XML makes no distinction between element
>content and mixed content, it will be impossible to build correct
>XML processors on top of SGML parsers.
We did determine, I think, that some differences would exist between the
SGML and XML "data streams" in the case of RE near markup. This was deemed
acceptable compared to the horrendously complex SGML status quo.

>It might be worthwhile to distinguish clearly among
>  (a) full validation, which checks the instance against the DTD and
>provides the correct grove,
>  (b) full parsing, which doesn't validate but does provide
>character-for-character the correct grove, and which requires at
>least some attention to the DTD, and
>  (c) minimal or partial parsing, which doesn't require a DTD and
>which is guaranteed (in XML) to vary from the correct grove only in
>specific ways:
>    (i) it will have extra white space in element content,
>    (ii) it won't provide correct values for defaulted attributes,
>    (iii) it won't have information about attribute types like
>ID/IDREF.
>    (iv) it won't resolve references to entities not hardcoded into
>the processor.

This is far too complex: two kinds of parsing are enough.
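
To make differences (i) and (ii) in the quoted list concrete, here is a minimal sketch. It assumes Python's standard xml.etree.ElementTree, a non-validating parser that is not part of this thread, and the element and attribute names (list, item, kind) are hypothetical. It parses the same instance with and without an internal DTD subset and compares what the application sees.

```python
import xml.etree.ElementTree as ET

# The same document instance, with and without a DTD (names are hypothetical).
BODY = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"
DTD = (
    "<!DOCTYPE list [\n"
    "<!ELEMENT list (item+)>\n"
    "<!ELEMENT item (#PCDATA)>\n"
    '<!ATTLIST item kind CDATA "plain">\n'
    "]>\n"
)

without_dtd = ET.fromstring(BODY)
with_dtd = ET.fromstring(DTD + BODY)

# (i) White space between the <item> children arrives as ordinary text:
# root.text holds the run before the first child, child.tail the run after each.
ws_without = [without_dtd.text] + [child.tail for child in without_dtd]
ws_with = [with_dtd.text] + [child.tail for child in with_dtd]

# (ii) The defaulted attribute is only visible when the declaration was read.
kinds_without = [item.get("kind") for item in without_dtd]
kinds_with = [item.get("kind") for item in with_dtd]

print(ws_without, ws_with)
print(kinds_without, kinds_with)
```

In this sketch the white space in element content is passed through in both cases, because this particular parser never uses content models to discard it. The value of kind does change with the presence of the DTD, which is the kind of DTD-dependent result under discussion.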

>I'm not sure (iii) is part of the standard grove output on which we
>want to model the XML processor's output; if it isn't, then that's
>one difference fewer.

   I don't know from groves, but if we repeat the mistakes of the ESIS on
ID/IDREF (making them into crypto-types that the application can't see), we
are truly foolish. For the same reason we need to make sure that empty
elements are treated as a distinct type from elements with no content:
another ESIS mistake that I thought we had forsworn. Or are we going to
give up trivial XML->XML filtering?
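
As an illustration of the filtering concern, here is a minimal sketch, again assuming Python's xml.etree.ElementTree (not something discussed in this thread) and hypothetical element names. Its data model does not record whether an element was written as an empty-element tag or as a start/end pair with no content, so an identity filter cannot reproduce its input.

```python
import xml.etree.ElementTree as ET

# <br/> is an empty-element tag; <p></p> is an element with no content.
source = "<doc><br/><p></p></doc>"
tree = ET.fromstring(source)
br, p = tree

# After parsing, the two children look identical: no text, no children.
same_shape = (br.text, len(br)) == (p.text, len(p))

# Serializing the unmodified tree is the "trivial filter".
round_trip = ET.tostring(tree, encoding="unicode")

print(same_shape)
print(source)
print(round_trip)
```

Both children come back in the same form on output, so the distinction made in the input is lost after one pass through the filter.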


>-C. M. Sperberg-McQueen

I am not a number. I am an undefined character.
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________

Received on Monday, 2 December 1996 16:02:00 UTC