Re: XML, SGML & the Web (was: Shorthand for default attributes) from Paul Prescod on 1997-05-15 (w3c-sgml-wg@w3.org from May 1997)

From: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
Date: Thu, 15 May 1997 11:07:12 -0400
To: w3c-sgml-wg@w3.org
Message-ID: <337B26A0.1A96DB91@calum.csclub.uwaterloo.ca>
Bert Bos wrote:
>....
> If you look at a
> graph of the number of documents versus their size, you'll see a curve
> that falls off exponentially with increasing document size. This is
> not (only) due to the computer; it is the way people function.

The critical point is that we agree that there are large documents and
always will be large documents. I think we should be able to further
agree that people who edit these large documents should have the right
to have them be XML documents in the fullest sense: a single namespace,
shared entities, one DTD, one root, one hierarchy, one logical element
stream. We must support these documents. Thus we should not introduce
features that require linear scanning of documents for proper
processing. 

>...
> And the example of the encyclopedia also shows that large documents
> tend to be very regular in structure: they are databases made up of
> records. It is no coincidence that the only really large documents are
> databases. To handle things that large, people need a rigid
> structure. DBMSs deal with gigabytes pretty well, requiring a generic
> XML parser to deal with it doesn't sound reasonable to me. Instead,
> pipe the DBMS output into the XML parser and be done with it.
 
That is exactly what I am suggesting! The DBMS can be a single logical
document and a client/server system can extract XML chunks as necessary.
As long as there are no linear dependencies, extracting a chunk will
never require doing a linear read of the whole DBMS. That is why we must
remove linear dependencies.
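
To make the point concrete, here is a minimal sketch in Python. It assumes
a hypothetical store of self-contained XML chunks plus an index of byte
offsets; the names build_store and extract_chunk and the index layout are
mine for illustration, not part of any proposal. Because no chunk depends
on anything declared earlier in the stream, fetching one entry is a seek
and a bounded read, never a scan from the top.

```python
import io
import xml.etree.ElementTree as ET


def build_store(records):
    """Concatenate self-contained XML chunks and record where each one starts.

    Returns the raw bytes and a hypothetical index: key -> (offset, length).
    """
    buf = io.BytesIO()
    index = {}
    for key, xml_text in records:
        data = xml_text.encode("utf-8")
        index[key] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), index


def extract_chunk(store, index, key):
    """Parse one chunk by seeking straight to it; nothing before it is read."""
    offset, length = index[key]
    store.seek(offset)
    return ET.fromstring(store.read(length))


records = [
    ("aardvark", "<entry id='aardvark'><title>Aardvark</title></entry>"),
    ("zebra", "<entry id='zebra'><title>Zebra</title></entry>"),
]
data, index = build_store(records)
store = io.BytesIO(data)

# The last entry is reached without touching the first one.
entry = extract_chunk(store, index, "zebra")
print(entry.find("title").text)
```

If a chunk could only be interpreted after reading some declaration that
might appear anywhere earlier in the stream, extract_chunk would have to
start at offset zero every time, which is the linear dependency I am
arguing against.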

> Maybe. But how important is this compatibility? Here is a quote from
> the document you mentioned:
> 
>     3. XML shall be compatible with SGML.
> 
>        1.Existing SGML tools will be able to read and write XML data.
> 
>        2.XML instances are SGML documents as they are, without changes to
>          the instance.
> 
>        3.For any XML document, a DTD can be generated such that SGML will
>          produce "the same parse" as would an XML processor.
> 
>        4.XML should have essentially the same expressive power as SGML.
> 
>     Note: #1 and #2 describe our goal in its ideal form. If this goal is
>     not achievable in its fullest form, then we may back out to a weaker
>     form: it shall be simple to transform XML documents into equivalent
>     SGML documents, and vice versa. Our intention, however, is to bite the
>     bullet and ensure if we can that no transformation is needed to allow
>     SGML tools to read and write XML document instances.
> 
>     #3 and #4 indicate our intentions accurately, but it is not yet clear
>     how best to formalize and explain the phrase "the same parse", or the
>     phrase "essentially the same expressive power". These remain open
>     questions; see point 8 also.
> 
> Clearly points 1 and 2 are not met, so, according to the note, the
> spec should instead have a section on the recommended way to translate
> back and forth, with minimal loss of information.

That is not true. Point 2 has been met fully. Point 1 was half-met.
Existing SGML tools *can* read XML documents. They just cannot
(typically) write them without some small tweaks.

The rest of your post proceeded on the assumption that points 1 and 2
had failed, but they have not.
 
> The "grove" concept that was
> retrofitted onto SGML is an intellectual tour-de-force, but also proof
> that the SGML spec was incomplete. If the SGML spec had said
> explicitly that no meaning must be attached to such things as the
> choice of delimiters or the order of attributes, then the grove
> wouldn't have been necessary.

That is not true at all. The grove is an abstraction of the structure of
documents. Yes, it allows the separation of syntax and logical
structure. But this has nothing to do with choice of delimiters or order
of attributes. Thanks to the grove I can make a radically different markup
language or meta-markup language and expect things like DSSSL scripts
and HyTime queries to continue to work (obviously *parsers* have to
change, but back-end tools and logical deductions continue to work). The
grove is to be the borg that subsumes all of the non-SGML data formats in
the world. It is the formalism that allows us to reason about documents
without resorting to discussions of a specific syntax or markup
language. It provides us a way "through" the backwards compatibility
problem.

The grove is the "relational database model" for documents.
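
A rough sketch of that idea in Python, with loud caveats: the Node class
below is a toy stand-in, not the grove model or property sets defined by
the HyTime and DSSSL standards, and the nested-tuple notation is a
hypothetical second syntax invented for the example. The point is only
that the back-end query is written against the tree, so it does not care
which parser produced it.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Node:
    """Toy abstract node: a name, some text, and ordered children."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def from_xml(source):
    """Front end 1: build the abstract tree from XML syntax."""
    def convert(elem):
        return Node(elem.tag, (elem.text or "").strip(),
                    [convert(child) for child in elem])
    return convert(ET.fromstring(source))


def from_tuples(item):
    """Front end 2: hypothetical syntax of (name, text, [children]) tuples."""
    name, text, children = item
    return Node(name, text, [from_tuples(child) for child in children])


def find_text(node, name):
    """Back-end query: sees only nodes, never delimiters or markup."""
    hits = [node.text] if node.name == name else []
    for child in node.children:
        hits.extend(find_text(child, name))
    return hits


xml_doc = ("<book><title>Groves</title>"
           "<chapter><title>Nodes</title></chapter></book>")
tuple_doc = ("book", "", [
    ("title", "Groves", []),
    ("chapter", "", [("title", "Nodes", [])]),
])

# Two different syntaxes, one tree, one query.
assert from_xml(xml_doc) == from_tuples(tuple_doc)
assert find_text(from_xml(xml_doc), "title") == find_text(from_tuples(tuple_doc), "title")
```

Only from_xml and from_tuples know anything about syntax; find_text would
keep working if a third front end were added.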

 Paul Prescod
Received on Thursday, 15 May 1997 11:11:05 UTC