
XML, SGML & the Web (was: Shorthand for default attributes)

From: Bert Bos <bbos@mygale.inria.fr>
Date: Thu, 15 May 1997 15:00:17 +0200 (MET DST)
Message-Id: <199705151300.PAA13772@mygale.inria.fr>
To: w3c-sgml-wg@w3.org
Paul Prescod writes:
 > >  > Yes, but it creates unbounded linear dependencies, forcing the parsing of
 > >  > an entire document from the beginning, with all entity references
 > >  > resolved. A State-independent solution allows "lazy" entity parsing, and
 > >  > re-use of partial documents as well-formed XML fragments.
 > > 
 > > True, in the worst case, but there are several arguments why this is
 > > not a big problem:
 > > 
 > >   - The vast majority of documents is small, on the Web that is even
 > >     more true than elsewhere.
 > Not true! Most HTML *files* are small. There are many massive documents on
 > the Web that are broken into non-intuitive, hard to use chunks because the
 > Web is massively optimized for small documents instead of for retrieving
 > small parts of large documents. *WE MUST NOT PERPETUATE THIS MISTAKE*.

OK, the Web is one huge document...

No, I don't agree with you. There are nodes in the Web, we usually
call them documents. It is convenient for people to work with chunks
of information of a certain size. There is usually some intuitive
reason for putting a certain amount of information in a document, and
it turns out that most people write documents (both on the Web and
elsewhere) that are a similar size. Letters are one or two pages,
articles are less than a dozen pages, books are about 300
pages. Anything larger than that is an exception. If you look at a
graph of the number of documents versus their size, you'll see a curve
that falls off exponentially with increasing document size. This is
not (only) due to the computer; it is the way people function.

Anything larger is also unlikely to be hierarchical. It is hard enough
to create a linear document of a dozen pages; for something the size
of a book you already need several months. The Web gives an alternate
structuring method, so use it! What is XML-link for, if not for that?

With current network speeds, a book of 300 pages will not yet be
downloaded in 3 seconds, but that situation will improve. Parsing 300
pages is not a problem for current computers. Maybe it would be a
problem to parse the whole Encyclopaedia Britannica, but as I said,
that "document" is an exception.

And the example of the encyclopedia also shows that large documents
tend to be very regular in structure: they are databases made up of
records. It is no coincidence that the only really large documents are
databases. To handle things that large, people need a rigid
structure. DBMSs deal with gigabytes pretty well; requiring a generic
XML parser to deal with it doesn't sound reasonable to me. Instead,
pipe the DBMS output into the XML parser and be done with it.
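
To make the "pipe the DBMS output into the parser" idea concrete, here
is a minimal sketch in Python. Everything named in it is an assumption
for illustration only: the `entry` table, the `<entries>`/`<entry>`
element names, and the use of sqlite3 as a stand-in DBMS. The database
does the heavy lifting, and the parser only ever sees one small record
at a time.

```python
import sqlite3
from xml.etree.ElementTree import XMLPullParser
from xml.sax.saxutils import escape


def dbms_output(conn):
    """Yield an XML serialization of a query result, one chunk per record.

    The table and element names are hypothetical.
    """
    yield "<entries>"
    query = "SELECT title, body FROM entry ORDER BY title"
    for title, body in conn.execute(query):
        yield (
            f"<entry><title>{escape(title)}</title>"
            f"<body>{escape(body)}</body></entry>"
        )
    yield "</entries>"


def parse_stream(chunks):
    """Feed chunks to an incremental parser and yield each record's title."""
    parser = XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == "entry":
                yield elem.findtext("title")
                # Drop the record's text and children once it is handled;
                # the emptied element itself stays attached to the root.
                elem.clear()
    parser.close()


# sqlite3 stands in for the real DBMS here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [("Aardvark", "A mammal."), ("Abacus", "A counting frame & tool.")],
)
titles = list(parse_stream(dbms_output(conn)))
print(titles)
```

The parser never needs the whole "document" at once; how much a real
parser would have to retain depends on the application.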

 > >   - You can arbitrarily limit namespaces by putting a !doctype
 > >     somewhere. 
 > Then you introduce many OTHER namespace problems like IDs, entities etc.

IDs must be unique in the whole document, not just the subdocument.
(Otherwise we'll have to change the xpointer syntax, and I rather like
it the way it is.) Of course, parsers don't care whether an ID is
unique or not, they just assume it is.
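
For what it is worth, checking uniqueness over the whole document is
cheap if an application does want it. A rough sketch in Python,
assuming (since there is no DTD to say otherwise) that the ID
attribute is simply called `id`; the function name and the sample
document are made up for the example:

```python
from collections import Counter
import xml.etree.ElementTree as ET


def duplicate_ids(xml_text, id_attr="id"):
    """Return the sorted ID values that occur more than once in the document.

    Assumes the ID attribute is named `id_attr`; a real processor would
    learn which attributes are IDs from the DTD.
    """
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


doc = """<book>
  <chapter id="intro"><para id="p1"/></chapter>
  <chapter id="body"><para id="p1"/></chapter>
</book>"""

print(duplicate_ids(doc))  # ['p1']
```

Each chapter on its own has unique IDs; only the check over the whole
document finds the clash on `p1`.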

I don't need entities (but if you can convince me that I do, they are
local, just like attributes).

 > > I agree with you there, but there is a fallacy in calling them "PIs",
 > > since PIs are a term from SGML, and in SGML they are not targeted at
 > > SGML parsers, but at the applications built on top of the parsers.
 > > 
 > > You're defining XML, you need a widget to define something that is
 > > common to, and obligatory for all XML parsers. You can use whatever
 > > syntax you like. Who cares whether it looks like SGML or not?
 > Please see:  http://www.textuality.com/sgml-erb/dd-1996-0001.html
 > These are our goals and I feel that it is too late to change them. XML would
 > be a very different language if SGML compatibility were not an important
 > goal.

Maybe. But how important is this compatibility? Here is a quote from
the document you mentioned:

    3. XML shall be compatible with SGML.

       1. Existing SGML tools will be able to read and write XML data.

       2. XML instances are SGML documents as they are, without changes to
          the instance.

       3. For any XML document, a DTD can be generated such that SGML will
          produce "the same parse" as would an XML processor.

       4. XML should have essentially the same expressive power as SGML.

    Note: #1 and #2 describe our goal in its ideal form. If this goal is
    not achievable in its fullest form, then we may back out to a weaker
    form: it shall be simple to transform XML documents into equivalent
    SGML documents, and vice versa. Our intention, however, is to bite the
    bullet and ensure if we can that no transformation is needed to allow
    SGML tools to read and write XML document instances.

    #3 and #4 indicate our intentions accurately, but it is not yet clear
    how best to formalize and explain the phrase "the same parse", or the
    phrase "essentially the same expressive power". These remain open
    questions; see point 8 also.

Clearly points 1 and 2 are not met, so, according to the note, the
spec should instead have a section on the recommended way to translate
back and forth, with minimal loss of information.

It is my feeling that points 1 and 2 *had* to fail, and I'm glad that
they did. Now the WG should indeed `bite the bullet' and spend some
resources on discussing the best translation. (Not too many resources,
though, because there are more important things to do.)

(I said "minimal loss of information", because it is not clear what
the information content of an SGML document is (nor of an XML document
for that matter, but it's still early enough to fix that; see point 8
in the above-mentioned document). The "grove" concept that was
retrofitted onto SGML is an intellectual tour-de-force, but also proof
that the SGML spec was incomplete. If the SGML spec had said
explicitly that no meaning must be attached to such things as the
choice of delimiters or the order of attributes, then the grove
wouldn't have been necessary.)

  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/pub/WWW/People/Bos/                      INRIA/W3C
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 4 93 65 77 71               06902 Sophia Antipolis Cedex, France
Received on Thursday, 15 May 1997 09:00:35 UTC
