Re: XML, SGML & the Web (was: Shorthand for default attributes) from Bert Bos on 1997-05-15 (w3c-sgml-wg@w3.org from May 1997)

From: Bert Bos <bbos@mygale.inria.fr>
Date: Thu, 15 May 1997 23:57:55 +0200 (MET DST)
To: w3c-sgml-wg@w3.org
Message-Id: <199705152157.XAA03725@mygale.inria.fr>

Steven J. DeRose writes:
 > At 03:00 PM 05/15/97 +0200, Bert Bos wrote:
 > 
 > > > Not true! Most HTML *files* are small. There are many massive documents on
 > > > the Web that are broken into non-intuitive, hard to use chunks because the
 > > > Web is massively optimized for small documents instead of for retrieving
 > > > small parts of large documents. *WE MUST NOT PERPETUATE THIS MISTAKE*.
 > >
 > >OK, the Web is one huge document...
 > >
 > >No, I don't agree with you. There are nodes in the Web, we usually
 > >call them documents. It is convenient for people to work with chunks
 > >of information of a certain size. There is usually some intuitive
 > 
 > But rhetorical/conceptual convenience is not what is going on on the Web
 > for the most part. Things are broken up due to bandwidth constraints, and
 > because navigational sophistication is limited by limited markup and interfaces.
 > 
 > 
 > >Anything larger is also unlikely to be hierarchical. It is hard enough
 > >to create a linear document of a dozen pages, for something the size
 > >of a book you already need several months. The Web gives an alternate
 > >structuring method, so use it! What is XML-link for, if not for that?
 > 
 > This is incorrect. Most big documents are richly, intensely, fundamentally
 > hierarchical. There are lots of reasons for this, including cognitive and
 > linguistic ones as well as practical/access ones. I've done statistical
 > analysis on the markup of large documents (ranging up to hundreds of MB). 

Statistics never tell you in which direction the causality runs. Your
finding supports my argument exactly: people can deal with large
documents only if they have a rigid structure. If information doesn't
have that structure, it will be put in hypertext instead.

XML is not a hypertext format. XML is a format for *one node* in a
hypertext, just as HTML is.

 > 
 > >With current network speeds, a book of 300 pages will not yet be
 > >downloaded in 3 seconds, but that situation will improve. Parsing 300
 > >pages is not a problem for current computers. Maybe it would be a
 > >problem to parse the whole Encyclopaedia Britannica, but as I said,
 > >that "document" is an exception.
 > 
 > Parsing 300 pages will always be annoyingly slow. Try it off your local HD;
 > the net is not the only problem. If document open time rises from one second
 > to three, it's a big problem. And the last time I benchmarked NS 3 on a
 > Pentium 120, it took several *minutes* to bring up a 400 page document off a
 > *local* and very fast disk.
 > 
 > 
 > >    3. XML shall be compatible with SGML.
 > >
 > >       1. Existing SGML tools will be able to read and write XML data.
 > >
 > >       2. XML instances are SGML documents as they are, without changes to
 > >          the instance.
 > >
 > >       3. For any XML document, a DTD can be generated such that SGML will
 > >          produce "the same parse" as would an XML processor.
 > >
 > >       4. XML should have essentially the same expressive power as SGML.
 > 
 > >
 > >Clearly points 1 and 2 are not met, so, according to the note, the
 > >spec should instead have a section on the recommended way to translate
 > >back and forth, with minimal loss of information.
 > 
 > Huh? A very large set of existing SGML tools can and do read XML documents.
 > And a lot of them didn't need anything but a new SGML declaration. And XML
 > document instances *are* SGML document instances. We've said from the very
 > beginning that we were not requiring them to be SGML *under the same DTD and
 > SGML declaration*; just that such declarations exist.

Originally I thought that it should be possible, and it was an
interesting puzzle to try to find such a declaration and a rewrite of
a DTD. I never managed to do it. And it isn't of any practical use
either, since rewriting the DTD will not be a mechanical process and
most tools can't change the SGML declaration anyway.

Which tools did you try (and what SGML declaration)?

  - (n)sgmls can't read them without a doctype.

  - Even with a doctype it can't deal with "/>", unless I set NET to
    be "/>", but that is incorrect and leads to erroneous results in
    many cases.

  - (n)sgmls also ignored some REs, no matter what the content model
    (I even tried to rewrite the DTD to use inclusion
    exceptions as much as possible - it helped some, but not enough).

  - The various HTML browsers I tried couldn't deal with "/>"; some
    could when I preceded it with a space.

  - Dan Connolly's sgml-lex
    (http://www.w3.org/pub/WWW/MarkUp/SGML/#sgml-lex) couldn't
    either.

  - psgml can't deal with "/>".
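
To illustrate the kind of translation step the "/>" failures above
would call for, here is a minimal sketch in Python. It assumes the XML
instance is well-formed and uses the standard xml.etree.ElementTree
module; the function name expand_empty_tags is made up for this
example. It rewrites every empty-element tag as a start-tag/end-tag
pair. It does not consult a DTD, so for elements that an SGML DTD
declares EMPTY the end-tag would still have to be dropped by some
other means.

```python
import xml.etree.ElementTree as ET


def expand_empty_tags(xml_text: str) -> str:
    """Re-serialize an XML instance without the "<e/>" short form.

    Hypothetical helper: parses the instance with an XML parser and
    writes it back with explicit start- and end-tags, so that a tool
    that chokes on "/>" never sees that delimiter.
    """
    root = ET.fromstring(xml_text)
    # short_empty_elements=False makes the serializer emit <e></e>
    # instead of <e /> for elements with no content.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


if __name__ == "__main__":
    print(expand_empty_tags("<p>line one<br/>line two</p>"))
```

This only addresses the "/>" delimiter; it does nothing about the
missing doctype or the handling of REs mentioned above.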



Bert
-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/pub/WWW/People/Bos/                      INRIA/W3C
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 4 93 65 77 71               06902 Sophia Antipolis Cedex, France
Received on Thursday, 15 May 1997 17:57:58 UTC