Re: A serious detail point from Peter Murray-Rust on 1997-04-18 (w3c-sgml-wg@w3.org from April 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Fri, 18 Apr 1997 12:32:48 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <5769@ursus.demon.co.uk>

This is probably a naive view of document management, so please bear with
someone from a different perspective.  

I often wonder why people post very large
documents on the WWW and then reference small sections in them, rather than
posting smaller sections arranged hierarchically.  For example, when I reference
Robin Cover's pages, they are usually about 100-200 Kbytes long, and over a slow
telephone line that's inconvenient.  (NO criticism intended of Robin's 
splendid resource :-).  I imagine it's a question of server-side 
maintainability rather than convenience for authors or readers.

In my own field we are used to highly fragmented information.  For example
'what enzymes with known inhibitors have known structures and are involved
in human genetic disease?' involves querying 5 different databases and the 
material from each is a few Kbytes.  Ideal for XML.

In message <1.5.4.32.19970418104839.00683bc4@mail.u-net.com> Martin Bryan writes:
> At 17:57 17/4/97 CDT, Michael Sperberg-McQueen wrote:
[...]
> 
> In general if you have a reference to a fragment in a document you are more
> likely to have pointers to other fragments of the same document in your
> document than to fragments of other documents. There is a good case that can

I think this is highly dependent on the discipline - it's not something that
is likely to be common where the material pertaining to a 'document' is 
provided by different 'authors'.  Imagine a technical manual for a product -
it might well reference components from 100 suppliers.

> be made for allowing some form of reusable location source identifier which
> is shorter than the full address in any addressing scheme. (Hence locsrc in
> HyTime, which you managed to drop in the latest revision of XML.) Without
> such a facility XML is of limited use. 

Again I think this depends on the discipline.  I see XML as a new language
with a large number of undiscovered applications.  The key thing IMO is
that it is the first system that allows *distributed* components of a document.
The possibilities of this are enormous.  If we consider STM publishing then
a typical publication will reference 20 other publications, and (if people
adopt the HyperGlossary approach :-) be linked into a similar number of 
terminological databases.  If/when documents acquire the 'knowledge environment'
that everyone is excited about, that will not be provided by single documents.
Likewise an author of a document can have her details XMLified.  The document 
will link to the org chart of the publisher, the publication process, the 
referees and whatever.  And that's for a conventional publication.  (Henry 
Rzepa's e-conferences on Chemistry - see www.ch.ic.ac.uk - are 'documents' 
with 150 authors.  The next one will be supported (hopefully :-) by CML/XML.)
> 
[...]
> 
> >The update and maintenance problem is handled nicely by the general
> >entity mechanism you illustrate.  The caching problem can be handled
> >with affinity groups / BOS / whatchamacallits that say "Cache this
> >one, I'm going to need it often".

It would also seem possible to include caching hints by PIs or other meta
data in the document.  For example if a document points to a large document
elsewhere it should be easy to compute the multiple hrefs to this document and
add that information (e.g. in the meta-data: 150 references to this doc,
20 to that, etc.)
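
A minimal sketch of how such reference counts might be computed, assuming links are carried in an `href` attribute and that the part after `#` names a fragment within the target document. The function name `count_link_targets`, the sample markup, and the example.org URLs are all hypothetical, chosen only for illustration; nothing here is taken from the XML or XML-LINK drafts.

```python
from collections import Counter
from urllib.parse import urldefrag
import xml.etree.ElementTree as ET


def count_link_targets(xml_text, attr="href"):
    """Count links per target document, ignoring fragment identifiers."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        target = element.get(attr)
        if target:
            # Split "doc.xml#frag" into the document part and the fragment.
            document, _fragment = urldefrag(target)
            if document:  # skip purely internal links such as "#local"
                counts[document] += 1
    return counts


# Hypothetical sample document with links into two external documents.
SAMPLE = """<doc>
  <link href="http://example.org/big.xml#a"/>
  <link href="http://example.org/big.xml#b"/>
  <link href="http://example.org/other.xml"/>
  <link href="#local"/>
</doc>"""

hints = count_link_targets(SAMPLE)
for document, n in hints.most_common():
    print(n, "references to", document)
```

The resulting counts could then be written into a PI or meta-data element as a caching hint; how a client would interpret such a hint is left open here.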

The main benefit that I can see from sending large multi-fragment documents 
over the WWW is that the reader can print them.  XML allows the abstraction
of subfragments using TEI Xptrs but we should remember that there is no
*requirement* for the server to support this, so mirror sites might be 
inefficient until they are XML-aware.  Sending smaller components is bandwidth
friendly, and also allows the reader to customise things.  [An example:
"I'd like to explore DSSSL 10179:  I'd like the table of contents, the 
terminology, Chapter 9 (but NOT 9.6, please!), and none of Chapter 12
because I'm not interested in typesetting.  And (speaking to trusty client-side
user-agent) please bundle this and print it for me because I want to read it
in the bath". At present I have to pay to download the whole lot!]
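
The kind of request in the example above could be sketched on the client side as a simple include/exclude filter over section numbers. This is only an illustration: the section labels below are made up and are not the real structure of ISO/IEC 10179, and `section_selected` is a hypothetical helper, not part of any XML tool.

```python
def section_selected(number, include, exclude):
    """True if a section falls under an included prefix and under no excluded one."""
    def under(prefix):
        # "9.6.1" is under "9.6" and under "9", but "91" is not under "9".
        return number == prefix or number.startswith(prefix + ".")
    return any(under(p) for p in include) and not any(under(p) for p in exclude)


# Hypothetical section labels for a large standard.
sections = ["toc", "terminology", "9", "9.1", "9.6", "9.6.1", "12", "12.3"]

wanted = [
    s for s in sections
    if section_selected(s, include=["toc", "terminology", "9"], exclude=["9.6", "12"])
]
print(wanted)
```

Only the selected components would then need to be fetched and bundled for printing, which is where the bandwidth saving comes from.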

Commandment 4 of XML says:

"it shall be easy to write programs which process XML documents."

So far I have been keeping my head just about above water in following the spec,
but I suspect that if XML needs to support a great deal more functionality that
won't be true. Similarly:

"XML documents should be human-legible and reasonably clear." (ibid. cmd 6)

and

"XML documents shall be easy to create" (ibid. cmd 9)

[Is the "shall/should" distinction meaningful? :-)]

I hope that these criteria are desirable in XML-LINK as well.

I interpret these cmds to mean that the semantics of XML documents should be as
self-evident as possible.  I'm particularly thinking of the (now) very large
number of people who will want to graduate from HTML.  So far I think XML
is within their capabilities (after all I'm one).  

	P.

> ----
> Martin Bryan, The SGML Centre, Churchdown, Glos. GL3 2PU, UK 
> Phone/Fax: +44 1452 714029   WWW home page: http://www.sgml.u-net.com/

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Friday, 18 April 1997 08:24:06 UTC