Priorities in XML
(in response to: April Fools RANT about Catalogs)
In message <3341DDE1.73C4@csclub.uwaterloo.ca> Paul Prescod writes:
> Generic markup is also the source of SGML's power -- the power to define
> your own, perhaps non-interoperable documents. XML will not change this.
> I will not be able to download one of Peter M-R's chemical models and
> spin around molecule models in an arbitrary browser (unless he delivers
> his code as a Java applet). XML gives him the power to define something
> that is mostly non-interoperable with my browser, because that is what
> he needs to do to get his job done. When I define a 3D scene in XML
[It is automatically interoperable if certain directions are taken]
I have tried to keep quiet about the PUBLIC debate, partly because I am not
knowledgeable about the relevant RFCs and the use of catalogs, but since it raises
serious implementation issues I think it's important. [I'm also slightly
worried that it's diverting attention from some other issues. I know we are
all working flat-out for WWW6, but there are some issues which I'd like
guidance on before then :-) (which *are* in the drafts).] Since I'm mentioned
in the quoted message, I'll take the opportunity to reply.
The PUBLIC discussion seems to have two main threads. One is TTLG, Chapter 8:
"the name of the DTD is called", the "DTD is called", etc. This is important
in its own right, just as the distinction between Element, ElementType and GI,
but it's not my main concern at present. The other is implementation.
The hidden problem is that some people see XML as an opportunity for a
a. black-box installation
b. totally-reliable transfer of information
c. application-independent (i.e. can do molecules, Beethoven, etc.)
g. inter- and intra-net compatible
h. totally automatic operation (i.e. human-free)
method of distributing complex hyper-resources. It's clearly not that at
present. Some of the check-boxes above have been demonstrated in some
applications, but not all at once. If we aim for all of these we are being
unrealistic, so some of the boxes have to go.
SGML has been built (as far as I can gather) on (b) being at the top of the
list with (h) close behind. (i) doesn't figure.
HTML has been built on (i), (a), (e), (d), (f), (g) in some order. (h) and
(b) have little priority.
XML has yet to work out what its priorities are at present, though in principle
it can offer many of these in a year's time or so (but probably not all).
The PUBLIC debate (which is only one of several areas in XML where conflict
can arise) is at least in part due to the conflict between (h) and other
priorities.
My priority for CML is (b) - the whole point is that the information
is precisely captured, described and maintained. I don't *have* to use XML
but it's ideal in its present state. [I started CML, before XML was
announced, as an SGML-based approach to chemical and technical information.]
Anything else is a bonus. However I have been seduced by XML and Java into
thinking that I can also manage (a), (c), (d), (g), (e) and a bit of (i). I shall
also (I hope) produce some simple material so that people will get sufficiently
enthusiastic that they will put in the effort to overcome the others.
From past experience I have come to realise that *I* cannot design a language
without implementing it at the same time. It's easy to add seemingly simple
things that have major consequences. That's one reason why I ask dumb questions
some time after they have been discussed - because I'm trying to code them.
[I've only just got as far as starting to implement XML-LINK, for example,
and asked some simple questions on xml-dev. Clarification still awaited :-)].
In the past most of my efforts have failed on (a). Indeed it's only in 1997
that we have any chance of addressing this, which for my purposes is the
major concern about PUBLIC and SYSTEM. Essentially what some of the WG are
aiming at is to
* automagically deliver a complete working infallible maintenance-free XML
system to a user without the user even being aware that
XML/DSSSL/Java/lots_of_other_things even exist.
On good days I share in most of this vision :-) I'd also like it to be shared
and developed by the community.
IMO Mosaic and httpd were what made the WWW take off because they were:
- installation-free (fairly)
In the first incarnations they (particularly Mosaic) were NOT robust, and
this paralleled the fact that the hyperdocuments weren't particularly robust.
I believe that appropriate tools give XML the same opportunity in 1997
as Mosaic/httpd did for HTML in 1993 - if we get the distribution right.
My first experience of trying to distribute SGML was with CoST (Joe English's
version). To deliver what I wanted I had to deliver
- my (medium-complex) DTDs
- some entity files
- CoST (in tcl/tk)
- a browser (also in tcl/tk)
- molecular add-ons
- and some documents :-)
to users who had never heard of any of these things. Not surprisingly it
didn't fly - the actual thing that finally stopped me was the difficulty
of porting costwish to tcl/tk under Windows (let alone the Mac).
Last year I was sent a free copy of PanoramaPro (thanks SoftQuad) and was
impressed by the way it delivered documents over the WWW. When I pointed it
at a URL it whirred and clicked like a Heath Robinson machine loading entities,
DTDs and stylesheets as well as the document. I suspect that an SGML-illiterate
could have done this as well as me, though they wouldn't have understood what
was happening. The whole set of files delivered over the WWW essentially
represent a single hyperdocument, and internal self-consistency is critical.
However PP also has its own local files (entities, DTDs, etc.) and the
user can modify what is in these directories, etc. In that way it seems
possible that the user can foul up the self-consistency quite easily by
replacing (say) one entity set with a different one under the same filename.
The robustness is predicated on the user not fingerpoking in the wrong places.
JUMBO. JUMBO aims to solve some of the problems that costwish couldn't. Being
in Java it is:
- installation-simple (trivial where Netsplorer provides a JVM)
- training-free (relatively, since the technology is widespread)
- inter- intra- net friendly
It is not yet robust.
The key question is:
** how do I package/deliver all the components of the installation so
that they are installation-trivial, robust, and self-consistent? **
This seems to be at the root of some of the PUBLIC debate - are either
SYSTEM or PUBLIC robust enough to ensure that the correct document set
is delivered?
I have spent the weekend rewriting JUMBO so that it uses URLs throughout. This
may seem trivial to many of you, but having started [JUMBO] off as an
application (i.e. not an applet) which uses files, it was a revelation to
me. URLs maintain consistency of addressing for any JUMBO document whether
XML, DTD, or class. It allows (in principle) any of these to be downloaded
over the WWW.
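The idea of a single consistent addressing scheme can be sketched in Java itself, since java.net.URL already knows how to resolve a relative reference against a base document, exactly as a browser resolves relative links. The base URL and file names below are made up for illustration; the point is only that the XML document, its DTD and its classes all resolve against the same base.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch: resolve every component of a hyperdocument (XML, DTD, class
// files) against the URL of the document itself, so the whole set stays
// self-consistent wherever it is served from. Names are hypothetical.
public class UrlResolver {

    // Resolve a relative reference against a base document URL.
    public static String resolve(String baseDocument, String relative) {
        try {
            return new URL(new URL(baseDocument), relative).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "http://www.example.org/cml/caffeine.xml";
        // Both resolve relative to the directory holding the document:
        System.out.println(resolve(doc, "cml.dtd"));           // .../cml/cml.dtd
        System.out.println(resolve(doc, "classes/MOL.class")); // .../cml/classes/MOL.class
    }
}
```

Move the whole directory to another server and every internal reference still resolves, which is the consistency property described above.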
Java is particularly supportive of the consistency of a set of documents
when used in a restrictive environment (e.g. a JVM in a browser), since an
applet can only connect back to the site it was loaded from.
It's fairly easy to make sure that only the 'right' set is accessed.
As an example, the current distribution for JUMBO includes:
- dtd.classes (a list of the DTDs in the distribution and their classes)
- the classes for each DTD
- (if required) the DTDs and their entity sets.
Since the whole lot can be downloaded from a server, the consistency ought to
be maintained.
So - in answer to Paul's query - any browser will be able to manage arbitrary
DTDs so long as the classes can be located. For example, when JUMBO detects a
DOCTYPE of PLAY, it looks through its local dtd.classes, finds that it has
some classes which understand Shakespeare rather than molecules and dynamically
loads these. [I hope this can be shown at WWW6]. IFF the browser encounters
a DOCTYPE of FOO, then so long as it can locate FOO.class it can
render/transform/whatever. If I hadn't got PLAY.java locally, then it could be
(potentially) downloadable from the same site as the *.sgm.
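The DOCTYPE-to-classes dispatch described above might be sketched as follows. This is not JUMBO's actual code; the dtd.classes map and the class names are stand-ins (a placeholder JDK class is used so the sketch runs), and a real browser would fall back to fetching the class from the document's own site rather than simply giving up.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: look up the DOCTYPE name in a dtd.classes-style map and load
// the handler class dynamically, in the spirit of JUMBO's PLAY example.
public class DoctypeDispatcher {
    private final Map<String, String> dtdClasses = new HashMap<>();

    public DoctypeDispatcher() {
        // Hypothetical entry: a real map would name a DTD-specific class
        // such as a PLAY renderer; a JDK class stands in so this compiles.
        dtdClasses.put("PLAY", "java.util.ArrayList");
    }

    // Returns the handler class for a DOCTYPE, or null if none can be
    // located (the caller could then try the document's originating site).
    public Class<?> handlerFor(String doctype) {
        String className = dtdClasses.get(doctype);
        if (className == null) return null;   // unknown DOCTYPE, e.g. FOO
        try {
            return Class.forName(className);  // dynamic loading
        } catch (ClassNotFoundException e) {
            return null;                      // named but missing class
        }
    }
}
```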
We need to be able to locate the ancillary documents consistently and robustly.
Personally I don't mind whether this is done through SYSTEM or PUBLIC or both,
but we have to have a clearly defined mechanism for the various types of
environment that it will be done in. If we know that documents are ONLY
going to be delivered into a JVM so that relative addresses cannot break
then SYSTEM would seem to work. If we are expecting users to configure their
resources (e.g. to minimise bandwidth usage), and if we expect them to do
some fingerpoking, then relative addresses will break. PUBLIC would detect
the break, even if it couldn't mend it. If we don't mind URLs decaying, then
the integrity of the information will be maintained although the system may
not work. I can live with that; others may not be able to.
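The PUBLIC-over-SYSTEM behaviour argued for here can be sketched as a trivial catalog: PUBLIC identifiers that the installation knows about map to trusted local copies, and anything unknown falls back to the literal (possibly decayed) SYSTEM URL. The identifiers and paths below are invented for illustration and are not from any real catalog.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a minimal catalog-style resolver: PUBLIC lookup first,
// SYSTEM identifier as the fallback. All entries are hypothetical.
public class SimpleCatalog {
    private final Map<String, String> publicMap = new HashMap<>();

    // Register a local copy for a PUBLIC identifier.
    public void register(String publicId, String localUri) {
        publicMap.put(publicId, localUri);
    }

    // Prefer the catalog's local copy; otherwise trust the SYSTEM URL,
    // accepting that it may have decayed.
    public String resolve(String publicId, String systemId) {
        String local = publicMap.get(publicId);
        return (local != null) ? local : systemId;
    }
}
```

A resolver like this is what lets an installation detect (and locally repair) a broken SYSTEM address while leaving the document itself untouched.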
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences