
Re: Welcome to the XML-URI list

From: W. E. Perry <wperry@fiduciary.com>
Date: Mon, 15 May 2000 15:47:32 -0400
Message-ID: <39205454.B26BD786@fiduciary.com>
To: Tim Berners-Lee <timbl@w3.org>, xml-uri@w3.org, xml-dev@xml.org
Tim Berners-Lee wrote:

> There are those who would maintain that a
> namespace should have no semantics, but I would say that then documents will
> have no semantics, and will be useless to man or machine.  [You can go
> through the philosophical process of defining all semantics in terms of
> syntactic operations, of course, in which case the pedant can take the view
> that all is syntax, but it is not a helpful view when making this decision,
> as it leaves the main points the same and just gives a more difficult
> framework for most people to think of it].

I do not flatter myself that the Director reads my postings, but I have argued
for many months (e.g. http://xml.org/archives/xml-dev/2000/03/0380.html ) that
semantics are local to the node where instance markup is processed and that the
XML family of specifications could (should!) aspire to no more than the
specification of syntax. In an Internet topology, the effective definition of a
process is the form of its execution on a particular occasion at a 'client-side'
node. Some processes--the html browser's processing of a link; the node's
request to DNS to resolve a name not in its host table--are performed in
standard ways by common software which first had to be distributed to and
installed individually on 200 million plus nodes. A goal (or if stating it that
way is now seen as historical revisionism, then an advantage) of XML from the
start was that interoperability would not require updating software on those 200
million plus nodes to conform to some new procedure. The trivial example was
that a new bit of browser behavior would require neither another iteration of
the html spec nor new non-standard vendor-specific code, either of which then
had to be distributed out onto all those nodes.

The chief question raised--and left unanswered--in XML 1.0 was how the specific
local functionality, required to implement the behavior implied by new markup,
would be implemented. One very early book on XML essentially assumed that Java
code would need to be written at every node to realize the functionality
described by every new use of markup. In retrospect, this may have shown a
clearer understanding than recent XML specifications of the nature of--and the
place of XML markup in--a decentralized peer-to-peer Internet topology. At least
it was clear that the implementation of behavior was idiosyncratic and local to
the node. Over the past two years, consensus opinion has come to imply that there will be
standard XML processors at each node--'standard' in this case implying that a
processor implements features of the XML family of specifications in a
predictable and non-self-contradictory manner. On that assumption, XML
specifications have been written, at least since the 'Namespaces in XML
Recommendation', not simply as syntactic prescription, but with very definite
opinions of how defined syntactic structures of markup should be processed at
the local node, and of what the semantic outcome of that processing should
predictably be. In my (admittedly heretical) opinion, the burden of semantic
expectation upon XML specifications has increased exponentially from the days of
the original PI namespace processing to the current Schema draft, and will very
likely do so again by the time Query reaches PR status.

It is therefore little surprise that the XML community has reached the current
"coordination" hassle. The definition of equivalence in namespaces as simple
character-by-character matching is a vestige of the time when the
acknowledged purview of XML was text, and that is embarrassingly primitive to
those who have since gone on to specify much of the arbitrariness of text out of
XML. Let us honestly admit that the fundamental objection to simple character
matching is that it is insufficiently dense in semantics for the current taste
and practice in specification making. That admission recognizes the trend which
has reduced regexp tools to a decidedly second-class status in the XML world,
precisely because prescribed markup, and content, now bears semantic meaning
well beyond the reach of mere text manipulation tools.
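That character-by-character test can be sketched concretely (the URIs below are invented for illustration, and the snippet is an illustrative sketch, not any particular parser's code): two namespace names which a generic URI comparison would call equivalent remain, as namespace names, simply different strings.

```python
import xml.etree.ElementTree as ET

# A namespace name is compared character by character; no URI
# normalization (case of scheme or host, trailing slash, %-escapes)
# is applied.  The URIs here are hypothetical examples.
doc = """<root>
  <x xmlns="http://example.org/schema"/>
  <x xmlns="HTTP://EXAMPLE.ORG/schema"/>
</root>"""

first, second = ET.fromstring(doc)
print(first.tag)   # {http://example.org/schema}x
print(second.tag)  # {HTTP://EXAMPLE.ORG/schema}x

# RFC 2396 treats the scheme and host case-insensitively, so a
# generic URI comparison would call these equivalent references --
# but as namespace names they are distinct:
print(first.tag == second.tag)  # False
```

Nothing beyond literal string equality is in play, which is exactly why the test strikes the semantically ambitious as primitive.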

The coordination hassle, however, is not confined to namespaces, and if it is
sufficient to halt work while the contradictions it has introduced to namespaces
are resolved, it should be worth all of our while to take this time to look at
the general form of the problem and to consider its general solution. I have
spent all of the year thus far wrestling with the problems of designing a system
in which excerpts of running text, some quite long, must be committed to an
XML-specific database. Within that database, the text may not be BLOB'ed but
must always remain directly accessible as running text. At the same time, either
the user who originally commits that text, or any other user of it, may, by
embedding markup into it, note his own understanding of the significance and
internal relationships of that text, or its relationships to external database
objects, some of them also arbitrary text, which might not be accessible to any
other users of the original text. In other words, the text must both remain
simple text and also serve as the vehicle for whatever semantics a particular
user may have an interest in.
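The requirement can be sketched minimally (the element names and the namespace below are hypothetical, chosen only to illustrate the design): one user's embedded markup carries his reading of the excerpt, yet stripping every tag must always recover the running text intact.

```python
import xml.etree.ElementTree as ET

# Hypothetical example: a stored excerpt into which one user has
# embedded markup expressing his own semantics (the namespace is
# invented for illustration).
excerpt = ('<excerpt xmlns:u1="http://example.org/user1/terms">'
           'The trustee shall distribute '
           '<u1:obligation>the net income</u1:obligation>'
           ' to the beneficiary quarterly.'
           '</excerpt>')

root = ET.fromstring(excerpt)

# The markup records one user's understanding of the text...
marked = [elem.tag for elem in root.iter() if elem is not root]
print(marked)  # ['{http://example.org/user1/terms}obligation']

# ...while the running text remains directly accessible, un-BLOB'ed:
print("".join(root.itertext()))
# The trustee shall distribute the net income to the beneficiary quarterly.
```

The same excerpt thus serves simultaneously as simple text and as the vehicle for whatever semantics this user, or any later user adding markup of his own, has an interest in.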

Notice that such an interest is effectively expressed only in the use which a
particular user might make of the semantics (introduced by his own markup, or by
the markup of others which he might have access to) and of the text itself, for
a particular purpose on a particular occasion. In order to effect that use, the
user must apply processing. That processing must be in large part idiosyncratic.
He cannot simply invoke another user's processes to handle the semantics which
that user introduced, because in combining those semantics with his own, and
with those of still other users of whom the first knows nothing, he may well
have altered the first user's semantics beyond anything that user's processes
ever contemplated, or could handle. Actually, the problem is often not even as
complex as that, but is
still a problem:  the user on any particular occasion will often have an
entirely different intent for the semantics, or even for the simple text, of
another user. That difference of intent utterly alters what processing must be
applied, and alters it in a way which can only be known at the specific node, in
the specific instance. We already have an example of this general problem in the
specific case of namespaces:  the question of whether anything should be
retrievable from a URI when that URI is used solely as a namespace reference.
Frankly, I don't see how we can adequately resolve the coordination hassle
without revisiting that question, as well.
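On that specific question, today's generic XML processors already behave as if nothing need be retrievable: the namespace name is handled as an opaque identifier, and no retrieval is ever attempted. A sketch (the URI below is deliberately unresolvable, and hypothetical):

```python
import xml.etree.ElementTree as ET

# The namespace name points at a host that need not exist; a
# conforming namespace-aware parser uses it purely as an identifier
# and never attempts to dereference it (hypothetical URI).
doc = ('<p:order xmlns:p="http://no-such-host.invalid/ns/orders">'
       '<p:item>widget</p:item>'
       '</p:order>')

root = ET.fromstring(doc)  # succeeds with no network access at all
print(root.tag)  # {http://no-such-host.invalid/ns/orders}order
```

Whether anything *should* be retrievable from that URI is precisely the question that remains open.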

> A document is a communication between publisher and reader. Its significance
> is the product of its contents and the definitions of the terms it uses.  As
> we have increasingly powerful schema languages, we can say now syntactic
> (with xml-schema) and later semantic things about those terms, until
> eventually we will, in a machine-processable document, be able to relate one
> XML namespace to others in a web so as to allow machine conversion between
> systems using different namespaces, and searches and inference across many
> different applications.  There is, therefore a great deal to be said for
> using, for namespaces, a URI which allows one to look up some definitive (or
> non-definitive) information about it.  This applies the power of the web at
> a new level: in bootstrapping from one language into another.

The significance of a document is the product of its contents and the
definitions of the terms it uses *as applied in the instance, by the reader, to
the document*. Where the reader fetches those definitions from is decidedly
secondary to the process by which they are applied in the instance, and to the
outcome of that process. The framer of one set of such definitions (be it W3C WG
or vertical industry consortium defining an industry transactional data
vocabulary) cannot know the specifics of that instance unless it exercises a
cartel power to prescribe the circumstances in which its definitions are
permitted to be used. Let us assume that we are committed to openness and
extensibility, and so rule out reliance on that restrictive cartel power. If,
then, the framers of definitions cannot know the specific circumstances in which
those definitions will be applied, they cannot predicate their design of those
definitions on the expected semantic outcome of their use. That change of
perspective would alter utterly not only the terms of the present coordination
hassle, but the dozens of analogous hassles which lie hidden in specifications
whose syntax is designed to effect an expected semantic result. That change of
perspective is a much bigger solution than, I suspect, was wanted when this
problem was opened for discussion, but it does provide an intellectually
defensible way out of the problem. Might we debate, now, the specifics of how
processing is to be implemented at the individual autonomous node so that the
semantic intentions of the definers do not matter, and any or all of the allowed
syntactic forms might be successfully processed?

Respectfully,

Walter Perry
Received on Monday, 15 May 2000 15:47:37 GMT
