The self-describing web...

Hello world,

Several current TAG issues (at least namespaceDocuments-8 (maybe),
xmlFunctions-34, RDFinXHTML-35, rdfURIMeaning-39, and
namespaceState-48 (maybe)) relate, in one way or another, to the "self
describing" nature of the web. That is, the principle that you can
start somewhere and "follow your nose" to work out what you've got.

It came up on today's TAG call and it came up at the December
face-to-face. Following the face-to-face, I tried to write down what I
thought we meant by the self-describing web and why it's an important
property.

This was drafted with a mind towards it being the preface to a finding
on xmlFunctions-34 which Henry and I are on the hook to draft. (It's
also closely related to the charter of the XML Processing Model WG
which I'm chairing.)

Anyway, I floated it a bit privately with mixed results so I'm just
going to heave it into the public and see what reaction it elicits. :-)


The web has been successful for both social and technological reasons.
Broadly it is composed of identifiers, protocols, and formats that are
sufficiently orthogonal that innovation can occur independently in
these three spaces: a new format can be deployed over an existing
protocol; a new protocol can be used to transmit an existing format;
and, when necessary, a new identifier scheme can be invented which is,
in principle, independent of the protocol used to interact with
resources thus identified and can identify resources with
representations in any format.

An important, but sometimes overlooked, property of the web which
enables this independent innovation over identifiers, protocols, and
formats is that the web is largely self-describing.

One common interaction pattern proceeds like this: an engineer,
presented with a URI, can read the URI syntax specification to learn
what components are in the URI. This will lead to a URI scheme
specification where she will find information about how to access
resources identified with that scheme (assuming such access is
possible). For example, she might find that the scheme delegates to
DNS to identify a machine on the network and suggests a protocol for
interacting with resources identified with URIs in this scheme. She
can read the DNS specifications to learn how to translate the machine
name into an IP address, she can read the TCP/IP specification to
learn how to communicate with a machine at a given IP address, and she
can read the protocol specification to learn how to interact with the
resource. That interaction will possibly return a stream of bits and
an identifier, such as a MIME media type, which will indicate how
those bits are to be interpreted. Following the media type
registration will lead to a format specification where she will learn
how to interpret the bits and what information content is embodied in
them. Now she "knows" the information content of that representation
despite the fact that the URI scheme, protocol, and format involved
were independently invented long after the web was born.
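
As a rough sketch of that chain (an illustration only, nothing normative),
here is the same walk in Python. The function names and the example.org URI
are made up for the example, and the Content-Type value is supplied by hand
rather than fetched, so nothing here touches the network.

```python
from email.message import Message
from urllib.parse import urlsplit


def split_uri(uri):
    """Apply the generic URI syntax to find the scheme, host, and path."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


def media_type_of(content_type_header):
    """Extract the media type that says how the returned bits are interpreted."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()  # lowercased, parameters stripped


def follow_your_nose(uri, content_type_header):
    """List, in order, the specifications the engineer would end up reading."""
    parts = split_uri(uri)
    media_type = media_type_of(content_type_header)
    return [
        "URI syntax specification",
        f"'{parts['scheme']}' URI scheme specification",
        f"DNS specifications (resolve {parts['host']})",
        "TCP/IP specification",
        f"protocol specification suggested by the '{parts['scheme']}' scheme",
        f"media type registration for {media_type}",
        f"format specification referenced by the {media_type} registration",
    ]


if __name__ == "__main__":
    # Hypothetical URI and header, chosen only for illustration.
    for step in follow_your_nose(
        "http://example.org/dirk/home", "application/xhtml+xml; charset=utf-8"
    ):
        print(step)
```

The point of the sketch is that each step is discovered from the output of
the step before it; nothing has to be known out of band except where to
start.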

An equally important, but even more often overlooked, reality is that
this property is, for lack of a better word, "invertible". When I mint
a URI, associate it with a resource, establish a server with which
communication can occur, and provide a representation and an
identifier that describes how that representation can be interpreted,
I have explicitly licensed the engineer to conclude that I made the
information content of that document available and I am responsible
for it.

To take a concrete example, if Dirk publishes this representation:

  <?xml version='1.0'?>
  <html xmlns="">
  <title>My Home Page</title>
  <meta name="Author" content="Dirk"/>
  <p>I like Brussels sprouts.</p>
  </html>

at and serves it with the MIME media
type "application/xhtml+xml", Dirk has in some real sense said he likes
Brussels sprouts.

When this chain of events begins with a URI and ends with a document
in a particular format, we can say that the information content of
that document is "grounded in the web."

It is important to the future of the web that it remains the case that
documents can be published which are grounded in the web, and in fact,
that it remains the *common case* that documents on the web are
grounded in the web.

It will always be possible, and sometimes necessary, to publish
documents which are not grounded in the web. Publishing a sequence of
Unicode characters that is not a well-formed XML document and
labelling it with an XML media type, for example, results in a
document with no information content that can be said to be grounded
in the web. The document isn't XML but it was identified as XML and
that's an unresolvable error.
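
To make that failure mode concrete, here is a minimal sketch in Python. The
function names are hypothetical, and well-formedness is treated only as a
necessary condition: passing the check does not by itself make a document
grounded in the web, but failing it under an XML media type rules it out.

```python
import xml.etree.ElementTree as ET


def is_xml_media_type(media_type):
    """True for application/xml, text/xml, and any */*+xml media type."""
    bare = media_type.split(";")[0].strip().lower()
    return bare in ("application/xml", "text/xml") or bare.endswith("+xml")


def well_formed_as_labelled(media_type, body):
    """Check that bytes labelled with an XML media type really are XML."""
    if not is_xml_media_type(media_type):
        raise ValueError(f"{media_type} is not an XML media type")
    try:
        ET.fromstring(body)
    except ET.ParseError:
        # Labelled as XML but not well-formed: there is no specification
        # left to follow that would say what these bits mean.
        return False
    return True


if __name__ == "__main__":
    print(well_formed_as_labelled("application/xml", b"<a><b/></a>"))
    print(well_formed_as_labelled("application/xml", b"<a><b></a>"))
```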

In many cases, it's sufficient to say that the information content of
a document is its media type and its bits. For example, a PNG image,
an RDF graph, and a text/plain document have whatever information
content the relevant format specifications say they have. In the
particular case of RDF, extracting this information may require an
appeal to subsequent specifications (RDF schemas, ontologies, etc.)
but this is entirely reasonable and within the definition of the
self-describing web that results in documents that are grounded in the
web.

However, documents identified simply as application/xml (and to some
extent application/*+xml) are a special case. XML was so obviously
and explicitly and intentionally designed as an extension point in the
web architecture that to say that the only information content of such
documents is that which the XML Recommendation gives them would be
akin to erecting a public nuisance on the web. The XML Recommendation
very clearly defines only the syntax of XML and offers almost no
description of the information content of the document at all.

Nevertheless, we now have a family of XML specifications that interact
in significant ways. Different XML vocabularies can be combined by
authors in nearly arbitrary ways. Independent invention arises every
day in the XML space.

In order to preserve the self-describing nature of the web, it has
been proposed that we define an "XML-functions" approach to
determining what information content can be understood from an XML
document that is grounded in the web. We cannot, and should not try
to, assert that all XML documents are grounded in the web; we need only
provide a framework for allowing authors to, in the common and usual
case, publish XML documents that *are* grounded in the web.

                                        Be seeing you,

Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.

Received on Tuesday, 3 January 2006 20:29:11 UTC