- From: Norman Walsh <Norman.Walsh@Sun.COM>
- Date: Tue, 03 Jan 2006 15:28:54 -0500
- To: www-tag@w3.org
- Message-ID: <87wthh6oop.fsf@nwalsh.com>
Several current TAG issues (at least namespaceDocuments-8 (maybe), xmlFunctions-34, RDFinXHTML-35, rdfURIMeaning-39, and namespaceState-48 (maybe)) relate, in one way or another, to the "self describing" nature of the web. That is, the principle that you can start somewhere and "follow your nose" to work out what you've got.

It came up on today's TAG call and it came up at the December face-to-face. Following the face-to-face, I tried to write down what I thought we meant by the self-describing web and why it's an important feature.

This was drafted with a mind towards it being the preface to a finding on xmlFunctions-34 which Henry and I are on the hook to draft. (It's also closely related to the charter of the XML Processing Model WG which I'm chairing.)

Anyway, I floated it a bit privately with mixed results so I'm just going to heave it into the public and see what reaction it elicits. :-)

---

The web has been successful for both social and technological reasons. Broadly it is composed of identifiers, protocols, and formats that are sufficiently orthogonal that innovation can occur independently in these three spaces: a new format can be deployed over an existing protocol; a new protocol can be used to transmit an existing format; and, when necessary, a new identifier scheme can be invented which is, in principle, independent of the protocol used to interact with resources thus identified and can identify resources with representations in any format.

An important, but sometimes overlooked, property of the web which enables this independent innovation over identifiers, protocols, and formats is that the web is largely self-describing.

One common interaction pattern proceeds like this: an engineer, presented with a URI, can read the URI syntax specification to learn what components are in the URI.
This will lead to a URI scheme specification where she will find information about how to access resources identified with that scheme (assuming such access is possible). For example, she might find that the scheme delegates to DNS to identify a machine on the network and suggests a protocol for interacting with resources identified with URIs in this scheme. She can read the DNS specifications to learn how to translate the machine name into an IP address, she can read the TCP/IP specification to learn how to communicate with a machine at a given IP address, and she can read the protocol specification to learn how to interact with the resource.

That interaction will possibly return a stream of bits and an identifier, such as a MIME media type, which will indicate how those bits are to be interpreted. Following the media type registration will lead to a format specification where she will learn how to interpret the bits and what information content is embodied in them. Now she "knows" the information content of that representation despite the fact that the URI scheme, protocol, and format involved were independently invented long after the web was born.

An equally important, but even more often overlooked, reality is that this property is, for lack of a better word, "invertible". When I mint a URI, associate it with a resource, establish a server with which communication can occur, and provide a representation and an identifier that describes how that representation can be interpreted, I have explicitly licensed the engineer to conclude that I made the information content of that document available and I am responsible for it.
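
The forward, follow-your-nose chain described above can be sketched roughly in Python. This is only an illustration under stated assumptions: SCHEME_SPECS, FORMAT_READERS, and follow_your_nose are hypothetical names, the two tables stand in for the real URI scheme and media type registries, and the DNS and TCP/IP steps are left out so the sketch runs without a network.

```python
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the URI scheme registry.
SCHEME_SPECS = {
    "http": "HTTP specification (host resolved via DNS, then TCP/IP)",
}


def read_plain_text(body: bytes) -> str:
    """text/plain: the information content is the decoded characters."""
    return body.decode("utf-8")


def read_xml_root(body: bytes) -> str:
    """XML types: parse the bits; ill-formed input raises ET.ParseError."""
    return ET.fromstring(body).tag


# Hypothetical stand-in for the media type registry.
FORMAT_READERS = {
    "text/plain": read_plain_text,
    "application/xml": read_xml_root,
    "application/xhtml+xml": read_xml_root,
}


def follow_your_nose(uri: str, media_type: str, body: bytes) -> dict:
    """Walk the chain: URI syntax -> scheme -> media type -> format."""
    parts = urlsplit(uri)                        # URI syntax specification
    if parts.scheme not in SCHEME_SPECS:         # URI scheme specification
        raise LookupError(f"no specification known for scheme {parts.scheme!r}")
    if media_type not in FORMAT_READERS:         # media type registration
        raise LookupError(f"no format registered for {media_type!r}")
    content = FORMAT_READERS[media_type](body)   # format specification
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "scheme_spec": SCHEME_SPECS[parts.scheme],
        "content": content,
    }
```

Each lookup depends only on the result of the previous step, which is the sense in which a new scheme or format can be added to one table without touching the others.
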

To take a concrete example, if Dirk publishes this representation:

  <?xml version='1.0'?>
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>My Home Page</title>
      <meta name="Author" content="Dirk"/>
    </head>
    <body>
      <p>I like brussel sprouts.</p>
    </body>
  </html>

at http://www.example.org/home/dirk/ and serves it with the MIME media type "application/xhtml+xml", Dirk has in some real sense said he likes brussel sprouts.

When this chain of events begins with a URI and ends with a document in a particular format, we can say that the information content of that document is "grounded in the web."

It is important to the future of the web that it remains the case that documents can be published which are grounded in the web, and in fact, that it remains the *common case* that documents on the web are grounded in the web.

It will always be possible, and sometimes necessary, to publish documents which are not grounded in the web. Publishing a sequence of Unicode characters that is not a well-formed XML document and labelling it with an XML media type, for example, results in a document with no information content that can be said to be grounded in the web. The document isn't XML but it was identified as XML and that's an unresolvable error.

In many cases, it's sufficient to say that the information content of a document is its media type and its bits. For example, a PNG image, an RDF graph, and a text/plain document have whatever information content the relevant format specifications say they have. In the particular case of RDF, extracting this information may require an appeal to subsequent specifications (RDF schemas, ontologies, etc.) but this is entirely reasonable and within the definition of the self-describing web that results in documents that are grounded in the web.

However, documents identified simply as application/xml (and to some extent application/*+xml) are a special case.
XML was so obviously and explicitly and intentionally designed as an extension point in the web architecture that to say that the only information content of such documents is that which the XML Recommendation gives them would be akin to erecting a public nuisance on the web. The XML Recommendation very clearly defines only the syntax of XML and offers almost no description of the information content of the document at all.

Nevertheless, we now have a family of XML specifications that interact in significant ways. Different XML vocabularies can be combined by authors in nearly arbitrary ways. Independent invention arises every day in the XML space.

In order to preserve the self-describing nature of the web, it has been proposed that we define an "XML-functions" approach to determining what information content can be understood from an XML document that is grounded in the web.

We cannot, and should not try to, assert that all XML documents are grounded in the web; we need only provide a framework for allowing authors to, in the common and usual case, publish XML documents that *are* grounded in the web.

Be seeing you,
norm

--
Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.
Received on Tuesday, 3 January 2006 20:29:11 UTC