Re: Clarifying what a URL identifies (Four Uses of a URL) from Sandro Hawke on 2003-01-22 (www-tag@w3.org from January 2003)

From: Sandro Hawke <sandro@w3.org>
Date: Wed, 22 Jan 2003 00:17:24 -0500
To: Tim Bray <tbray@textuality.com>
cc: David Booth <dbooth@w3.org>, Michael Mealling <michael@neonym.net>, www-tag@w3.org
Message-Id: <200301220517.h0M5HO316710@wadimousa.hawke.org>
> I suggest you read RFC2396 and the Webarch draft.  When I say a 
> formalism I mean formalism.   A resource is per RFC2396 "anything that 
> has identity" and a URI is that which identifies a resource.

As Mike Mealling puts it -- a "platonic ideal".  There is exactly one
resource per URI by definition.  (Or, roughly, until you start getting
301 responses.)  We can't know what that resource "is"; it's just an
unknowable mental construct RFC2396 defines as existing.

> A resource, thus defined, has access mechanisms whereby you can retrieve 
> and update representations.  This formalism is complete, consistent, and 
> highly robust in practice, underlying the construction of the most 
> successful information system in history.

In fairness, I think this only applies to HTTP 1.1, not the entire
web.  Yes, the design of HTTP 1.1 uses an elegant and effective
abstraction; from TimBL's earlier ideas of information resources (which
he likes to call documents), this protocol makes a further step into
abstraction, saying, in effect, "We don't care what an HTTP URI
identifies; whatever the 'resource' is, we just handle (MIME entity)
representations of it."  And that's a fine thing, within the context
of HTTP itself.  As far as I can think right now, you are right here
in saying this model is consistent and has been very useful.
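
A minimal sketch of that abstraction, assuming a hypothetical in-memory
store (the names Representation, REPRESENTATIONS and get are
illustrative, not from any HTTP library).  The point it tries to show
is that nothing in the code models the resource itself; only
representations keyed by URI and media type exist.  Real HTTP content
negotiation (q-values, wildcards) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Representation:
    """A MIME entity: a media type plus some bytes."""
    media_type: str
    body: bytes


# Hypothetical store: each URI maps to the representations the server can emit.
# There is no object anywhere that stands for "the resource" itself.
REPRESENTATIONS = {
    "http://x.org/love": [
        Representation("text/html", b"<p>love</p>"),
        Representation("text/plain", b"love"),
    ],
}


def get(uri: str, accept: str) -> Optional[Representation]:
    """Return the first stored representation whose media type matches exactly."""
    for representation in REPRESENTATIONS.get(uri, []):
        if representation.media_type == accept:
            return representation
    return None


plain = get("http://x.org/love", "text/plain")
print(plain.media_type if plain else "no matching representation")
```

The lookup succeeds or fails without ever asking what
http://x.org/love identifies, which is the sense in which the model
is self-contained.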

> I admire your chutzpah in charging here and making claims about the 
> undefinedness of the term "Resource" but that doesn't mean you're 
> anything but hopelessly wrong.

You could take David's message as a sign that a whole raft of
professional software developers think this notion of Resource, while
workable, is somewhere between poorly explained and imperfect.
Working *perfectly* for HTTP is not evidence that it works anywhere
else.   (Other people have cited the parable of the blind men and the
elephant.)   And the success of the Web is of course due to many, many
factors.

> You go on to observe correctly that once you step outside the formalism, 
> a resource can in fact be all sorts of different things, and that it 
> would help if we had a way to talk about what kind of thing it is.  I 
> agree with all of that.  However, the web architecture as it stands 
> works just fine without being able to talk about what any particular 
> resource "is" aside from "that which is identified by its particular URI".

If web architecture == HTTP 1.1, then sure. 

Once you step outside the formalism, not only do you want to know what
kind of thing a specific Resource is, but you notice that everyone is
using each URI to identify several distinct things.   So the
fundamental premise of 2396 breaks as soon as you step outside the
formalism.

> In the Web Architecture formalism, http://x.org/love identifies only one 
> resource.  In the real world, I can learn about that resource by 
> retrieving representations of it (if any are available), and more by 
> processing RDF assertions about it (if any are available).  The Web 
> architecture doesn't talk about meanings, it talks about resources and 
> representations.  There's nothing wrong with talking about meaning, and 
> I look forward to the day when I can reliably retrieve some RDF 
> assertions and learn that this particular URI identifies nothing but a 
> JPG of a cute cat, and this other one identifies the inner thought of a 
> drug-addled conceptual artist.  This would be good and useful.

And if you HTTP GET a representation of the artist, what will the
Last-Modified field mean?  It doesn't mean when the representation was
last modified, or when the resource (the artist) was last modified.

To quote RFC 2616:

    The exact meaning of this header field depends on the
    implementation of the origin server and the nature of the original
    resource. For files, it may be just the file system last-modified
    time. For entities with dynamically included parts, it may be the
    most recent of the set of last-modify times for its component
    parts. For database gateways, it may be the last-update time stamp
    of the record. For virtual objects, it may be the last time the
    internal state changed. 

This is where you have to start making a distinction between objects
in the domain of discourse (like artists), and information which
computer systems hold about those objects.  In object-oriented design,
you intentionally ignore this distinction, but when you start getting
into manipulating the data directly, you need to notice it again.  Can
you draft text about Last-Modified that makes sense with the resource
being an artist?
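
A minimal sketch of the distinction, assuming a hypothetical Record
type that separates the subject (the object in the domain of
discourse) from the stored information about it.  The function names
are illustrative; the two cases mirror the "database gateways" and
"dynamically included parts" cases in the RFC 2616 text quoted above.
Note that only the stored information has a timestamp to report.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import format_datetime


@dataclass
class Record:
    """What a server holds *about* a subject (hypothetical model)."""
    subject: str        # the thing in the domain of discourse, e.g. an artist
    updated: datetime   # when the stored information last changed


def last_modified_for_record(record: Record) -> datetime:
    # Database-gateway case: the last-update time stamp of the record.
    # This says nothing about when the subject itself "changed".
    return record.updated


def last_modified_for_composite(part_times: list[datetime]) -> datetime:
    # Dynamically-included-parts case: the most recent of the part times.
    return max(part_times)


def http_date(moment: datetime) -> str:
    # Format an aware UTC datetime in the style used by HTTP date headers.
    return format_datetime(moment, usegmt=True)


artist = Record(
    subject="a conceptual artist",
    updated=datetime(2003, 1, 22, 5, 17, 24, tzinfo=timezone.utc),
)
print("Last-Modified:", http_date(last_modified_for_record(artist)))
```

In this sketch the header value can only be derived from
Record.updated; there is no field from which a "last modified" time
for the artist could come.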

So maybe the a-Resource-is-anything-with-Identity idea doesn't even
really hold for HTTP 1.1.   [ Sandro continues his argument that the
word "Resource" masks an underlying disagreement and confusion even in
the design of HTTP 1.1.   It's not so bad that an expert human
implementor can't sort it out and know when it refers to the object in
the domain of discourse and when it refers to the computer's
information about that object, ... but it is bad. ]

> At the moment, speaking for myself, my impression is that the TAG has no 
> intention of saying anything beyond what's in 2396 and the Webarch draft.

Then I wonder where this will get worked out. 

My best idea right now is to start a collection of ontologies of the
web.   The need for a single vision will be much reduced if/when the
various different visions are clearly laid out.  Any fans of 2396 and
2616 psyched to encode them into OWL?   (Dan Connolly is the shoo-in,
but I couldn't possibly motivate him to do this.)

> The reason I'm willing to put so much energy into this is that I 
> agonized for a long time over the fact that in reality URIs identify 
> lots of different kinds of things and everybody was ignoring this 
> elephant in the room.  Weirdly enough, this angst never got in the way 
> of my building spiders and search engines and visual maps of webspace 
> and all sorts of other useful things.  It is quite possible that the Web 
> Architecture works *because* it works around the intractable problems of 
> meaning and only deals with comparing identifiers and shuffling 
> representations around; avoiding a lot of problems that historically 
> have been intractable.

I wonder how different the web would be without HTTP.  How much of the
web functionality we use today could be implemented just fine with a
subset of 1985's RFC 959 FTP protocol, accessed via ftp: URLs?  In the
days of Mosaic, I saw web sites done like this; it worked because
Mosaic assumed Content-Type text/html when the filename ended in
.html.  So the content-type abstraction would have to be done
differently, and POST would have to be done more explicitly using
STOU (Store Unique), which might be a good thing.  Various performance
issues (number of TCP connections, cache support) would come out
differently to be sure.  But the nature of URIs would be so much more
clear when they were "obviously" just filenames.   (Of course there
would be an equivalent of CGI, it would just be imagined slightly
differently.)   What does a filename (or file: URI) identify, and how
is that really different from an HTTP URI?
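
A minimal sketch of the filename-based idea, assuming Python's
standard mimetypes and urllib.parse modules and a hypothetical host
example.org.  It is not a description of how Mosaic was implemented;
it only shows that a content type can be guessed from the path alone,
and that the path component looks the same whether the scheme is
ftp:, file: or http:.

```python
import mimetypes
from urllib.parse import urlsplit


def guess_content_type(url: str) -> str:
    """Guess a media type from the filename extension in the URL path."""
    path = urlsplit(url).path
    guessed, _encoding = mimetypes.guess_type(path)
    # Fall back to a generic type when the extension is unknown.
    return guessed or "application/octet-stream"


# Hypothetical URLs that differ only in scheme and host.
urls = [
    "ftp://example.org/pub/index.html",
    "file:///pub/index.html",
    "http://example.org/pub/index.html",
]

for url in urls:
    parts = urlsplit(url)
    print(parts.scheme, parts.path, guess_content_type(url))
```

All three URLs yield the same path and the same guessed type; what
differs is only the access mechanism named by the scheme.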

     -- sandro
Received on Wednesday, 22 January 2003 00:19:27 UTC