RE: fragment identifiers from Joshua Allen on 2002-07-26 (www-tag@w3.org from July 2002)

From: Joshua Allen <joshuaa@microsoft.com>
Date: Fri, 26 Jul 2002 00:20:28 -0700
To: "Tim Bray" <tbray@textuality.com>, <www-tag@w3.org>
Message-ID: <4F4182C71C1FDD4BA0937A7EB7B8B4C105DCDC92@red-msg-08.redmond.corp.microsoft.com>

First, let's be clear that HTTP and REST have nothing at all to do with
resource identity.  There is absolutely no need for consensus on what
resource is being represented at a certain URL, so it is really silly to
argue about *what* the identity of that resource is.

An http: identifier in practice simply locates a "representation
dispenser for a resource".  The identity of the resource for which it
dispenses representations is immaterial, and a red herring in this
discussion.  That is because the http: identifier itself simply
identifies the *dispenser*.

In other words, http: identifiers are used to identify RESTful
representation dispensers, without regard to the actual thing being
represented.

There are people who say that an HTTP URL identifies an actual resource
beyond a simple representation dispenser, because "mumble mumble you can
indirectly identify the resource mumble mumble" or "mumble mumble it
says so in somebody's master's thesis mumble mumble".  But hardly anybody
in practice actually believes such silly premises, and the web sure as
heck doesn't depend on that interpretation.  The web functions quite
fine with the status quo, which is that an http: identifier identifies
an endpoint which serves up hypermedia.  

Try this thought experiment -- if we say that http: identifiers only
identify http-accessible endpoints which function as hypermedia
dispensers, what breaks?

The answer is, NOTHING breaks.

Now for the second part.  The semantic web will presumably talk about
many types of entities beyond just hypermedia dispensers.  (Again,
certain people will argue that HTTP already identifies things beyond
hypermedia dispensers, but I think I have shown above why this is
pathetically wrong.)

The *purpose* of an identifier is to unambiguously identify something.
Axioms [1] 1 and 2a of web design make it very clear that a URI should
not be context-sensitive and should not require disambiguation.  The
people who hate these axioms say:

A) *Everything* depends on context, so those axioms are paradoxical
B) Disambiguation might sometimes be required, so those axioms should be
abandoned
C) If something can be identified indirectly, there is no need for a
direct identifier

However, the existence of ambiguity does not make the pursuit of clarity
a worthless exercise.  The possibility (even certainty) that stupid
people will violate these axioms periodically is by no means an excuse
for despair.  People who violate these axioms will simply be blathering
nonsense which will be understood by nobody.  People who wish to
communicate on a semantic web will do their best to adhere to these
axioms.  (I am not sure if you question the wisdom of these axioms.
Please let me know if you would like some practical examples about why
axioms 1 and 2a are important.  Hopefully you agree that these are
fundamental.)

So, assuming we agree that axioms 1 and 2a apply, why is it dangerous to
use an http: identifier to identify a beach?  Well, for starters the
identifier can no longer unambiguously distinguish between a particular
instance of a hypermedia server and a beach.  There are four things
that can happen:

1) Most sane people (and all web browsers) will assume that the http:
identifier points to a hypermedia server, so you will find in practice
that it is difficult to impose YOUR definition of the "thing being
identified" on the population at large.

2) You could decide that it is OK for the URI to identify different
things depending on context.  But that of course is violating the axioms
of web design.

3) You could play games with words and declare that "the hypermedia
server IS the beach", or "the hypermedia server is non-existent and
ineffable".  In other words, you could claim that there is no difference
between the hypermedia server and any particular resource that you claim
to be representing.  However, this is also a clear violation of the
axioms.  This is appealing in a world where there is only GET, because
in HTTP, the distinction between representation *dispenser* and
*resource* is irrelevant and of no practical import.  The idea that the
hypermedia server is actually the resource is a silly vanity that we
politely ignore, because we only ever use the representation dispenser
anyway.  But the instant I want to *really* identify something other
than a representation dispenser, in a scenario that doesn't involve GET,
then I should avoid the temptation to overload the well-established
consensual meaning of the http: URL.  If I succumb, why stop at saying
that the beach and the HTTP endpoint were the same?  What's to prevent
us reducing the entire universe into one word?  

And it's very difficult to see how you could ever achieve consensus on
the meaning of your identifier.  Words are defined by consensus of
people actively *using* the words.  People use http: identifiers to
identify http-accessible representation dispensers, and they get
confused as heck when anyone tries to say that they identify something
different.  As a practical matter, it is easy to get people to
understand that you mean a representation dispenser when you use an
http: URL.  They can pull it up in a web browser and prove it for
themselves.  If you decide to identify something that is *not* a
representation dispenser (a book for example), and you use an http: URI,
then you are going to have a heck of a time explaining to people that
the URI is *not* a representation dispenser.  And if you decide it can
be both, woe be upon you.

4) You could take a different approach, and say that the http: part of
the URI is not a characteristic of the thing being identified, but is
rather an unnecessary appendage that has the side-effect of making it
easy for people to do a GET on the resource.  You could say that the
resource being identified is not something that is intended to be
primarily interacted with via HTTP, but at least GET is now possible.
This is the URI version of Pascal's Wager -- "if I don't stick http: on
the front, I can never do a GET, but if I *do*, it might confuse people
during the life here below, but I might be able to do a glorious GET one
day!"  This approach is the worst yet.  

For starters, the http: part of the name becomes all but useless in
identification.  If it doesn't assist identification, it shouldn't be
part of the name.  Why not fix HTTP and web browsers so that they can
handle any URI, even ones that don't begin with http:, and then kill the
http: scheme altogether?

Second, it becomes impossible to use a URI to identify a representation
dispenser.  Maybe that is OK with some people.

Third, you end up with chunks of the global word space owned by the
highest bidder.  For web sites, it is a *good* thing that
document/representation dispenser behavior is owned by the person who
pays for the DNS segment.  But the entire point of the semantic web is
that the words will NOT be owned by the people being talked about.  If
people are permitted to "own" words at the meta-level, the semantic web
is doomed.  

Consider a small example: suppose that Culinary Press establishes
identifiers for all of their books.  One particular book is identified
as:
http://www.culinary.com/books/pub-123
Thousands of book reviews are logged at various sites, and they are
almost universally negative.  Now, Culinary Press had long ago released
a similar book, with identifier:
http://www.culinary.com/books/pub-101
which was a smashing success.  Since that one is out of print now, and
pub-123 represents a large revenue potential, they decide to swap the
two identifiers.  Now when customers search for the new book, the site
tells them that its identifier is pub-101, and when they look up reviews
of that identifier on Google, they find the glowing reviews written for
the older book and are much more likely to buy the new one.
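
The swap can be sketched in a few lines of Python.  This is only an
illustration under stated assumptions: the `publisher_catalog` and
`reviews` dictionaries, the book titles, the review texts, and the
helper functions are all hypothetical; only the two URIs come from the
example above.

```python
# Hypothetical sketch: reviews are keyed by identifier, but only the
# publisher controls which book each identifier refers to.

NEW_BOOK = "http://www.culinary.com/books/pub-123"
OLD_BOOK = "http://www.culinary.com/books/pub-101"

# Publisher-controlled mapping from identifier to book (titles are made up).
publisher_catalog = {
    NEW_BOOK: "New Cookbook",
    OLD_BOOK: "Classic Cookbook",
}

# Third-party reviews, written while the original mapping was in effect.
reviews = {
    NEW_BOOK: ["negative", "negative"],
    OLD_BOOK: ["positive", "positive"],
}


def identifier_for(title, catalog):
    """Return the identifier the publisher currently advertises for a title."""
    for uri, book in catalog.items():
        if book == title:
            return uri
    raise KeyError(title)


def swap(catalog, uri_a, uri_b):
    """The publisher swaps two identifiers; the reviewers are not consulted."""
    catalog[uri_a], catalog[uri_b] = catalog[uri_b], catalog[uri_a]


# Reviews a customer finds for the new book before and after the swap.
before = reviews[identifier_for("New Cookbook", publisher_catalog)]
swap(publisher_catalog, NEW_BOOK, OLD_BOOK)
after = reviews[identifier_for("New Cookbook", publisher_catalog)]

print(before)
print(after)
```

Nothing in the reviewers' data changes between the two lookups; only the
publisher's side of the mapping does, which is the trust problem at
issue.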

The point is that people will be less likely to use words that they
don't trust, and using a word that is tied to DNS requires a person to
trust the administrator of the DNS segment.  Again, in HTTP you don't
care.  In the semantic web, it *does* matter.  If words are owned by the
highest bidder, people simply won't use them, and adoption will suffer.

An identifier scheme like urn:isbn:11101122 produces names which are not
tied to any particular DNS range, and are therefore less likely to be
"owned".  The word will take on the meaning that is established for it
via consensus, which is how words SHOULD form.  The URI scheme clearly
identifies that it is neutral to access method and network segment
ownership, so it is more likely to be perceived as trustworthy.  If
*you* wanted to write a book review about a book, and wanted to be sure
that the maximum number of people could find your review, which URI
scheme would *you* use?
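
The syntactic side of this difference can be seen by parsing both kinds
of identifier with Python's standard `urllib.parse` module.  This is a
minimal sketch, not a claim about how any semantic web tool works; the
variable names are hypothetical, and the two identifiers are the ones
used in this message.

```python
from urllib.parse import urlparse

# The http: identifier carries a DNS authority (netloc) inside the name.
http_id = urlparse("http://www.culinary.com/books/pub-123")

# The urn: identifier has no authority component at all.
urn_id = urlparse("urn:isbn:11101122")

print(http_id.scheme, repr(http_id.netloc), http_id.path)
print(urn_id.scheme, repr(urn_id.netloc), urn_id.path)
```

The parse only shows that the http: name embeds a host that somebody
administers while the urn: name does not; whether that makes one more
trustworthy is the argument being made here, not something the code
demonstrates.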

Finally, it's rather short-sighted.  What happens tomorrow when there
are pervasive ways to GET resource representations asynchronously?  What
about purely P2P systems like Freenet?  It would surely be a shame to
have all of your important objects identified with some sad and lonely
http: stuck to the front when HTTP is obsolete and nobody does
synchronous GET over port 80 anymore.

[1] http://www.w3.org/DesignIssues/Axioms.html

> -----Original Message-----
> From: Tim Bray [mailto:tbray@textuality.com]
> Sent: Thursday, July 25, 2002 4:10 PM
> To: www-tag@w3.org
> 
> 
> Joshua Allen wrote:
> 
> > In other words, for HTTP this is simply sophistry.  If we ever want the
> > web to progress beyond the shackles of synchronous HTTP GET, we need to
> > deal with the problem as a practical matter.
> 
> Why? -Tim
Received on Friday, 26 July 2002 03:21:04 UTC