Re: Why Hash? from Aaron Swartz on 2002-04-07 (www-rdf-interest@w3.org from April 2002)

From: Aaron Swartz <me@aaronsw.com>
Date: Sun, 07 Apr 2002 12:59:02 -0500
To: Sandro Hawke <sandro@w3.org>
CC: RDF-Interest <www-rdf-interest@w3.org>
Message-ID: <B8D5F316.30073%me@aaronsw.com>
On 2002-04-07 12:19 PM, "Sandro Hawke" <sandro@w3.org> wrote:

> Interesting.  I've been thinking I need to write a "Why Hash? (Why Use
> URI-References as RDF Identifiers)" paper.  I'll try out the argument
> now.

I'm happy to skewer it. ;-)
 
> I think any string which won't be accidentally reused makes a decent
> universal identifier.  UUIDs/GUIDs/tags are fine for this.
> 
> Unfortunately, they don't help us locate any information about the
> things identified.

Huh? TAGs give us an email address or a domain name; surely those things are
useful for locating information. Even if they are not, there are more
general solutions, discussed below.

> It would be very nice to use RDF identifiers kind
> of like web address: you see one on the side of a bus, you type it in,
> and you get some interesting information.  For this to work with
> UUIDs, we'd need something like google in the background.  Seems like
> a bad idea.

I assume that by Google you mean some sort of centralized system. It simply
isn't true that we'd need one. There are a number of distributed hash table
(DHT) systems which make the UUID->content mapping easy with no
centralization. Many people are working on such systems and new research is
making them better and better with each passing week.
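
As a rough illustration of the UUID->content mapping, here is a toy,
single-process sketch of the placement idea behind DHTs: hash each key onto
a ring and store it on the first node at or after that position. The class
ToyDHT, the node names, and the stored string are all made up for this
example; real DHTs add routing between nodes, replication, and handling of
nodes that join or leave.

```python
import bisect
import hashlib
import uuid


def ring_position(value: str) -> int:
    """Map any string onto a position in a 160-bit identifier ring."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class ToyDHT:
    """Single-process stand-in for a distributed hash table (hypothetical)."""

    def __init__(self, node_names):
        # Each node sits at the ring position given by the hash of its name.
        self._ring = sorted((ring_position(name), name) for name in node_names)
        self._positions = [pos for pos, _ in self._ring]
        self._stores = {name: {} for name in node_names}

    def node_for(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]

    def put(self, key: str, content: str) -> None:
        self._stores[self.node_for(key)][key] = content

    def get(self, key: str):
        return self._stores[self.node_for(key)].get(key)


nodes = ["node-a.example", "node-b.example", "node-c.example"]
dht = ToyDHT(nodes)
identifier = uuid.uuid4().urn  # a fresh "urn:uuid:..." identifier
dht.put(identifier, "description of the identified thing")
```

Any participant that knows the node list can compute node_for(identifier)
on its own, so no central index has to be consulted to find the content.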

There are also IETF systems like RESCAP that allow for resolution of URNs
and other such identifiers.

The assumption that we need to tie ourselves to centralized systems for
secure naming is absurd. As Zooko's Law[1] states: "Names: Decentralized,
Secure, Human-Memorizable: Choose Two". For decentralized and secure names
we have a whole series of tools at our disposal (DHTs, cryptographic hashes,
digital signatures, etc.), and for human-memorizable and secure names we have
tools that let anything two people have in common serve as a centralization
point (Pet Names[2], Google, DNS).

[1] http://zooko.com/distnames.html
[2] http://www.erights.org/elib/capability/pnml.html
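
To make the "decentralized and secure" corner concrete, here is a minimal
sketch of a name derived from a cryptographic hash of the content it
identifies. Anyone can check that the content matches the name without
asking a central authority, but the name is not something a person could
memorize. The "sha256:" prefix and the helper names are assumptions made up
for this example, not a registered scheme.

```python
import hashlib


def name_for(content: bytes) -> str:
    """Derive a name from the content itself (hypothetical 'sha256:' prefix)."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def verify(name: str, content: bytes) -> bool:
    """Check that the content is what the name commits to."""
    return name == name_for(content)


document = b"<rdf:RDF>...</rdf:RDF>"
name = name_for(document)
genuine = verify(name, document)
tampered = verify(name, document + b" ")
```

Here genuine is True and tampered is False: changing even one byte of the
content breaks the match with the name.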

The reason URIs have always appealed to me is that they allow one to choose
any of these trade-offs while staying within a well-known system. While
DNS-based URIs are currently popular, I expect that with the advent of
systems like DHTs and Pet Names their popularity will fall off.
 
> The URI-Reference approach (which I've adopted, after flirting with
> tag URIs) is to use URI-References as object identifiers, and URIs as
> knowledge-base identifiers.

The problem with this, as Roy Fielding, Uche Ogbuji and I point out, is that
you're tying your identifiers to the specifics of a serialization syntax and
closing yourself off from all the tools the Web already has (redirects,
content-negotiation, access-control, 404s).

HTTP URIs provide a useful system of hierarchical delegation. URI-References
put you at the mercy of whoever has posted the latest MIME type draft for
your serialization syntax. With HTTP URIs I can tell if the author has
created the URI or not (do I get a 404?) but with URI-references whoever
wrote the fragment spec can create all the URI-refs and meanings for them
that they want.
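
One concrete way to see the 404 point: an HTTP client strips the fragment
before it sends the request, so the server never sees the fragment and
cannot answer 404 for one the author never defined. The sketch below only
computes what would go on the request line; the request_target helper and
the example.org URIs are hypothetical.

```python
from urllib.parse import urldefrag, urlsplit


def request_target(uri_reference: str) -> str:
    """Return the path-and-query a client would send for this URI-reference."""
    uri, _fragment = urldefrag(uri_reference)  # the fragment stays client-side
    parts = urlsplit(uri)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


defined_term = request_target("http://example.org/vocab#Person")
undefined_term = request_target("http://example.org/vocab#NoSuchTerm")
full_uri = request_target("http://example.org/vocab/Person")
```

Both fragment forms yield the same request for /vocab, so the server's
answer cannot distinguish them, while the full URI gets its own request
that the server can answer or refuse.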

> A nearby approach, which I don't like, is to use URIs to denote
> everything.  With this plan, the owner has the same ability to publish
> easily-found information, but the whole system seems more confusing.

Confusing is in the eye of the beholder. There are a zillion confusing
corner-cases with fragments, but I can see only one with full URIs (one that
recent Internet-Drafts are attempting to solve). Some of the corner-cases:

 - What happens when (like with XPointer) someone retroactively adds new
fragments to your document? If someone links to your 9th <p> tag, must you
always have 9 <p> tags in the document?

 - What happens when the document changes? Does the meaning of the URI-ref
change?

 - What happens when you get back (via con-neg) a serialization syntax in
which the fragment is illegal? [Think of getting back HTML where fragments
with numbers are illegal, or an audio file where they are required.]

 - If the fragment is tied to a series of bits (like a section of an audio
file) does the meaning of the fragment change when the bits change? (i.e.
Are fragments late-bound?)

Fragments are cloaked in mystery and differing interpretations of myriad
specs; this is not a sound way to build a global identifier space.
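
The first two corner-cases can be sketched with a toy positional pointer.
This is not XPointer itself, just an assumed "nth paragraph" resolver over a
list of strings, with made-up document content, to show how the same
reference changes meaning or dangles when the document is edited.

```python
def resolve_nth_paragraph(paragraphs, n):
    """Resolve a 1-based positional pointer against the current document."""
    if 1 <= n <= len(paragraphs):
        return paragraphs[n - 1]
    return None  # the pointer no longer refers to anything


original = ["Intro", "Definitions", "The claim being cited"]
pointer = 3  # someone links to the third paragraph

before = resolve_nth_paragraph(original, pointer)

# The author adds a preface: the same pointer now lands on other text.
revised = ["New preface"] + original
after = resolve_nth_paragraph(revised, pointer)

# The author trims the document: the same pointer now dangles.
shortened = original[:2]
dangling = resolve_nth_paragraph(shortened, pointer)
```

The pointer is late-bound to whatever sits in that position, so its meaning
depends on every later edit to the document.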

> Now we're back to wondering what exactly http://www.w3.org/ denotes.

The recent Internet-Draft proposing Repr-Type and Resource-Type headers
makes this clear. If you get a

Resource-Type: http://xmlns.com/foaf/0.1/Person

header, then it's a person. Alternately, you can just ask the W3C or stick
with "safe" URI systems like UUIDs, GUIDs, TAGs or URNs.
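
A minimal sketch of how a client might act on such a header, assuming the
response headers have already been collected into a dictionary. Only the
header name and the FOAF Person URI come from the text above; the helper
name, the sample headers, and the idea that a client would read the value
this way are assumptions, not something the draft is known to specify.

```python
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"


def declared_resource_type(headers):
    """Return the Resource-Type value, matching the header name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "resource-type":
            return value.strip()
    return None  # the server made no statement about the resource


# Hypothetical response headers for illustration only.
response_headers = {
    "Content-Type": "text/html",
    "resource-type": " http://xmlns.com/foaf/0.1/Person ",
}

is_person = declared_resource_type(response_headers) == FOAF_PERSON
unknown = declared_resource_type({"Content-Type": "text/html"})
```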

> With the previous plan it's clear: it denotes a collection of
> information (published by the W3C, probably about the W3C and other
> things).

"A collection of information" doesn't seem particularly clear to me...

> If you use URIs for everything, you're essentially running a
> great risk of accidental identifier re-use.

I think this is hardly as serious as the accidental-re-use problems with
fragments. What happens if I point to a section of an MP3 file where TimBL
describes "test of independent invention" and then someone adds an
introduction to the MP3 bumping my selected fragment-space to a bunch of
applause?

I think you can see the problems now, but if you can't I'm happy to discuss
it further.

All the best,
-- 
      "Aaron Swartz"      |              The Semantic Web
 <mailto:me@aaronsw.com>  |  <http://logicerror.com/semanticWeb-long>
<http://www.aaronsw.com/> |        i'm working to make it happen
Received on Sunday, 7 April 2002 13:59:06 UTC