Re: Why Hash? from Sandro Hawke on 2002-04-08 (www-rdf-interest@w3.org from April 2002)

From: Sandro Hawke <sandro@w3.org>
Date: Mon, 08 Apr 2002 13:34:20 -0400
To: Aaron Swartz <me@aaronsw.com>
cc: RDF-Interest <www-rdf-interest@w3.org>
Message-Id: <200204081734.g38HYKQ12962@wadimousa.hawke.org>

> On 2002-04-07 12:19 PM, "Sandro Hawke" <sandro@w3.org> wrote:
> 
> > Interesting.  I've been thinking I need to write a "Why Hash? (Why Use
> > URI-References as RDF Identifiers)" paper.  I'll try out the argument
> > now.
> 
> I'm happy to skewer it. ;-)

Hey, I studied fencing for years.    Try it, man.   :-)

Seriously, most of your reply seems to be based on the mistaken idea
that a URI-Reference necessarily denotes a fragment of a document.
This is understandable, given the use of the word "fragment" in RFC
2396, but it's not right.  VRML put 3d-coordinates in the "fragment",
using it more as a viewing-angle, which is mostly how it's used in
HTML.

RDF can put arbitrary text there, and say the overall URI-Reference is
a logical constant symbol.    Served content can be consistent in its use
of the fragment (across all media types it serves) and clients should
only ask for media types they properly understand.   (I go into more
detail on this below, but the interesting stuff is all up here.)

Can we agree on this: we want semantic web client software to be able
to use a URI-Reference to obtain an RDF graph in which some node
SHOULD be labeled with the identifier.  The "SHOULD" applies to
the agency which coined the identifier and maintains various related
servers, not the client software.
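
A minimal sketch of that agreement, not any real client: drop the
fragment to find what to GET, then check whether the graph that comes
back labels some node with the full URI-Reference.  The example.org
identifier and the hand-built set of triples are assumptions standing
in for a real fetch and a real RDF parser.

```python
from urllib.parse import urldefrag


def document_uri(uri_reference):
    """Return the URI a client would actually GET (fragment removed)."""
    return urldefrag(uri_reference).url


def labels_node(triples, uri_reference):
    """True if some subject or object in the graph carries the identifier."""
    return any(uri_reference in (s, o) for s, _p, o in triples)


# Hypothetical identifier and graph, for illustration only.
ident = "http://example.org/people#alice"
graph = {
    (ident, "http://xmlns.com/foaf/0.1/name", "Alice"),
}

fetch_from = document_uri(ident)      # what the client dereferences
satisfied = labels_node(graph, ident)  # did the publisher honor the SHOULD?
```

The check is deliberately on the whole URI-Reference, fragment
included: the fragment is part of the identifier, not a pointer into
the bytes that were fetched.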

We can call this a GET operation, and REST-fans can call the RDF graph
serialization a "representation" of the thing denoted by the
identifier, if they want to.  I find it more clear to call that graph
"information about" the thing.  (As I am used to using the word
"representation", it's much closer to serialization, and that's where
I get the idea that the URI denotes the RDF graph.  But if you're fine
with saying that a description of a thing is a representation of a
thing, then I'm fine with saying that URIs can denote more than
descriptions.)

The Hash vs. Slash argument then comes down to this: Should every
subject have its own web page, or should each page have information
about several subjects?  If we allow pages to have information about
multiple subjects, then we do have to contend with the media-types &
fragment-id confusion; if we require one subject per page, we may get
confused about what the subject is.   
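
A small illustration of the difference, using made-up example.org
identifiers: hash-style names collapse onto one page once the fragment
is dropped, while slash-style names each get a page of their own.
This is only a sketch of the bookkeeping, not of any server.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def pages_for(identifiers):
    """Group identifiers by the page a client would fetch for each one."""
    pages = defaultdict(list)
    for ident in identifiers:
        pages[urldefrag(ident).url].append(ident)
    return dict(pages)


# Hypothetical vocabularies, one per camp.
hash_style = [
    "http://example.org/terms#Person",
    "http://example.org/terms#Dog",
]
slash_style = [
    "http://example.org/terms/Person",
    "http://example.org/terms/Dog",
]

hash_pages = pages_for(hash_style)    # one page, two subjects
slash_pages = pages_for(slash_style)  # two pages, one subject each
```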

I'm afraid both camps are here to stay and we'll have to deal with
both confusions.  Each one should of course keep finding ways to be
more clear.

The term "fragment" is unfortunate.  Terms like "focus" and
"subject" are more-accurate descriptions of what HTML viewers do, and
fit nicely with RDF's use.

[The rest of this e-mail is point-by-point, written first. ]

     -- sandro


> >
> > I think any string which won't be accidentally reused makes a decent
> > universal identifier.  UUIDs/GUIDs/tags, are fine for this.
> > 
> > Unfortunately, they don't help us locate any information about the
> > things identified.
> 
> Huh? TAGs give us an email address or a domain name, sure those things are
> useful for locating information. Even if they are not, there are more
> general solutions discussed below.

The email address and domain names in tag uris are like the ethernet
address in a UUID: there for the algorithm, not as contact
information.   They do give you a hint, but they're not intended to be
used that way.   If they start to get used that way, the system is
likely to break.

> > It would be very nice to use RDF identifiers kind
> > of like web address: you see one on the side of a bus, you type it in,
> > and you get some interesting information.  For this to work with
> > UUIDs, we'd need something like google in the background.  Seems like
> > a bad idea.
> 
> I assume that by Google you mean some sort of centralized system. This
> simply isn't true. There are a number of decentralized hash table (DHT)
> systems which make the UUID->content mapping easy with no centralization.
> Many people are working on such systems and new research is making them
> better and better with each passing week.
> 
> There are also IETF systems like RESCAP that allow for resolution of URNs
> and other such systems.
> 
> The assumption that we need to tie ourselves to centralized systems for
> secure naming is absurd. As Zooko's Law[1] states: "Names: Decentralized,
> Secure, Human-Memorizable: Choose Two". For decentralized and secure systems
> we have a whole series of tools at our disposal (DHTs, cryptographic hashes,
> digital signatures, etc.) and for human-readable and secure names we have
> tools that allow anyone two people have in common to be a centralization
> point (Pet Names[2], Google, DNS).
> 
> [1] http://zooko.com/distnames.html
> [2] http://www.erights.org/elib/capability/pnml.html

Great stuff.   I hope it pans out.   I need to look into it more.

I don't think any of this really pertains to this argument.  These
systems may provide alternate ways of maintaining a world-wide mapping
of short bit-strings (URIs) to longer bit-strings (content), but I
don't see how that affects things.   I suppose they won't allow content
negotiation, which may complicate things.

> The reason URIs have always appealed to me is that they allow one to choose
> any of these trade-offs while staying within a well-known system. While
> DNS-based URIs are currently popular, I expect that with the advent of
> systems like DHTs and Pet Names that their popularity will fall off.

Agreed.   URIs are great in having this flexibility.

> > The URI-Reference approach (which I've adopted, after flirting with
> > tag URIs) is to use URI-References as object identifiers, and URIs as
> > knowledge-base identifiers.
> 
> The problem with this as Roy Fielding, Uche Ogbuji and I point out is that
> you're tying your identifiers to the specifics of a serialization syntax and
> closing yourself off from all the tools the Web has for systems (redirects,
> content-negotiation, access-control, 404s).
>
> HTTP URIs provide a useful system of hierarchical delegation. URI-References
> put you at the mercy of whoever has posted the latest MIME type draft for
> your serialization syntax. With HTTP URIs I can tell if the author has
> created the URI or not (do I get a 404?) but with URI-references whoever
> wrote the fragment spec can create all the URI-refs and meanings for them
> that they want. 

I don't get this "mercy" argument.  Clients should only say they
accept media types they can understand [1] and the IETF should not
allow incompatible changes to the meaning of any media type.  So a
client asks for application/rdf+xml and it gets back
application/rdf+xml (or an error).  The RDF graph serialized in the
returned contents is likely to have nodes labeled with the
URI-reference the client was trying to learn about, or maybe have
information which will help lead the client to such graphs.
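
Roughly, and only as a sketch: the client names the media types it
understands in the Accept header, sends the request for the URI with
the fragment removed, and then refuses anything that comes back in a
type it did not ask for.  The function names and the example.org
identifier are assumptions; no request is actually sent here.

```python
from urllib.parse import urldefrag
from urllib.request import Request

RDF_XML = "application/rdf+xml"


def build_request(uri_reference, accept=(RDF_XML,)):
    """Prepare a GET that only advertises media types we can parse."""
    url = urldefrag(uri_reference).url  # the fragment never goes on the wire
    return Request(url, headers={"Accept": ", ".join(accept)})


def acceptable(content_type, accept=(RDF_XML,)):
    """True if the response's Content-Type is one we asked for."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in accept


# Hypothetical identifier, for illustration only.
req = build_request("http://example.org/people#alice")
```

A client in the spirit of the broader-range library mentioned below
would simply pass a longer `accept` tuple, one entry per format it
knows how to turn into an RDF graph.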

If my client uses the blindfold 2.0 library (which I'll be shipping
any day now :-) it will ask for a much broader range of media types,
all of which it knows how to turn into RDF Graphs.  As with
application/rdf+xml, the nodes in the RDF graph may be labeled with
the URI-Reference of interest.

When I hear people talk about web pages and the web, I hear them treat
web pages and web sites as sources of information, as collections of
information, the way they talk about books and libraries.
Interactive (POST-driven) sites are discussed as things you might also
send information to.    

REST suggests (to my limited understanding) that GET gives you a
representation of the identified thing.  A representation of IBM, of
TimBL, of the weather in NYC, ....  This makes no sense to me.  I
think you get back information about IBM, TimBL, the weather in NYC,
etc.  What is information about something?  It's an RDF graph where
that something is denoted by a node in the RDF graph.

> > A nearby approach, which I don't like, is to use URIs to denote
> > everything.  With this plan, the owner has the same ability to publish
> > easily-found information, but the whole system seems more confusing.
> 
> Confusing is in the eye of the beholder. There are a zillion confusing
> corner-cases with fragments, but I can see only one with full URIs (one that
> recent Internet-Drafts are attempting to solve). Some of the corner-cases:
> 
>  - What happens when (like with XPointer) someone retroactively adds new
> fragments to your document? If someone links to your 9th <p> tag, must you
> always have 9 <p> tags in the document?

URI-References do not necessarily denote parts of the document.   In
VRML documents, they denote points in 3-space.   In RDF documents,
they denote arbitrary things in the domain of discourse.

I don't see how your corner case here could possibly apply to RDF.

>  - What happens when the document changes? Does the meaning of the URI-ref
> change?

No, the denotation of a semantic web identifier (such as a URI-Reference
which can be used to get an RDF document) is always the same object,
by definition.    Like <mailto:me@aaronsw.com> always denotes that
mailbox. 

>  - What happens when you get back (via con-neg) a serialization syntax in
> which the fragment is illegal? [Think of getting back HTML where fragments
> with numbers are illegal, or an audio file where they are required.]

Don't serve content like that.

>  - If the fragment is tied to a series of bits (like a section of an audio
> file) does the meaning of the fragment change when the bits change? (i.e.
> Are fragments late-bound?)

I'm not talking about fragments, I'm talking about URI-References.

> Fragments are cloaked in mystery and differing interpretations of myriad
> specs, this is not a sound way to build a global identifier space.

Alas.    Still they have many practical features.

> > Now we're back to wondering what exactly http://www.w3.org/ denotes.
> 
> The recent Internet-Draft proposing Repr-Type and Resource-Type headers
> makes this clear. If you get a
> 
> Resource-Type: http://xmlns.com/foaf/0.1/Person
> 
> header, then it's a person. Alternately, you can just ask the W3C or stick
> with "safe" URI systems like UUIDs, GUIDs, TAGs or URNs.

Why put that in the headers?   Why not just include in the content
(assuming it's n3 for this example)....
      <> a foaf:Person.

That requires no IETF action, and it means the same thing, doesn't it?
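
To spell out what that one line of n3 says, here is a sketch that
expands it by hand: "<>" is the empty relative reference, so it
resolves to the URI of the document it appears in, and "a" is n3
shorthand for rdf:type.  The helper name is an assumption, and a real
parser would do this for you.

```python
from urllib.parse import urljoin

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"


def self_typed_triple(document_uri, class_uri):
    """Expand the n3 statement '<> a <class_uri>.' read from document_uri."""
    subject = urljoin(document_uri, "")  # '<>' resolves to the document itself
    return (subject, RDF_TYPE, class_uri)


# The document from the quoted example, typed as in the quoted header.
triple = self_typed_triple("http://www.w3.org/", FOAF + "Person")
```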

> > With the previous plan it's clear: it denotes a collection of
> > information (published by the W3C, probably about the W3C and other
> > things). 
> 
> "A collection of information" doesn't seem particularly clear to me...

That's my informal way to say "an RDF Graph" or, more generally "a
logical formula in some logic".

> I think you can see the problems now, but if you can't I'm happy to discuss
> it further.

Good.  :-)

I wonder if a distillation of this argument exists or should exist in
the RDF primer.   Something giving the practical pros and cons of hash,
slash, bNodes, and ungetables (tags, uuids).   I guess I should update
some of my old writings on this.

       -- sandro

[1] http://www.w3.org/TR/cuap#cp-http-accept
Received on Monday, 8 April 2002 13:36:25 UTC