Re: Hashed URIs from Graham Klyne on 2002-11-25 (ietf-discuss@w3.org from November 2002)

From: Graham Klyne <GK@Ninebynine.org>
Date: Mon, 25 Nov 2002 17:15:08 +0000
To: Paul Hoffman / IMC <phoffman@imc.org>
Cc: "Clive D.W. Feather" <clive@demon.net>, discuss@apps.ietf.org, uri@w3.org
Message-Id: <5.1.0.14.2.20021125170550.0421d800@127.0.0.1>

Responding to [4] (http://lists.w3.org/Archives/Public/uri/2002Nov/0024.html):

At 08:49 AM 11/25/02 -0800, Paul Hoffman / IMC wrote:
>3) Which leads to the biggest question: where are the real-world uses for 
>this? Which actual content providers have said "I don't want to reveal the 
>real URL for the thing you are comparing"? If those folks exist, they 
>could easily answer the first two objections. Quite frankly, this seems 
>like a cute and well thought-out idea searching for a problem.

Hi Paul,

There is one possible use-case of which I'm aware.

A group of folks are experimenting with RDF metadata in a project called 
RDFWeb [1].  This is currently a research community project that has 
attracted some wider interest (e.g. [2]).  One of the concerns is that email 
addresses in the form of mailto: URLs, used as one way of identifying 
people, are too easy a target for spammers.  The idea of using hashed URIs 
has been raised [3] as a way to keep their identifying properties while 
protecting the corresponding email address.

I generally agree with the sentiment of the other points you make.  SHA-1 
seems like a good choice for the hashing function.
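
To make the use-case concrete, here is a minimal sketch of comparing two 
mailto: URIs by their SHA-1 hashes without publishing the address itself.  
The helper name hash_uri and the example.org addresses are illustrative 
assumptions; nothing here reproduces the syntax or canonicalization rules 
of the draft in [5].

```python
import hashlib


def hash_uri(uri: str) -> str:
    """Return the SHA-1 digest of a URI string as lowercase hex.

    Assumption: the URI is hashed exactly as given (UTF-8 bytes), with no
    canonicalization applied first.
    """
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


# A publisher exposes only the digest, not the address.
published = hash_uri("mailto:someone@example.org")

# Anyone who already knows the address can confirm a match ...
same_person = hash_uri("mailto:someone@example.org") == published

# ... while a different address yields a different digest.
other_person = hash_uri("mailto:someone.else@example.org") == published

print(same_person, other_person)
```

The digest still works as an identifier for people who already hold the 
address, but a harvester reading the published data sees only the hash.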

...

Further comments, concerning [5]:

- this registers a new URI scheme, which IIRC requires a standards-track RFC.

- what does this URI actually identify?  The same thing as the hashed URI 
identifies, or the hashed URI itself?  I think it's the latter, so this 
could reasonably be presented as a URN namespace [7].  This would call for 
an RFC, but not necessarily standards track.

- I think some attempt at canonicalization may be useful (e.g. per section 
4.3), but I think that appendix A attempts too much, and may result in some 
unreasonable failures (especially changing the file extensions).  When 
using hashed URIs, I think one should presume that the URI has been 
obtained directly or indirectly from a definitive source before hashing.

In the use-case of which I'm aware, it is reasonable to assume either that 
the URI has been obtained authoritatively or that common variations can be 
stored.  Further, an application is not prevented from trying some simple 
heuristics to obtain a match.

- I'm intrigued as to what purpose the +frag, +query flags serve.  Also, in 
including fragment identifiers as part of the hashed URI, users should be 
aware that some uses of URIs require that any fragment be stripped before 
the URI is used (e.g. for retrieving a web resource; see [6]).  I guess 
similar considerations apply to the query part with relation to the 
resource on the target server.  So maybe I do see some point, if only as a 
diagnostic for mismatches.

- I suggest that the case of alpha values in identifiers be normalized 
(i.e. always a-z or A-Z), or the URI be declared case sensitive, to 
simplify hash value comparison.  (Er, that is, if identifiers are really 
required.  Per Paul's suggestion, this might be deleted.  At least, start 
out with just one.)
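
As a rough illustration of the last few points (simple matching heuristics, 
stripping the fragment or query, and case-normalized comparison), the sketch 
below tries a few variants of a URI against a published hash.  It assumes 
the hash value is carried as hexadecimal text, which is my assumption and 
not something taken from the draft; the function names are hypothetical and 
this is not the canonicalization of appendix A.

```python
import hashlib
from urllib.parse import urldefrag, urlsplit, urlunsplit


def sha1_hex(uri: str) -> str:
    """SHA-1 of the URI's UTF-8 bytes, as lowercase hex."""
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


def strip_query(uri: str) -> str:
    """Return the URI with its query component removed."""
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def candidates(uri: str) -> list:
    """The URI as given, then without fragment, then without fragment or query."""
    no_frag = urldefrag(uri).url
    variants = []
    for v in (uri, no_frag, strip_query(no_frag)):
        if v not in variants:  # avoid hashing the same string twice
            variants.append(v)
    return variants


def matches(uri: str, published_digest: str) -> bool:
    """True if any simple variant of the URI hashes to the published digest.

    Both sides are compared in lowercase, so A-F versus a-f in the
    published value does not cause a spurious mismatch.
    """
    wanted = published_digest.strip().lower()
    return any(sha1_hex(v) == wanted for v in candidates(uri))


# Published digest written in upper case, computed over the bare URI.
published = sha1_hex("http://example.org/doc").upper()
print(matches("http://example.org/doc?x=1#sec2", published))
print(matches("http://example.org/other", published))
```

Which variant matched is itself useful information: it tells the 
application whether the mismatch on the original string was due to the 
fragment or the query, which is roughly the diagnostic role suggested above 
for the flags.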

#g
--

[1] http://www.rdfweb.org/

[2] http://www-106.ibm.com/developerworks/xml/library/x-foaf.html

[3] http://www.w3.org/2001/12/rubyrdf/util/foafwhite/intro.html

[4] http://lists.w3.org/Archives/Public/uri/2002Nov/0024.html

[5] http://www.ietf.org/internet-drafts/draft-feather-hashed-uri-03.txt

[6] http://www.w3.org/DesignIssues/Fragment.html

[7] http://www.ietf.org/rfc/rfc2611.txt

-------------------
Graham Klyne
<GK@NineByNine.org>

Received on Monday, 25 November 2002 13:23:36 UTC