Re: URI Comparisons: RFC 2616 vs. RDF from Dave Reynolds on 2011-01-20 (public-lod@w3.org from January 2011)

From: Dave Reynolds <dave.e.reynolds@gmail.com>
Date: Thu, 20 Jan 2011 13:08:12 +0000
To: nathan@webr3.org
Cc: "public-lod@w3.org" <public-lod@W3.ORG>
Message-ID: <1295528892.2623.251.camel@dave-desktop>
On Wed, 2011-01-19 at 21:45 +0000, Nathan wrote: 
> David Wood wrote:
> > On Jan 19, 2011, at 10:59, Nathan wrote:
> >> ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still come through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing.
> > 
> > Heh.  OK, I'll bite.  Domain names in email addressing are defined in IETF RFC 2822 (and its predecessor RFC 822), which defers the interpretation to RFC 1035 ("Domain names - implementation and specification).  RFC 1035 section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are to be compared in a case-insensitive manner.
> > 
> > As far as I know, the W3C specs do not so refer to RFC 1035.
> 
> And I'll bite in the other direction, why not treat URIs as URIs? 

It seems to me the underlying question here is whether aliasing of URIs
(i.e. whether they dereference to the same resource) should imply
semantic equality (i.e. equality when used as identifiers in a web logic
language like RDF or OWL).

The position so far in RDF, OWL and RIF has been "no".

As far as the specifications for those languages are concerned, a URI is
"just" a convenient spelling for an identifier, and they require
comparison of identifiers to be stable and context-independent.
Those specs don't constrain what you get back from dereferencing some
URI U to include statements about U.

The URI spec (RFC 3986 [1]) does allow this usage. In particular, Section 6,
"Normalization and Comparison", says:

"""URI comparison is performed for some particular purpose.  Protocols
   or implementations that compare URIs for different purposes will
   often be subject to differing design trade-offs in regards to how
   much effort should be spent in reducing aliased identifiers.  This
   section describes various methods that may be used to compare URIs,
   the trade-offs between them, and the types of applications that might
   use them."""

and

"""We use the terms "different" and
   "equivalent" to describe the possible outcomes of such comparisons,
   but there are many application-dependent versions of equivalence."""

While RDF predates this spec it seems to me that the RDF usage remains
consistent with it. The purpose of comparison in RDF is different from
that of cache retrieval of web pages or message delivery of email.

This quote also makes clear that there is no single definitive
normalization. There are different levels of normalization possible
depending on your needs. 

Earlier you pointed out that the place where the URI specs and RDF do
collide is in resolving relative URIs into absolute URIs. Again, RFC 3986
does not preclude the RDF usage. Section 5.2.1 says:

"""Normalization of the base URI, as described in Sections 6.2.2 and
   6.2.3, is optional."""

So I claim that in terms of formal published specifications:
(1) RDF, OWL and RIF do not require any normalization of URIs (beyond
the character encoding level) and compare URIs by simple string
comparison.
(2) This usage is *not* precluded by the URI specs, at least by 3986
which sets the current framework for the application of scheme-specific
specs.
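
A minimal Python sketch of what claim (1) amounts to, assuming both URIs
are already absolute. The function name same_identifier is hypothetical,
chosen for illustration; it is not taken from the RDF specs or from any
RDF library.

```python
def same_identifier(a: str, b: str) -> bool:
    """Compare two absolute URIs as RDF identifiers: plain string equality.

    No case folding, no percent-decoding and no dereferencing take place.
    """
    return a == b


# Identical spellings name the same identifier.
print(same_identifier("http://example.org/~foo", "http://example.org/~foo"))    # True

# Aliased spellings are different identifiers, even if a web client
# would fetch the same page for both.
print(same_identifier("htTp://EXAMPLE.org/%7efoo", "http://example.org/~foo"))  # False
```

The result depends only on the two strings, so it is stable and does not
change with the URI scheme or with what the server returns.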

** Now we turn to linked data ...

As we've already mentioned :) there are no specs for linked data, so we
move on to more subjective grounds.

The linked data convention is that dereferencing some URI U in your RDF
document should return information about U, including further onward
links. So if data set A spells a URI hTTp://example.com/foo but the data
you get from dereferencing that URI talks only about
http://example.com/foo then someone has a problem somewhere. The
question is who, where and how to fix it.

It seems to me that this is primarily an issue with publishing, and a
little about being sensible about how you pass on links. If I'm going to
put up some linked data I should mint normalized URIs; I should use the
same spelling of the URIs throughout my data; I'll make sure those URIs
dereference and that the data that comes back is stable and useful. If
someone else refers to my resources using an aliased URI (such as a
different case for the protocol) and makes statements about those
aliases then they have simply made a mistake.

To make sure that dereferencing returns what I expect, independent of
aliasing, I should publish data with explicit base URIs (or just
absolute URIs). Publishing with relative URIs and no base is a recipe
for having your data look different from different places. Just don't do
it. No surprise there.
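
A small sketch of why the missing base matters. It assumes Python's
urllib.parse.urljoin as a stand-in for RFC 3986 reference resolution (an
RDF parser may differ in detail). The helper name resolve and the example
URIs are invented for illustration.

```python
from urllib.parse import urljoin


def resolve(ref, retrieval_uri, declared_base=None):
    """Resolve a relative reference found in a document.

    If the document declares a base URI, that base is used. Otherwise the
    base falls back to the URI the document was retrieved from.
    """
    base = declared_base if declared_base is not None else retrieval_uri
    return urljoin(base, ref)


# The same document, fetched through two aliased spellings of its address.
via_alias = "http://EXAMPLE.org/baz"
via_canonical = "http://example.org/baz"

# No declared base: the resulting identifier depends on the route taken.
print(resolve("#t", via_alias))
print(resolve("#t", via_canonical))

# Declared base: both routes yield the identifier the publisher intended.
base = "http://example.org/baz"
print(resolve("#t", via_alias, base) == resolve("#t", via_canonical, base))  # True
```

With an explicit base (or absolute URIs throughout) the identifiers in
the data no longer depend on how a client happened to spell the address.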

None of this requires us to force URI normalization into the heart of
identifier comparison in RDF itself. It is not a necessary solution and
it is not a sufficient one because there is no universal normalization
algorithm that would make all possible locator aliasing disappear.

> why go 
> against both the RDF Specification [1] and the URI specification when 
> they say /not/ to encode permitted US-ASCII characters (like ~ %7E)? 

Where did that example come from? 
At what point have we suggested doing that?

> why 
> force case-sensitive matching on the scheme and domain on URIs matching 
> the generic syntax when the specs say must be compared case 
> insensitively?

No, the specs do not say that, see above.

> Additionally there's a very nasty, and common, use case which I can't 
> test fully, so would appreciate people taking the time to check their 
> own libraries/clients, as follows:
> 
> If you find some data with the following setup (example):
> 
>    @base <htTp://EXAMPLE.org/foo/bar> .
>    <#t> x:rel <../baz> .
> 
> and then you "follow your nose" to <htTp://EXAMPLE.org/baz>, will you 
> find any triples about it? (problem 1)

Yes, if that is the URI the publisher intended to use and has published
data there. If he/she actually uses http://example.org/baz then whoever
gave you the original sample has corrupt data somehow. It so happens that
in that case the dereference will "work" in the sense of giving you data,
but that data will be about the correct URI, not the broken one that
you've been given.

> and if there's no base on the 
> second resource, and it uses relative URIs, then the base you'll be 
> using is <htTp://EXAMPLE.org/baz>, and thus, you'll effectively create a 
> new set of statements which the author never wrote, or intended (problem 2).

Correct. Publishing data that way would be a bad idea. That problem need
not, and can't, be solved by changing the comparison of identifiers in
RDF.

> In other words, in this scenario, no matter what you do you're either 
> going to get no data (even though it's there) or get a set of statements 
> which were never said by the author (because the casing is different).

Someone has given you an erroneous URI; you either get no data or get
data which might help you find the right URI.

> Further, essentially all RDFa ever encountered by a browser has the 
> casing on all URIs in href and src, and all these which are resolved, 
> automatically normalized - so even if you set the base to 
> <htTp://EXAMPLE.org/> or use it in a URI, browser tools, extensions, and 
> js based libraries will only ever see the normalized URIs (and thus be 
> incompatible with the rest of the RDF world).

So use normalized URIs in the first place. 

I've heard no use case for wanting to publish data about the URI
htTp://EXAMPLE.org/ at all, and if you did have one then using RDFa as
the means of doing so would be perverse.

> I'll continue on getting the specific examples for current RDF tooling 
> and resources and get it on the wiki, but I'll say now that almost every 
> tool I've encountered so far "does it wrong" in inconsistent 
> non-compatible ways.

The notion of "wrong" hasn't yet been made clear here.

> Finally, I'll ask again, if anybody has any use case which benefits from 
> <htTp://EXAMPLE.org/%7efoo> and <http://example.org/~foo> being classed 
> as different RDF URIs, I'd love to hear it.

Never heard of one. 

RDF/OWL/RIF aren't designed the way they are because someone thought it
would be a good idea to allow such things to be used side by side or
because they *want* people to use denormalized URIs.

The point is that there is no single, simple, universal (i.e. across all
schemes) normalization algorithm that could be used.
The current approach gives stable, well-defined behaviour which doesn't
change as people invent new URI schemes. The RDF serializations give you
enough control to enable you to be certain about what URI you are
talking about. Job done.
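
To illustrate the "no single algorithm" point, here is a hedged Python
sketch of just one rung of the RFC 3986 ladder: the syntax-based
normalization of Section 6.2.2 (case normalization of scheme and host,
plus percent-encoding normalization). The function names are
hypothetical, dot-segment removal is left out, and percent-encoding in
the authority is not touched; treat it as an illustration rather than a
complete implementation.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 "unreserved" characters: these should not be percent-encoded.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _normalize_pct(text):
    """Decode percent-encoded unreserved characters; uppercase other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def syntax_normalize(uri):
    """Apply part of RFC 3986 Section 6.2.2 to an absolute URI."""
    parts = urlsplit(uri)
    # Lowercase the host (and port digits) but leave any userinfo alone.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    return urlunsplit((
        parts.scheme.lower(),
        userinfo + at + hostport.lower(),
        _normalize_pct(parts.path),
        _normalize_pct(parts.query),
        _normalize_pct(parts.fragment),
    ))


# This rung collapses the case and %7e aliases ...
print(syntax_normalize("htTp://EXAMPLE.org/%7efoo") == "http://example.org/~foo")  # True

# ... but not an alias that needs scheme-specific knowledge (default port).
print(syntax_normalize("http://example.org:80/~foo") == "http://example.org/~foo")  # False
```

Removing the default port belongs to the scheme-based rung (Section
6.2.3), which needs per-scheme rules, and other aliases need knowledge of
the server itself. Wherever you stop, some aliasing remains.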

Choosing good URIs so they work nicely with your deployment and
publishing solution is an important but different topic. I certainly
think linked data publishers and publishing tools are better off minting
normalized URIs. 

Dave

[1] http://www.ietf.org/rfc/rfc3986.txt
Received on Thursday, 20 January 2011 13:08:57 UTC