Re: URI Comparisons: RFC 2616 vs. RDF

Hi Nathan,

I largely agree but have a few quibbles :)

On 20/01/2011 2:29 PM, Nathan wrote:
> Dave Reynolds wrote:
>> The URI spec (rfc3986[1]) does allow this usage. In particular Section 6
>> Normalization and Comparison says:
>>
>> """URI comparison is performed for some particular purpose. Protocols
>> or implementations that compare URIs for different purposes will
>> often be subject to differing design trade-offs in regards to how
>> much effort should be spent in reducing aliased identifiers. This
>> section describes various methods that may be used to compare URIs,
>> the trade-offs between them, and the types of applications that might
>> use them."""
>>
>> and
>>
>> """We use the terms "different" and
>> "equivalent" to describe the possible outcomes of such comparisons,
>> but there are many application-dependent versions of equivalence."""
>>
>> While RDF predates this spec it seems to me that the RDF usage remains
>> consistent with it. The purpose of comparison in RDF is different from
>> that of cache retrieval of web pages or message delivery of email.
>
> Indeed, I also read, though:
>
> For all URIs, the hexadecimal digits within a percent-encoding
> triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
> should be normalized to use uppercase letters for the digits A-F.
>
> When a URI uses components of the generic syntax, the component
> syntax equivalence rules always apply; namely, that the scheme and
> host are case-insensitive and therefore should be normalized to
> lowercase...
> - http://tools.ietf.org/html/rfc3986#section-6.2.2.1
>
> And took the "For all" and "always" to literally mean "for all" and
> "always".

Those quotes come from Section 6.2.2, which describes normalization, 
but the earlier quote is from the start of Section 6, which says that 
the choice of normalization is application dependent. I interpret the 
two together as "*if* you are normalizing then always ...blah ...".

That was certainly the RIF position, where we explicitly said that 
Sections 6.2.2 and 6.2.3 of RFC 3986 were not applicable.
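
For concreteness, here is what that case normalization rung looks 
like, as an illustrative Python sketch. The function name is mine and 
it only covers the 6.2.2.1 rules, so treat it as a sketch rather than 
a reference implementation:

    import re
    from urllib.parse import urlsplit, urlunsplit

    def case_normalize(uri):
        # RFC 3986 6.2.2.1: hex digits in %-triplets are
        # case-insensitive, so normalize them to uppercase,
        # e.g. %3a -> %3A
        uri = re.sub(r'%([0-9A-Fa-f]{2})',
                     lambda m: '%' + m.group(1).upper(), uri)
        parts = urlsplit(uri)
        netloc = parts.netloc
        # Only the host is case-insensitive; leave any userinfo alone
        if '@' in netloc:
            userinfo, host = netloc.rsplit('@', 1)
            netloc = userinfo + '@' + host.lower()
        else:
            netloc = netloc.lower()
        # The scheme is case-insensitive too; normalize to lowercase
        return urlunsplit((parts.scheme.lower(), netloc,
                           parts.path, parts.query, parts.fragment))

    # case_normalize("HTTP://Example.org/a%3ab")
    #   -> "http://example.org/a%3Ab"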

>>> against both the RDF Specification [1] and the URI specification when
>>> they say /not/ to encode permitted US-ASCII characters (like ~ %7E)?
>>
>> Where did that example come from?
>
> The encoding consists of... %-escaping octets that do not correspond
> to permitted US-ASCII characters.
> - http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref
>
> For consistency, percent-encoded octets in the ranges of ALPHA
> (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
> underscore (%5F), or tilde (%7E) should not be created by URI
> producers and, when found in a URI, should be decoded to their
> corresponding unreserved characters by URI normalizers.
> - http://tools.ietf.org/html/rfc3986#section-2.3
>
> I read those quotes as saying do not encode permitted US-ASCII
> characters in RDF URI References.
>
>> At what point have we suggested doing that?
>
> As above

Sorry, I didn't mean to dispute that you shouldn't %-encode ~; I was 
wondering where the suggestion that you should do so came from.
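
To be clear about what that step amounts to: decode the %-triplets 
for unreserved characters and nothing else. A minimal sketch, again 
with names of my own invention:

    import re

    # Unreserved characters per RFC 3986 section 2.3
    UNRESERVED = set('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                     'abcdefghijklmnopqrstuvwxyz'
                     '0123456789-._~')

    def decode_unreserved(uri):
        # Decode %-triplets that encode unreserved characters,
        # e.g. %7E -> ~ and %41 -> A; all other triplets are kept
        def repl(m):
            char = chr(int(m.group(1), 16))
            return char if char in UNRESERVED else m.group(0)
        return re.sub(r'%([0-9A-Fa-f]{2})', repl, uri)

    # decode_unreserved("http://example.org/%7Euser")
    #   -> "http://example.org/~user"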

I believe there are some corner cases, such as the handling of spaces, 
which differ between the RDF spec and the IRI spec. This was down to 
timing. The RDF Core WG was doing its best to anticipate what the IRI 
spec would look like but couldn't wait until that was finalized. 
Resolving any such small discrepancies between that anticipation and the 
actual IRI specs is something I believe to be in scope for the proposed 
new RDF WG.

>> So use normalized URIs in the first place.
> ...
>> RDF/OWL/RIF aren't designed the way they are because someone thought it
>> would be a good idea to allow such things to be used side by side or
>> because they *want* people to use denormalized URIs.
> ...
>> The point is that there is no single, simple, universal (i.e. across all
>> schemes) normalization algorithm that could be used.
>> The current approach gives stable, well-defined behaviour which doesn't
>> change as people invent new URI schemes. The RDF serializations give you
>> enough control to enable you to be certain about what URI you are
>> talking about. Job done.
>
> Okay, I agree, and I'm really not looking to create a lot of work here;
> the general gist of what I'm hoping for is along the lines of:
>
> RDF Publishers MUST perform Case Normalization and Percent-Encoding
> Normalization on all URIs prior to publishing. When using relative URIs
> publishers SHOULD include a well-defined base using a serialization
> specific mechanism. Publishers are advised to perform additional
> normalization steps as specified by URI (RFC 3986) where possible.
>
> RDF Consumers MAY normalize URIs they encounter and SHOULD perform Case
> Normalization and Percent-Encoding Normalization.
>
> Two RDF URIs are equal if and only if they compare as equal, character
> by character, as Unicode strings.

I'm sort of OK with that, but ...

Terms like "RDF Publisher" and "RDF Consumer" need to be defined in 
order to make formal statements like these. The RDF/OWL/RIF specs are 
careful to define what sort of processors are subject to conformance 
statements and I don't think RDF Publisher is a conformance point for 
the existing specs.

This may sound like nit-picking, but that's life with specifications. 
You need to be clear about how the last para about "RDF URIs" relates 
to notions like "RDF Consumer".

I wonder whether you might want to instead define notions of Linked Data 
Publisher and Linked Data Consumer to which these MUST/MAY/SHOULD 
conformance statements apply. That way it is clear that a component such 
as an RDF store or RDF parser is correct in following the existing RDF 
specs and not doing any of these transformations, but that, in order to 
construct a Linked Data Consumer/Publisher, some other component can be 
introduced to perform the normalizations: Linked Data as a set of 
constraints and conventions layered on top of the RDF/OWL specs.

The specific point on the normalization ladder would have to be 
defined, of course, and you would need to define how to handle schemes 
unknown to the consumer.
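
As a sketch of the kind of thing I mean, reusing the two helpers 
above: apply the generic-syntax rungs to every URI, and scheme-specific 
rungs only where the consumer knows the scheme; unknown schemes pass 
through with only the generic rules applied. The port table here is 
purely illustrative:

    from urllib.parse import urlsplit, urlunsplit
    # (reuses case_normalize and decode_unreserved from above)

    # Illustrative only; a real consumer would need a vetted table
    KNOWN_DEFAULT_PORTS = {'http': 80, 'https': 443}

    def consumer_normalize(uri):
        # Generic-syntax normalization applies for every scheme
        uri = decode_unreserved(case_normalize(uri))
        parts = urlsplit(uri)
        default = KNOWN_DEFAULT_PORTS.get(parts.scheme)
        # Scheme-specific normalization only where the scheme is
        # known: here, dropping an explicit default port (:80 on http)
        if default is not None and parts.port == default:
            parts = parts._replace(
                netloc=parts.netloc.rsplit(':', 1)[0])
        return urlunsplit(parts)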

All this presupposes some work to formalize and specify Linked Data. Is 
there anything like that planned? In some ways Linked Data is an 
engineering experiment and benefits from that freedom to experiment; on 
the other hand, interoperability eventually needs clear specifications.

> For many reasons it would be good to solve this at the publishing phase,
> allow normalization at the consuming phase (can't be precluded as
> intermediary components may normalize), and keep simple case sensitive
> string comparison throughout the stack and specs (so implementations
> remain simple and fast.)

Agreed.
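
And with the sketches above doing their work at the edges, the 
comparison in the middle of the stack can stay a plain 
codepoint-for-codepoint test, e.g.:

    a = consumer_normalize("HTTP://Example.org:80/%7euser")
    b = consumer_normalize("http://example.org/~user")
    assert a == b   # plain Unicode string equality, no URI-aware logic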

Dave
