Re: URI Comparisons: RFC 2616 vs. RDF from Nathan on 2011-01-20 (public-lod@w3.org from January 2011)

From: Nathan <nathan@webr3.org>
Date: Thu, 20 Jan 2011 14:29:35 +0000
To: Dave Reynolds <dave.e.reynolds@gmail.com>
CC: "public-lod@w3.org" <public-lod@W3.ORG>, Alan Ruttenberg <alanruttenberg@gmail.com>
Message-ID: <4D3846CF.9040701@webr3.org>
Hi Dave,

Generally I agree. I'll address a few specific points inline (just to 
address them), then summarize my intended goals at the end, which is 
the substance of this mail.

Dave Reynolds wrote:
> The URI spec (rfc3986[1]) does allow this usage. In particular Section 6
> Normalization and Comparison says:
> 
> """URI comparison is performed for some particular purpose.  Protocols 
> or implementations that compare URIs for different purposes will
>    often be subject to differing design trade-offs in regards to how
>    much effort should be spent in reducing aliased identifiers.  This
>    section describes various methods that may be used to compare URIs,
>    the trade-offs between them, and the types of applications that might
>    use them."""
> 
> and
> 
> """We use the terms "different" and
>    "equivalent" to describe the possible outcomes of such comparisons,
>    but there are many application-dependent versions of equivalence."""
> 
> While RDF predates this spec it seems to me that the RDF usage remains
> consistent with it. The purpose of comparison in RDF is different from
> that of cache retrieval of web pages or message delivery of email.

Indeed, though I also read:

    For all URIs, the hexadecimal digits within a percent-encoding
    triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
    should be normalized to use uppercase letters for the digits A-F.

    When a URI uses components of the generic syntax, the component
    syntax equivalence rules always apply; namely, that the scheme and
    host are case-insensitive and therefore should be normalized to
    lowercase...
    - http://tools.ietf.org/html/rfc3986#section-6.2.2.1

And took the "For all" and "always" to literally mean "for all" and 
"always".

I'm unsure where this leaves things, and which passage takes precedence.
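
For concreteness, here is a rough Python sketch of the case normalization 
those two quoted paragraphs describe. The function name and the regexes 
are my own illustration, not taken from any spec, and it assumes the URI 
follows the generic syntax (scheme-specific rules are ignored):

```python
import re

# Loose split of a URI reference into scheme, authority and the rest,
# adapted from the regex in RFC 3986 Appendix B (illustrative only).
URI_RE = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$", re.DOTALL)
PCT_RE = re.compile(r"%[0-9A-Fa-f]{2}")


def case_normalize(uri: str) -> str:
    """Lowercase scheme and host, uppercase hex digits in %XX triplets."""
    scheme, authority, rest = URI_RE.match(uri).groups()
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        # Only the host part is case-insensitive; leave any userinfo alone.
        userinfo, at, hostport = authority.rpartition("@")
        out += "//" + userinfo + at + hostport.lower()
    out += rest
    # Path, query and fragment keep their case; only the triplets change.
    return PCT_RE.sub(lambda m: m.group(0).upper(), out)


print(case_normalize("hTTp://Example.COM/Foo%3abar"))
# http://example.com/Foo%3Abar
```

Note that the path stays case-sensitive, so /Foo and /foo remain 
different under this step.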

> This quote also makes clear that there is no single definitive
> normalization. There are different levels of normalization possible
> depending on your needs. 

agree

> So I claim that in terms of formal published specifications:
> (1) RDF, OWL and RIF do not require any normalization of URIs (beyond
> the character encoding level) and compare URIs by simple string
> comparison.

There is one potential issue with percent-encoding, clarified further down.

> (2) This usage is *not* precluded by the URI specs, at least by 3986
> which sets the current framework for the application of scheme-specific
> specs.

I'm not 100% sure, but I'm tempted to agree with you; it would make 
sense not to preclude it.

> As we've already mentioned :) there are no specs for linked data so we
> move onto more subjective grounds.

Would be nice to get some specs at some point...

> The linked data convention is that dereferencing some URI U in your RDF
> document should return information about U, including further onward
> links. So if data set A spells a URI hTTp://example.com/foo but the data
> you get from dereferencing that URI talks only about
> http://example.com/foo then someone has a problem somewhere. The
> question is who, where and how to fix it.

agree, good way of putting it.

>> against both the RDF Specification [1] and the URI specification when 
>> they say /not/ to encode permitted US-ASCII characters (like ~ %7E)? 
> 
> Where did that example come from? 

    The encoding consists of... %-escaping octets that do not correspond
    to permitted US-ASCII characters.
    - http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref

    For consistency, percent-encoded octets in the ranges of ALPHA
    (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
    underscore (%5F), or tilde (%7E) should not be created by URI
    producers and, when found in a URI, should be decoded to their
    corresponding unreserved characters by URI normalizers.
    - http://tools.ietf.org/html/rfc3986#section-2.3

I read those quotes as saying do not encode permitted US-ASCII 
characters in RDF URI References.
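
As an illustration of what the RFC 3986 quote asks of a normalizer, here 
is a minimal Python sketch that decodes percent-encoded unreserved 
characters and leaves every other triplet untouched. The function name 
is hypothetical and this is only my reading of section 2.3:

```python
import re
import string

# Unreserved characters per RFC 3986 section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PCT_RE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode %XX triplets that encode unreserved characters only."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        # Reserved or non-ASCII octets must stay encoded, so return the
        # original triplet unchanged in that case.
        return char if char in UNRESERVED else match.group(0)
    return PCT_RE.sub(repl, uri)


print(decode_unreserved("http://example.com/%7Euser/%41%2Fb"))
# http://example.com/~user/A%2Fb
```

The %2F stays encoded because "/" is a reserved character, and decoding 
it would change the meaning of the path.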

> At what point have we suggested doing that?

As above

>> why 
>> force case-sensitive matching on the scheme and domain on URIs matching 
>> the generic syntax when the specs say must be compared case 
>> insensitively?
> 
> No, the specs do not say that, see above.

See "for all" and "always" quote earlier on.

> So use normalized URIs in the first place. 
...
> RDF/OWL/RIF aren't designed the way they are because someone thought it
> would be a good idea to allow such things to be used side by side or
> because they *want* people to use denormalized URIs.
...
> The point is that there is no single, simple, universal (i.e. across all
> schemes) normalization algorithm that could be used.
> The current approach gives stable, well-defined behaviour which doesn't
> change as people invent new URI schemes. The RDF serializations give you
> enough control to enable you to be certain about what URI you are
> talking about. Job done.

Okay, I agree, and I'm really not looking to create a lot of work here. 
The general gist of what I'm hoping for is along these lines:

   RDF Publishers MUST perform Case Normalization and Percent-Encoding 
Normalization on all URIs prior to publishing. When using relative URIs 
publishers SHOULD include a well defined base using a serialization 
specific mechanism. Publishers are advised to perform additional 
normalization steps as specified by URI (RFC 3986) where possible.

   RDF Consumers MAY normalize URIs they encounter and SHOULD perform 
Case Normalization and Percent-Encoding Normalization.

   Two RDF URIs are equal if and only if they compare as equal, 
character by character, as Unicode strings.

For many reasons it would be good to solve this at the publishing phase, 
allow normalization at the consuming phase (it can't be precluded, as 
intermediary components may normalize), and keep simple case-sensitive 
string comparison throughout the stack and specs (so implementations 
remain simple and fast).
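
To show how the three pieces would fit together, here is a hedged Python 
sketch: a publisher-side normalizer (Case plus Percent-Encoding 
Normalization only), a check a publisher could run before publishing, 
and the unchanged string equality. All names are hypothetical, and it 
assumes generic-syntax URIs with no scheme-specific handling:

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
URI_RE = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$", re.DOTALL)
PCT_RE = re.compile(r"%([0-9A-Fa-f]{2})")


def publish_normalize(uri: str) -> str:
    """Apply Percent-Encoding and Case Normalization (RFC 3986, 6.2.2)."""
    # 1. Decode triplets that encode unreserved characters.
    def decode(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)
    uri = PCT_RE.sub(decode, uri)
    # 2. Lowercase the scheme and the host (not userinfo, path or query).
    scheme, authority, rest = URI_RE.match(uri).groups()
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        out += "//" + userinfo + at + hostport.lower()
    out += rest
    # 3. Uppercase the hex digits of any triplets that remain.
    return PCT_RE.sub(lambda m: m.group(0).upper(), out)


def is_publishable(uri: str) -> bool:
    """True if the URI is already in the normalized form."""
    return publish_normalize(uri) == uri


def rdf_uri_equal(a: str, b: str) -> bool:
    """Equality stays a plain character-by-character comparison."""
    return a == b


raw = "hTTp://Example.com/%7efoo"
clean = "http://example.com/~foo"
print(rdf_uri_equal(raw, clean))                     # False
print(rdf_uri_equal(publish_normalize(raw), clean))  # True
print(is_publishable(raw), is_publishable(clean))    # False True
```

The comparison function never changes; the only new work is the one-off 
normalization before the data goes out.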

Does anybody find the above disagreeable?

Best, and cheers for the reply, Dave,

Nathan
Received on Thursday, 20 January 2011 14:31:45 UTC