Re: URI Comparisons: RFC 2616 vs. RDF from Renaud Delbru on 2011-01-17 (public-lod@w3.org from January 2011)

From: Renaud Delbru <renaud.delbru@deri.org>
Date: Mon, 17 Jan 2011 17:10:18 +0000
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
CC: public-lod@w3.org, Kingsley Idehen <kidehen@openlinksw.com>, dave.e.reynolds@gmail.com
Message-ID: <4D3477FA.7020900@deri.org>

Hi,

I am particularly interested in this issue, because I am currently
struggling with this very problem within the Sindice project.
Given Dave's answer as well, what would be the best practices for
handling URIs correctly within an RDF system?

Should the system implement URI normalisation based on the RFC 2616
exceptions:

       - A port that is empty or not given is equivalent to the default
         port for that URI-reference;
       - Comparisons of host names MUST be case-insensitive;
       - Comparisons of scheme names MUST be case-insensitive;
       - An empty abs_path is equivalent to an abs_path of "/".

and should it also decode all percent-encoded characters?
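
To make the question concrete, here is a minimal sketch in Python of what
normalising on the four exceptions above could look like. The function
name normalise_uri and the DEFAULT_PORTS table are hypothetical, chosen
for illustration only; percent-encoding is deliberately left untouched,
so "%7E" and "~" would still compare as different after this step.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these schemes have a known default port in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalise_uri(uri):
    """Apply the four RFC 2616 comparison exceptions to a URI string."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()          # scheme names are case-insensitive
    host = parts.hostname or ""            # urlsplit lower-cases the host name
    if ":" in host:                        # restore brackets of IPv6 literals
        host = "[" + host + "]"
    port = parts.port                      # None when empty or not given
    if port is not None and port == DEFAULT_PORTS.get(scheme):
        port = None                        # default port is equivalent to none
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + host
    if port is not None:
        netloc += ":" + str(port)
    # An empty abs_path is equivalent to "/" (only when there is an authority).
    path = parts.path or ("/" if netloc else "")
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

With this sketch, "HTTP://PURL.org:80/NET/c4dm/event.owl#Event" and
"http://purl.org/NET/c4dm/event.owl#Event" normalise to the same string,
while the case of the path and of the fragment is preserved.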

However, when dealing with percent-encoded characters, some cases become
tricky to handle. For example, some URIs [1] have an encoded space at the
end of the string. Once it is decoded, certain systems/applications could
automatically trim it. Also, some URIs [2] are 'recursively' encoded,
and need multiple decoding passes before yielding the right one.

[1] http://geo.linkeddata.es/resource/Pozo/Moro%2C%20Pou%2047%20o%20del%20
[2] http://sioc-project.org/sioc/user/1%2523user
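
The two cases can be reproduced with a short Python sketch. It first
shows what a blanket decode does to [1] and [2], then a more conservative
alternative that decodes only characters outside the "reserved" and
"unsafe" sets. The helper name decode_unreserved is hypothetical, and the
choice of the RFC 3986 "unreserved" set (letters, digits, "-", ".", "_",
"~") as the set of safely decodable characters is an assumption of this
sketch, not something RFC 2616 prescribes in these terms.

```python
import re
import string
from urllib.parse import unquote

uri1 = "http://geo.linkeddata.es/resource/Pozo/Moro%2C%20Pou%2047%20o%20del%20"
uri2 = "http://sioc-project.org/sioc/user/1%2523user"

# Blanket decoding: [1] now ends with a literal space that a later
# trim/strip step would silently remove.
decoded1 = unquote(uri1)
trimmed1 = decoded1.strip()

# Blanket decoding: each pass over [2] yields a different string, and the
# second pass turns "%23" into "#", which starts a fragment.
once = unquote(uri2)
twice = unquote(once)

# Assumption: the RFC 3986 "unreserved" characters are the safe ones to decode.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri):
    """Decode %XX only for unreserved characters; upper-case other escapes."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return _PCT.sub(repl, uri)
```

Under these assumptions, decode_unreserved leaves both [1] and [2]
unchanged, because "%20", "%2C" and "%25" do not encode unreserved
characters, while "%7e" in a path would become "~". A single pass is
enough, since the substitution never re-scans its own output.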

Any opinions on how to correctly handle URIs are welcome. It would be
useful to have a "best practices" document for correctly handling
URIs in an RDF system.

Best,
-- 
Renaud Delbru

On 17/01/11 15:51, Martin Hepp wrote:
> Dear all:
>
> RFC 2616 [1, section 3.2.3] says that
>
> "When comparing two URIs to decide if they match or not, a client  
> SHOULD use a case-sensitive octet-by-octet comparison of the entire
>    URIs, with these exceptions:
>
>       - A port that is empty or not given is equivalent to the default
>         port for that URI-reference;
>       - Comparisons of host names MUST be case-insensitive;
>       - Comparisons of scheme names MUST be case-insensitive;
>       - An empty abs_path is equivalent to an abs_path of "/".
>
>    Characters other than those in the "reserved" and "unsafe" sets (see
>    RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
>
>    For example, the following three URIs are equivalent:
>
>       http://abc.com:80/~smith/home.html
>       http://ABC.com/%7Esmith/home.html
>       http://ABC.com:/%7esmith/home.html
> "
>
> Does this also hold for identifying RDF resources
>
> a) in theory and
> b) in practice (e.g. in popular triplestores)?
>
> I did not test it yet, but I assume that not all implementations would 
> treat
>
>    http://purl.org/NET/c4dm/event.owl#Event
>    HTTP://purl.org/NET/c4dm/event.owl#Event
>    http://PURL.org/NET/c4dm/event.owl#Event
>    http://purl.org:80/NET/c4dm/event.owl#Event
>
> as the same class.
>
> Any facts or opinions?
>
> Best
>
> Martin
>
>
> [1] http://www.ietf.org/rfc/rfc2616.txt
>
> --------------------------------------------------------
> martin hepp
> e-business & web science research group
> universitaet der bundeswehr muenchen
>
> e-mail:  hepp@ebusiness-unibw.org
> phone:   +49-(0)89-6004-4217
> fax:     +49-(0)89-6004-4620
> www:     http://www.unibw.de/ebusiness/ (group)
>          http://www.heppnetz.de/ (personal)
> skype:   mfhepp
> twitter: mfhepp
>
>

Received on Monday, 17 January 2011 17:10:52 UTC