Re: URI Comparisons: RFC 2616 vs. RDF from Christopher Gutteridge on 2011-01-17 (public-lod@w3.org from January 2011)

From: Christopher Gutteridge <cjg@ecs.soton.ac.uk>
Date: Mon, 17 Jan 2011 17:37:33 +0000
To: nathan@webr3.org
CC: Kingsley Idehen <kidehen@openlinksw.com>, Martin Hepp <martin.hepp@ebusiness-unibw.org>, public-lod@w3.org, Sandro Hawke <sandro@w3.org>
Message-ID: <EMEW3|a9e7831efcb9c5c322a1bc1fe9be76den0GHbb03cjg|ecs.soton.ac.uk|4D347E5D.7070>

In the short term, it sounds like there's a gap in the code ecosystem 
for a really lightweight tool which takes a stream of N-Triples and just 
outputs a normalised stream of N-Triples ready for import. The examples 
below would make a good initial test set for it. I'd write it if I 
didn't have a bunch of code-bunnies biting my ankles and demanding to be 
created.
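
For what it's worth, here is a rough sketch of what such a filter might 
look like in Python. It is only an illustration, not a finished tool: the 
function names (normalise_uri, normalise_line, normalise_stream) are made 
up for this example, it assumes the rules wanted are the syntax-based 
ones from RFC 3986 section 6.2.2 plus dropping a default or empty port 
and turning an empty path into "/", and it only knows the default ports 
for http and https. IRIs inside string literals and comment lines are 
left alone.

```python
import re

# RFC 3986 Appendix B splitting regex: scheme, authority, path, query, fragment.
URI_RE = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$")
PCT_RE = re.compile(r"%([0-9A-Fa-f]{2})")
# Matches either a quoted literal (skipped) or an <IRI> (group 1).
TOKEN_RE = re.compile(r'"(?:[^"\\]|\\.)*"|<([^<>"\s]*)>')

UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
                 "0123456789-._~")
DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumption: only these two


def fix_escapes(text):
    """Decode %XX for unreserved characters; upper-case the hex otherwise."""
    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else "%" + m.group(1).upper()
    return PCT_RE.sub(repl, text)


def remove_dot_segments(path):
    """Resolve "." and ".." segments in an absolute path."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")
        elif seg == "..":
            if len(out) > 1:
                out.pop()
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def normalise_uri(uri):
    m = URI_RE.match(uri)
    if m is None:
        return uri
    scheme, authority, path, query, fragment = m.groups()
    if scheme is None or authority is None:
        return uri  # not a hierarchical absolute URI; leave untouched
    scheme = scheme.lower()
    userinfo, at, hostport = authority.rpartition("@")
    if ":" in hostport and not hostport.endswith("]"):
        host, _, port = hostport.rpartition(":")
    else:
        host, port = hostport, ""
    if port == DEFAULT_PORTS.get(scheme):
        port = ""  # a default port is the same as no port
    authority = (fix_escapes(userinfo) + at + fix_escapes(host.lower())
                 + (":" + port if port else ""))
    path = remove_dot_segments(fix_escapes(path)) or "/"
    result = scheme + "://" + authority + path
    if query is not None:
        result += "?" + fix_escapes(query)
    if fragment is not None:
        result += "#" + fix_escapes(fragment)
    return result


def normalise_line(line):
    if line.lstrip().startswith("#"):
        return line  # comment line
    def repl(m):
        if m.group(1) is None:
            return m.group(0)  # string literal, keep as-is
        return "<" + normalise_uri(m.group(1)) + ">"
    return TOKEN_RE.sub(repl, line)


def normalise_stream(lines):
    for line in lines:
        yield normalise_line(line)
```

To run it as a stdin-to-stdout filter, something along the lines of 
"import sys; sys.stdout.writelines(normalise_stream(sys.stdin))" should 
be enough. Working line by line keeps memory use flat, which is the 
point of doing this on N-Triples rather than a richer syntax.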


As for triple stores: I know that the number of triples-per-second on 
import can be important, so if you already know your data is clean 
you'd want to at least make normalise-on-input optional to improve 
performance.

On 17/01/11 16:57, Nathan wrote:
> Kingsley Idehen wrote:
>> On 1/17/11 10:51 AM, Martin Hepp wrote:
>>> Dear all:
>>>
>>> RFC 2616 [1, section 3.2.3] says that
>>>
>>> "When comparing two URIs to decide if they match or not, a client  
>>> SHOULD use a case-sensitive octet-by-octet comparison of the entire
>>>    URIs, with these exceptions:
>>>
>>>       - A port that is empty or not given is equivalent to the default
>>>         port for that URI-reference;
>>>       - Comparisons of host names MUST be case-insensitive;
>>>       - Comparisons of scheme names MUST be case-insensitive;
>>>       - An empty abs_path is equivalent to an abs_path of "/".
>>>
>>>    Characters other than those in the "reserved" and "unsafe" sets (see
>>>    RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
>>>
>>>    For example, the following three URIs are equivalent:
>>>
>>>       http://abc.com:80/~smith/home.html
>>>       http://ABC.com/%7Esmith/home.html
>>>       http://ABC.com:/%7esmith/home.html
>>> "
>>>
>>> Does this also hold for identifying RDF resources
>>
>> Yes, where an RDF resource is a Data Container at an Address (URL). 
>> Thus, equivalent results for de-referencing a URL en route to 
>> accessing data.
>>
>> No, when "resource" also implies an Entity (Data Item or Data Object) 
>> that is assigned a Name via URI.
>
> Logically, yes on both counts, we should/could be normalizing these 
> URIs as we consume and publish using the syntax based normalization 
> rules [1] which apply to all URI/IRIs with the generic syntax (such as 
> the examples above)
>
> Any client consuming data, or server publishing data, can use the 
> normalization rules, so it stands to reason that it's pretty important 
> that we all do it to avoid false negatives.
>
> [1] http://tools.ietf.org/html/rfc3986#section-6.2.2
>
> Best,
>
> Nathan
>

-- 
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

/ Lead Developer, EPrints Project, http://eprints.org/
/ Web Projects Manager, ECS, University of Southampton, http://www.ecs.soton.ac.uk/
/ Webmaster, Web Science Trust, http://www.webscience.org/

Received on Monday, 17 January 2011 17:38:12 UTC