- From: Christopher Gutteridge <cjg@ecs.soton.ac.uk>
- Date: Mon, 17 Jan 2011 17:37:33 +0000
- To: nathan@webr3.org
- CC: Kingsley Idehen <kidehen@openlinksw.com>, Martin Hepp <martin.hepp@ebusiness-unibw.org>, public-lod@w3.org, Sandro Hawke <sandro@w3.org>
In the short term, it sounds like there's a gap in the code ecosystem for a really lightweight tool that takes a stream of N-Triples and just outputs a normalised stream of N-Triples ready for import. The examples below would make a good initial test set for it. I'd write it if I didn't have a bunch of code-bunnies biting my ankles and demanding to be created.

As for triple stores: I know that the number of triples per second on import can be important, so if you already know your data is clean you'd want to at least make normalise-on-input optional to improve performance.

On 17/01/11 16:57, Nathan wrote:
> Kingsley Idehen wrote:
>> On 1/17/11 10:51 AM, Martin Hepp wrote:
>>> Dear all:
>>>
>>> RFC 2616 [1, section 3.2.3] says that
>>>
>>> "When comparing two URIs to decide if they match or not, a client
>>> SHOULD use a case-sensitive octet-by-octet comparison of the entire
>>> URIs, with these exceptions:
>>>
>>> - A port that is empty or not given is equivalent to the default
>>> port for that URI-reference;
>>> - Comparisons of host names MUST be case-insensitive;
>>> - Comparisons of scheme names MUST be case-insensitive;
>>> - An empty abs_path is equivalent to an abs_path of "/".
>>>
>>> Characters other than those in the "reserved" and "unsafe" sets (see
>>> RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
>>>
>>> For example, the following three URIs are equivalent:
>>>
>>> http://abc.com:80/~smith/home.html
>>> http://ABC.com/%7Esmith/home.html
>>> http://ABC.com:/%7esmith/home.html
>>> "
>>>
>>> Does this also hold for identifying RDF resources
>>
>> Yes, where an RDF resource is a Data Container at an Address (URL).
>> Thus, equivalent results for de-referencing a URL en route to
>> accessing data.
>>
>> No, when "resource" also implies an Entity (Data Item or Data Object)
>> that is assigned a Name via URI.
>
> Logically, yes on both counts, we should/could be normalizing these
> URIs as we consume and publish using the syntax based normalization
> rules [1] which apply to all URI/IRIs with the generic syntax (such as
> the examples above)
>
> Any client consuming data, or server publishing data, can use the
> normalization rules, so it stands to reason that it's pretty important
> that we all do it to avoid false negatives.
>
> [1] http://tools.ietf.org/html/rfc3986#section-6.2.2
>
> Best,
>
> Nathan
>

--
Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248

/ Lead Developer, EPrints Project, http://eprints.org/
/ Web Projects Manager, ECS, University of Southampton, http://www.ecs.soton.ac.uk/
/ Webmaster, Web Science Trust, http://www.webscience.org/
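For anyone who wants a starting point, here is a minimal sketch in Python of the N-Triples normaliser described at the top of this message. It is not an existing tool. The names `normalise_uri`, `normalise_line` and `normalise_stream` are made up for illustration. It applies the syntax-based rules of RFC 3986 section 6.2.2 (lower-case scheme and host, upper-case percent-escapes, decode escaped unreserved characters, remove dot segments), and it additionally assumes that dropping an empty or default port and turning an empty path into "/" is wanted for http and https, since the RFC 2616 examples need that. It assumes one triple per line, skips whole-line comments, and does not interpret `\u` escapes inside IRIs or comments trailing a triple.

```python
import re

# RFC 3986 Appendix B: scheme, authority, path, query, fragment.
URI_RE = re.compile(
    r"(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?")
PCT_RE = re.compile(r"%([0-9A-Fa-f]{2})")
# Match a quoted literal (left untouched) or an <IRI> (captured in group 1).
TOKEN_RE = re.compile(r'"(?:[^"\\]|\\.)*"|<([^<>"\s]*)>')
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumption: only these two


def _fix_pct(match):
    """Decode an escaped unreserved character, else upper-case the escape."""
    ch = chr(int(match.group(1), 16))
    return ch if ch in UNRESERVED else match.group(0).upper()


def _pct(text):
    return PCT_RE.sub(_fix_pct, text)


def remove_dot_segments(path):
    """Remove "." and ".." segments as in RFC 3986 section 5.2.4."""
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./") or path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            end = path.find("/", 1)
            end = len(path) if end == -1 else end
            out.append(path[:end])
            path = path[end:]
    return "".join(out)


def _normalise_authority(scheme, authority):
    userinfo, at, hostport = authority.rpartition("@")
    if hostport.startswith("["):  # IPv6 literal such as [::1]:8080
        host, _, rest = hostport.partition("]")
        host, port = host + "]", rest[1:]
    else:
        host, _, port = hostport.partition(":")
    # Lower-case the host, then restore upper-case hex in remaining escapes.
    host = PCT_RE.sub(lambda m: m.group(0).upper(), _pct(host).lower())
    if port and port != DEFAULT_PORTS.get(scheme):
        host += ":" + port
    return _pct(userinfo) + at + host


def normalise_uri(uri):
    m = URI_RE.fullmatch(uri)
    if m is None or m.group(1) is None:
        return uri  # not an absolute URI: leave it alone
    scheme, authority, path, query, fragment = m.groups()
    scheme = scheme.lower()
    out = scheme + ":"
    if authority is not None:
        out += "//" + _normalise_authority(scheme, authority)
        if path == "" and scheme in DEFAULT_PORTS:
            path = "/"
    path = _pct(path)
    out += remove_dot_segments(path) if path.startswith("/") else path
    if query is not None:
        out += "?" + _pct(query)
    if fragment is not None:
        out += "#" + _pct(fragment)
    return out


def normalise_line(line):
    """Normalise every <IRI> on one N-Triples line, skipping literals."""
    def repl(m):
        return m.group(0) if m.group(1) is None else "<%s>" % normalise_uri(m.group(1))
    return TOKEN_RE.sub(repl, line)


def normalise_stream(lines):
    for line in lines:
        yield line if line.lstrip().startswith("#") else normalise_line(line)
```

Used as a filter it would be something like `sys.stdout.writelines(normalise_stream(sys.stdin))`. With these rules all three URIs from the RFC 2616 example come out as `http://abc.com/~smith/home.html`. Whether a store should rewrite names like this on import is exactly the open question in this thread, so treat it as a test harness for the idea rather than a recommendation.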
Received on Monday, 17 January 2011 17:38:12 UTC