Re: %-Encoded (and Non-%-Encoded) URIs in SPARQL Queries from Alan Ruttenberg on 2007-07-09 (semantic-web@w3.org from July 2007)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Mon, 9 Jul 2007 00:05:44 -0400
To: T.Heath <T.Heath@open.ac.uk>
Cc: <semantic-web@w3.org>, <rdfapi-php-interest@lists.sourceforge.net>, <jena-dev@groups.yahoo.com>
Message-Id: <648BAD39-CEEC-4C7A-9098-0551B2FF505F@gmail.com>

I think it's a bug in the implementations.  I base this on
http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref

> 6.4 RDF URI References
>
> A URI reference within an RDF graph (an RDF URI reference) is a
> Unicode string [UNICODE] that:
>
>  * does not contain any control characters (#x00 - #x1F, #x7F - #x9F)
>  * and would produce a valid URI character sequence (per RFC2396
>    [URI], sections 2.1) representing an absolute URI with optional
>    fragment identifier when subjected to the encoding described below.
>
> The encoding consists of:
>
>  1. encoding the Unicode string as UTF-8 [RFC-2279], giving a sequence
>     of octet values.
>  2. %-escaping octets that do not correspond to permitted US-ASCII
>     characters.
>
> The disallowed octets that must be %-escaped include all those that
> do not correspond to US-ASCII characters, and the excluded
> characters listed in Section 2.4 of [URI], except for the number
> sign (#), percent sign (%), and the square bracket characters
> re-allowed in [RFC-2732].
>
> Disallowed octets must be escaped with the URI escaping mechanism  
> (that is, converted to %HH, where HH is the 2-digit hexadecimal  
> numeral corresponding to the octet value).
>
> Two RDF URI references are equal if and only if they compare as  
> equal, character by character, as Unicode strings.
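
A rough Python sketch of the two rules quoted above (the encoding step and the equality test), applied to the two example URIs from Tom's message below. It is not part of the spec text. The names encode_rdf_uri_ref, same_rdf_uri_ref and SAFE are made up for illustration, and SAFE is only one reading of which US-ASCII characters RFC 2396 permits, plus '#', '%' and the square brackets that the text re-allows.

```python
from urllib.parse import quote

# Assumption: characters left unescaped are ASCII letters and digits
# (always kept by quote()), the RFC 2396 reserved and mark characters,
# and '#', '%', '[' and ']' as re-allowed by the quoted text.
SAFE = ";/?:@&=+$,-_.!~*'()#%[]"


def encode_rdf_uri_ref(ref: str) -> str:
    """Encode as UTF-8, then %-escape every octet not listed as permitted."""
    return quote(ref, safe=SAFE, encoding="utf-8")


def same_rdf_uri_ref(a: str, b: str) -> bool:
    """Equality as quoted: character by character, as Unicode strings."""
    return a == b


graph1_uri = "http://some.example/example,first"
graph2_uri = "http://some.example/example%2Cfirst"

# Show what the encoding step does to each reference, then compare them.
print(encode_rdf_uri_ref(graph1_uri))
print(encode_rdf_uri_ref(graph2_uri))
print(same_rdf_uri_ref(graph1_uri, graph2_uri))
```

Under this reading the comma counts as a permitted character and is left alone by the encoding, and the existing %2C is not touched either, so the two references remain different strings and the character-by-character test reports them as unequal.
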

-Alan

On Jul 8, 2007, at 9:36 AM, T.Heath wrote:

>
> Hi all,
>
> I've come across an issue with SPARQL queries over graphs in which
> URIs vary in their use of %-encoding, and hope members of this list
> may be able to help out...
>
> Imagine you have two RDF graphs that reference the same URIs, except
> that in one graph special characters in the URIs are %-encoded, and in
> the second they are not. For example:
>
> <http://some.example/example,first> in graph1 vs.
> <http://some.example/example%2Cfirst> in graph2
>
> As far as I understand it (although I may be wrong) both these URIs
> are the "same", despite their different syntactic form. However, when
> performing SPARQL queries over the merge of the two graphs these two
> URIs are not treated as the same, therefore making joins of the data
> impossible (without pre-processing). I noticed this behaviour first in
> RAP, but we've been able to replicate the effect in Jena also.
>
> So, my question is: is this a bug in RAP, Jena, and presumably other
> frameworks, or are there cases in which this is actually the desired
> behaviour (i.e. it's a feature not a bug)? If the latter is true, does
> this suggest that as a community we need a convention that we will
> always mint and use URIs in which special characters are %-encoded (or the
> other way around) in order to avoid this kind of situation?
>
> Any thoughts/pointers/enlightenment much appreciated,
>
> Cheers,
>
> Tom.
>
> P.S. FWIW the DBpedia community has recently settled on always using
> %-encoded URIs.
>
> -- 
> Tom Heath
> PhD Student
> Knowledge Media Institute
> The Open University
> Walton Hall
> Milton Keynes
> MK7 6AA
> United Kingdom
>
> Tel: +44 (0)1908 653565
> Fax: +44 (0)1908 653169
> Web/URI: http://kmi.open.ac.uk/people/tom/
> Jabber: t.heath%open.ac.uk@buddyspace.org
>

Received on Monday, 9 July 2007 04:05:52 UTC