Re: %-Encoded (and Non-%-Encoded) URIs in SPARQL Queries from Richard Newman on 2007-07-08 (semantic-web@w3.org from July 2007)

From: Richard Newman <rnewman@twinql.com>
Date: Sun, 8 Jul 2007 11:58:15 -0700
To: "T.Heath" <T.Heath@open.ac.uk>
Cc: <semantic-web@w3.org>, <rdfapi-php-interest@lists.sourceforge.net>, <jena-dev@groups.yahoo.com>
Message-Id: <79F3A45B-2998-41B0-AC81-73ED13F1A495@twinql.com>

Hi Tom,

On  8 Jul 2007, at 6:36 AM, T.Heath wrote:

> As far as I understand it (although I may be wrong) both these URIs are
> the "same", despite their different syntactic form.

It's the RDF store's job to identify them as the same resource (**if  
they really are**), not really the SPARQL engine's. (The distinction  
is arbitrary, of course.)

Most RDF stores choose efficiency over correctness, because  
parsing and normalizing every resource (and the same goes for  
language tags, which are case-insensitive) is expensive.

Also, reserved characters (and ',' is one) are reserved precisely  
because their meaning changes under escaping:

>> Thus, characters in the reserved set are protected from  
>> normalization and are therefore safe to be used by scheme-specific  
>> and producer-specific algorithms for delimiting data subcomponents  
>> within a URI.

RFC3986 seems to think that commas are valid unescaped in a path  
segment, so in this case the unescaped form is right.
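To make the comma case concrete, here is a minimal sketch of the relevant character classes from RFC 3986 (sections 2.2, 2.3 and 3.3). The helper names are made up for illustration and are not part of any library; the sketch only checks single literal characters, not full percent-encoded sequences.

```python
import string

# Character classes as defined in RFC 3986, sections 2.2 and 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
GEN_DELIMS = set(":/?#[]@")

# Section 3.3: pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# (pct-encoded triplets are not handled here, only literal characters).
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | {":", "@"}


def is_reserved(ch: str) -> bool:
    """Hypothetical helper: True if ch is in the RFC 3986 reserved set."""
    return ch in SUB_DELIMS or ch in GEN_DELIMS


def allowed_unescaped_in_segment(ch: str) -> bool:
    """Hypothetical helper: True if ch may appear literally in a path segment."""
    return ch in PCHAR_LITERALS


# A comma is reserved (a sub-delim), yet legal unescaped inside a path segment.
print(is_reserved(","), allowed_unescaped_in_segment(","))
```

So "," and "%2C" in a path are both syntactically valid, but because "," is reserved, a normalizer is not supposed to convert one into the other.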

> So, my question is: is this a bug in RAP, Jena, and presumably other
> frameworks, or are there cases in which this is actually the desired
> behaviour (i.e. it's a feature not a bug)?

It sounds more like a tradeoff than a bug. (I'm sure there are some  
implementations that do the right thing, but they'll be slower than  
the rest.)

> If the latter is true, does
> this suggest that as a community we need a convention that we will
> always mint and use URIs in which specialchars are %-encoded (or the
> other way around) in order to avoid this kind of situation?

All URIs you generate and use should be valid and normalized: that  
means escaping special characters where they are meant to be escaped.  
There should only be one acceptable encoding for a given URI. See  
sections 2.4 and 6.2 of RFC3986.

>> Many URI include components consisting of or delimited by, certain
>> special characters. These characters are called "reserved", since
>> their usage within the URI component is limited to their reserved
>> purpose. If the data for a URI component would conflict with the
>> reserved purpose, then the conflicting data must be escaped before
>> forming the URI.

I highly doubt that all of your data sources agree on using %7F  
instead of %7f, or on which characters to escape, though. *sigh*
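For the percent-encoding part of normalization, a rough sketch of what RFC 3986 sections 6.2.2.1 and 6.2.2.2 describe might look like the following: uppercase the hex digits of every escape, and decode only escapes that stand for unreserved characters. The function name `normalize_pct` is an assumption for illustration; this does not cover the other normalization steps (scheme and host case, dot-segment removal, and so on).

```python
import re
import string

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Matches one percent-encoded octet, e.g. "%7f" or "%2C".
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_pct(uri: str) -> str:
    """Hypothetical helper: normalize percent-encoded octets in a URI string."""

    def fix(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            # Escapes of unreserved characters are decoded (section 6.2.2.2).
            return ch
        # Everything else stays escaped, with uppercase hex (section 6.2.2.1).
        return "%" + match.group(1).upper()

    return _PCT.sub(fix, uri)


print(normalize_pct("http://example.org/a%7fb"))    # hex digits uppercased
print(normalize_pct("http://example.org/%7Euser"))  # "~" is unreserved, decoded
print(normalize_pct("http://example.org/a%2cb"))    # "," is reserved, stays escaped
```

Note that this deliberately leaves escaped reserved characters alone, so "a%2Cb" and "a,b" remain distinct after normalization.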

FWIW, you could hack around it in SPARQL (pseudocode):

   ?x a Person .
   ?y knows jim .
   FILTER ( ag:uridecode-string(str(?x)) = ag:uridecode-string(str(?y)) )

... stripping out the encoding before comparing URIs. This could  
produce unintended side effects.
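One such side effect, sketched in Python with the standard library's `urllib.parse.unquote`: decoding everything also decodes escaped reserved characters, so two URIs that differ in structure can compare equal. The example URIs and the `decoded_equal` helper are made up for illustration.

```python
from urllib.parse import unquote


def decoded_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two URI strings after fully decoding both."""
    return unquote(a) == unquote(b)


# One path segment named "a/b" versus two segments "a" and "b".
one_segment = "http://example.org/files/a%2Fb"
two_segments = "http://example.org/files/a/b"

print(one_segment == two_segments)               # False: different URIs
print(decoded_equal(one_segment, two_segments))  # True: the decode hack conflates them
```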

> P.S. FWIW the Dbpedia community has recently settled on always using
> %-encoded URIs.

I hope you only encode the correct parts! :)

HTH,

-R

Received on Monday, 9 July 2007 04:31:19 UTC