RE: %-Encoded (and Non-%-Encoded) URIs in SPARQL Queries from Seaborne, Andy on 2007-07-09 (semantic-web@w3.org from July 2007)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Mon, 9 Jul 2007 11:55:10 +0100
To: "T.Heath" <T.Heath@open.ac.uk>, <semantic-web@w3.org>
Cc: <rdfapi-php-interest@lists.sourceforge.net>, <jena-dev@groups.yahoo.com>
Message-ID: <86FE9B2B91ADD04095335314BE6906E801474BD8@sdcexc04.emea.cpqcorp.net>

-------- Original Message --------
> From: T.Heath <>
> Date: 8 July 2007 14:36
> 
> Hi all,
> 
> I've come across an issue with SPARQL queries over graphs in which URIs
> vary in their use of %-encoding, and hope members of this list may be
> able to help out...
> 
> Imagine you have two RDF graphs that reference the same URIs, except
> that in one graph special characters in the URIs are %-encoded, and in
> the second they are not. For example:  
> 
> <http://some.example/example,first> in graph1 vs.
> <http://some.example/example%2Cfirst> in graph2
> 
> As far as I understand it (although I may be wrong) both these URIs are
> the "same", despite their different syntactic form. However, when
> performing SPARQL queries over the merge of the two graphs these two
> URIs are not treated as the same, therefore making joins of the data
> impossible (without pre-processing). I noticed this behaviour first in
> RAP, but we've been able to replicate the effect in Jena also.     
> 
> So, my question is: is this a bug in RAP, Jena, and presumably other
> frameworks, or are there cases in which this is actually the desired
> behaviour (i.e. it's a feature not a bug)? If the latter is true, does
> this suggest that as a community we need a convention that we will
> always mint and use URIs in which specialchars are %-encoded (or the
> other way around) in order to avoid this kind of situation?     
> 
> Any thoughts/pointers/enlightenment much appreciated,


There is a difference between an escape mechanism and an encoding
mechanism.  %2C is not a way to escape a comma into a URI - it's a way
of encoding the information.  The difference is whether the URI really
contains "," (escaping) or whether it really contains "%2C" (encoding).
In the case of URIs, it's an encoding scheme and the URI really does
contain the "%2C", and not ",".

For example: in a programming language that uses \n as the escape for
newline, the string "\n" contains a single newline character and is of
length 1, not 2.
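
A minimal Python sketch of that contrast (the variable names are made
up for illustration; the URI is the one from the question):

```python
# Escaping: the two source characters \ and n denote ONE character
# in the resulting string.
newline = "\n"
escape_length = len(newline)

# Encoding: the URI string itself contains the three characters
# "%", "2" and "C"; nothing is substituted when the string is read.
uri = "http://some.example/example%2Cfirst"
contains_percent_2c = "%2C" in uri
contains_comma = "," in uri

print(escape_length, contains_percent_2c, contains_comma)
```

Here the escaped string has length 1, while the URI string contains
"%2C" and no comma at all.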

RFC 3986 gives advice on when to encode (sec 2.4): when the URI is
assembled from its subcomponents into the URI character string, with
decoding done in reverse at the other end.  But while it's a URI
character string, it is just a sequence of characters without
interpretation of %-encoding.
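
As a rough illustration of that life cycle, the sketch below uses
Python's urllib.parse (my choice for the example, not something
RFC 3986 prescribes) to encode a path segment when assembling a URI
and to decode it again when taking the URI apart.  The host and
segment come from the example in the question; the variable names are
hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit

# Producer side: encode the subcomponent while building the URI string.
# With safe="", quote() percent-encodes everything outside the
# unreserved set, so "," becomes "%2C".
segment = "example,first"
uri = "http://some.example/" + quote(segment, safe="")

# In between, the URI is just a character string; nothing here
# interprets the "%2C".

# Consumer side: split the URI into components, then decode.
# urlsplit() itself does not decode; unquote() does.
path = urlsplit(uri).path
decoded_segment = unquote(path.lstrip("/"))

print(uri)
print(decoded_segment)
```

The encoding and decoding happen only at the two ends, where the
subcomponents are known.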

For RDF, which is not constructing URIs from sub-components, the URI is
the character string.  RDF should not change it (apply %-rules) to do
comparisons.

So my understanding of:
"""
Two RDF URI references are equal if and only if they compare as equal,
character by character, as Unicode strings.
"""
is that it means comparing the strings as given, not applying %-decoding.

http://some.example/example,first is not the same as
http://some.example/example%2Cfirst.
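
A small Python sketch of the two comparisons, assuming the URIs are
held as ordinary Unicode strings (the variable names are illustrative,
and unquote() stands in for the %-decoding step that the question
expected but that RDF does not apply):

```python
from urllib.parse import unquote

a = "http://some.example/example,first"
b = "http://some.example/example%2Cfirst"

# RDF URI reference equality: character by character, as given.
equal_as_rdf_terms = (a == b)

# What the question expected: compare after %-decoding.
# This is NOT what RDF (and hence a SPARQL join) does.
equal_after_decoding = (unquote(a) == unquote(b))

print(equal_as_rdf_terms, equal_after_decoding)
```

The first comparison is false and the second is true, which is why the
join over the merged graphs finds no match unless the data is
pre-processed to one convention.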

	Andy


> 
> Cheers,
> 
> Tom.
> 
> P.S. FWIW the Dbpedia community has recently settled on always using
> %-encoded URIs. 
> 
> --
> Tom Heath
> PhD Student
> Knowledge Media Institute
> The Open University
> Walton Hall
> Milton Keynes
> MK7 6AA
> United Kingdom
> 
> Tel: +44 (0)1908 653565
> Fax: +44 (0)1908 653169
> Web/URI: http://kmi.open.ac.uk/people/tom/
> Jabber: t.heath%open.ac.uk@buddyspace.org

Received on Monday, 9 July 2007 10:55:23 UTC