Re: %-Encoded (and Non-%-Encoded) URIs in SPARQL Queries from Tim Berners-Lee on 2007-07-09 (semantic-web@w3.org from July 2007)

From: Tim Berners-Lee <timbl@w3.org>
Date: Mon, 9 Jul 2007 13:07:44 -0400
To: "Seaborne, Andy" <andy.seaborne@hp.com>
Cc: "T.Heath" <T.Heath@open.ac.uk>, <semantic-web@w3.org>, <rdfapi-php-interest@lists.sourceforge.net>, <jena-dev@groups.yahoo.com>
Message-Id: <69499754-6B11-4E6A-AA80-67D80E98B473@w3.org>
I take more or less the opposite view:  It is a Good Thing for systems
to canonicalize URIs on input, of both data and query.  I know RDF
does not specify this.  However, the URI spec gives one the ability
to conclude that the URIs are equivalent.

There are several levels of canonicalization you can do.
There was a TAG issue about this
http://www.w3.org/2001/tag/issues.html#URIEquivalence-15
"When are two URI variants considered equivalent?"
A draft finding "How to Compare Uniform Resource Identifiers"
http://www.textuality.com/tag/uri-comp-4
was produced by Tim Bray about this, and the results have been more
or less folded into the new URI spec.

http://www.ietf.org/rfc/rfc3986.txt

See section 6

In a way, as the URI spec says you can send the same URI in various
forms and mean the same thing, I am not doing users a favor if my
system does not recognize this.

In practice, it avoids frustrating bugs like having equivalent URIs  
stand for different things.
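
To illustrate the lowest level of this, here is a minimal Python sketch
of percent-encoding normalization as described in RFC 3986 sections
6.2.2.1 and 6.2.2.2. It is only an illustration, not what Cwm, RAP or
Jena actually do. The names normalize_pct and same_uri are hypothetical,
and the sketch deliberately skips the other steps in section 6 (scheme
and host case, dot segments, scheme-specific rules).

```python
import re

# Unreserved characters per RFC 3986 section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# Matches one percent-encoded octet, e.g. "%2c" or "%7E".
PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_pct(uri: str) -> str:
    """Hypothetical helper: percent-encoding normalization only."""

    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            # Section 6.2.2.2: an encoded unreserved character is decoded.
            return char
        # Section 6.2.2.1: hex digits of a remaining escape are uppercased.
        return "%" + match.group(1).upper()

    return PCT.sub(fix, uri)


def same_uri(a: str, b: str) -> bool:
    """Hypothetical helper: compare two URIs after normalization."""
    return normalize_pct(a) == normalize_pct(b)
```

Under this rule "%7Euser" and "~user" compare equal, and "%2c" and "%2C"
compare equal. A comma is a reserved character (a sub-delim in RFC 3986
section 2.2), so this level of normalization leaves "%2C" encoded and
same_uri("http://some.example/example,first",
"http://some.example/example%2Cfirst") is False. Treating that pair as
equal would need a stronger, scheme-specific rule.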

Cwm will canonicalize only with the --closure=n flag set. Also, it
does canonicalize numbers.

Tim BL


On 2007-07-09, at 06:55, Seaborne, Andy wrote:

>
> -------- Original Message --------
>> From: T.Heath <>
>> Date: 8 July 2007 14:36
>>
>> Hi all,
>>
>> I've come across an issue with SPARQL queries over graphs in which
>> URIs vary in their use of %-encoding, and hope members of this list
>> may be able to help out...
>>
>> Imagine you have two RDF graphs that reference the same URIs, except
>> that in one graph special characters in the URIs are %-encoded, and
>> in the second they are not. For example:
>>
>> <http://some.example/example,first> in graph1 vs.
>> <http://some.example/example%2Cfirst> in graph2
>>
>> As far as I understand it (although I may be wrong) both these URIs
>> are the "same", despite their different syntactic form. However, when
>> performing SPARQL queries over the merge of the two graphs these two
>> URIs are not treated as the same, therefore making joins of the data
>> impossible (without pre-processing). I noticed this behaviour first
>> in RAP, but we've been able to replicate the effect in Jena also.
>>
>> So, my question is: is this a bug in RAP, Jena, and presumably other
>> frameworks, or are there cases in which this is actually the desired
>> behaviour (i.e. it's a feature not a bug)? If the latter is true,
>> does this suggest that as a community we need a convention that we
>> will always mint and use URIs in which special characters are
>> %-encoded (or the other way around) in order to avoid this kind of
>> situation?
>>
>> Any thoughts/pointers/enlightenment much appreciated,
>
>
> There is a difference between being an escape mechanism and being an
> encoding mechanism.  %2C is not a way to escape a comma into a URI -
> it's a way of encoding the information.  The difference is whether the
> URI really contains "," (escaping) or whether it really contains "%2C"
> (encoding).  In the case of URIs, it's an encoding scheme and the URI
> really does contain the "%2C", and not ",".
>
> For example: in a programming language that uses \n for newline, the
> string "\n" contains a single newline character, and it's of length
> 1, not 2.
>
> RFC3986 gives advice on when to encode (sec 2.4), which is when the
> URI is turned from its subcomponents into the URI character string.
> Reverse at the other end.  But while it's a URI character string, it
> is just a sequence of characters without interpretation of
> %-encoding.
>
> For RDF, which is not constructing URIs from sub-components, the URI
> is the character string. RDF should not change it (apply %-rules) to
> do comparisons.
>
> So my understanding of:
> """
> Two RDF URI references are equal if and only if they compare as equal,
> character by character, as Unicode strings.
> """
> is that it means compare strings as given, not by applying %-decoding.
>
> http://some.example/example,first is not the same as
> http://some.example/example%2Cfirst.
>
> 	Andy
>
>
>>
>> Cheers,
>>
>> Tom.
>>
>> P.S. FWIW the Dbpedia community has recently settled on always using
>> %-encoded URIs.
>>
>> --
>> Tom Heath
>> PhD Student
>> Knowledge Media Institute
>> The Open University
>> Walton Hall
>> Milton Keynes
>> MK7 6AA
>> United Kingdom
>>
>> Tel: +44 (0)1908 653565
>> Fax: +44 (0)1908 653169
>> Web/URI: http://kmi.open.ac.uk/people/tom/
>> Jabber: t.heath%open.ac.uk@buddyspace.org
Received on Monday, 9 July 2007 18:28:19 UTC