Re: %-Encoded (and Non-%-Encoded) URIs in SPARQL Queries from Alan Ruttenberg on 2007-07-09 (semantic-web@w3.org from July 2007)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Mon, 9 Jul 2007 11:27:01 -0400
To: "T.Heath" <T.Heath@open.ac.uk>, Semantic Web <semantic-web@w3.org>, rdfapi-php-interest@lists.sourceforge.net, jena-dev@groups.yahoo.com, "Seaborne, Andy" <andy.seaborne@hp.com>, Richard Newman <rnewman@twinql.com>
Cc: Tim Berners-Lee <timbl@w3.org>, Eric Prud'hommeaux <eric@w3.org>
Message-Id: <75D3875A-FF85-4968-8DE8-1D938E60CED2@gmail.com>
Upon reading Andy and Richard's posts, I think my interpretation  
(that the implementation is buggy) is wrong, but that section 6.4  
that I cite needs to be amended.

Here's the line of thinking:

1) The URI specification says when characters need to be %-encoded,
and these instructions are dependent on the URI scheme. I don't see
any limitation on when you can %-encode characters - in fact this
possibility is explicitly mentioned with reference to characters
such as "~" which may be encoded in some cases.  I do see rules
that, depending on the scheme, ensure you can decode these %-encoded
characters without changing the meaning of the URI.
2) Therefore if an RDF store were to be able to decode these
characters without changing the URI it would need to know details of
each URI scheme. This seems to be an unreasonable burden, as URI
schemes can be invented over time.
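
As a minimal sketch of points 1 and 2 (the URIs below are made up for
illustration), blindly decoding every %-escape can change what a URI
identifies, because a decoded reserved character such as "/" takes on a
structural meaning that the escaped form did not have:

```python
from urllib.parse import unquote, urlsplit

# Hypothetical URIs, for illustration only.
escaped = "http://some.example/a%2Fb"   # one path segment named "a/b"
literal = "http://some.example/a/b"     # two path segments, "a" and "b"

def path_segments(uri):
    """Split the path into segments first, then decode each segment."""
    return [unquote(seg) for seg in urlsplit(uri).path.split("/")[1:]]

# Decoding the whole string blindly makes the two URIs look identical ...
blindly_decoded_equal = unquote(escaped) == literal

# ... even though, read segment by segment, they are different paths.
same_segments = path_segments(escaped) == path_segments(literal)

print(blindly_decoded_equal)  # True
print(same_segments)          # False
```

Which escapes are safe to decode therefore depends on which characters
carry meaning in the scheme at hand, which is exactly the knowledge a
generic RDF store lacks.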

However,
3) Section 6.4 talks about determining whether two URI references are  
equal by making reference to a "Unicode string  that <...> when  
subjected to the encoding described below", and then implies that the  
comparison of URI references be based on (that!?) Unicode string.
4) However that Unicode string is itself subject to encoding/escaping  
with "%".

Let's suppose that the initial Unicode strings are already %-encoded
in some way.

So we have

Unicode string A, encoded to be A' and Unicode string B encoded to be  
B'. The RDF store presumably has access to A' and B' and needs to  
compute A and B to determine whether they are the same. The problem  
is that some of the characters that are escaped in this process  
( e.g. octets that do not correspond to permitted US-ASCII  
characters ) might lead to strings which are the same as some strings  
that are already %-encoded in A and B. There would then be no way to  
determine, looking at A' and B' alone, what the underlying A and B  
are, and so the required comparison is impossible to make.
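
The ambiguity can be sketched in a few lines of Python. The function
below is one reading of the encoding quoted from section 6.4 (UTF-8
encode, then %-escape disallowed octets, leaving "%" itself alone); the
function name and the example URIs are assumptions made for
illustration, not part of any specification or library:

```python
# Characters that the quoted section 6.4 says must be %-escaped: anything
# outside US-ASCII, controls and space, plus the RFC 2396 section 2.4
# "excluded" characters other than "#", "%", "[" and "]".
# (This set is one reading of the text, not a vetted implementation.)
EXCLUDED = set('<>"{}|\\^`')

def rdf_6_4_encode(s):
    out = []
    for octet in s.encode("utf-8"):
        ch = chr(octet)
        if octet <= 0x20 or octet >= 0x7F or ch in EXCLUDED:
            out.append("%{:02X}".format(octet))
        else:
            out.append(ch)   # note: "%" itself passes through unchanged
    return "".join(out)

a = "http://some.example/caf\u00e9"      # A: contains a raw e-acute
b = "http://some.example/caf%C3%A9"      # B: already %-encoded by its author

a_prime = rdf_6_4_encode(a)
b_prime = rdf_6_4_encode(b)

print(a == b)               # False: A and B differ as Unicode strings
print(a_prime == b_prime)   # True: A' and B' are indistinguishable
```

Given only the encoded form, there is no way to tell whether it came
from A or from B.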

--

Andy makes the distinction between encoding and escaping, but 6.4  
confuses this distinction by mentioning both encoding and escaping.

Richard talks about whether the "unescaped form is right". However, I
see no proscription against %-encoding at will (other than
encoding previously encoded URIs), and therefore the %-encoded
version would be right too.

He also says: "All URIs you generate and use should be valid and
normalized". Valid yes, but I don't see any explicit requirement
that URIs to be used in RDF should be normalized.

--

I think 6.4 could be fixed by making it clear that the comparison  
should be made with

         "the Unicode string that is the one that is generated  from  
the initial Unicode string, *after* applying the %-encoding rules  
described in 6.4"

In addition, the SPARQL specification should clarify that it is blind  
to URI encoding schemes, and as a courtesy to the reader, explain the  
reasoning for and consequences of that (described above). In order  
that correct implementations give the same answers given the same  
inputs, it should make it clear that triple stores (and SPARQL  
implementations on top of them) must NOT make any normalizations -  
otherwise they could return different answers given the same inputs.
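
A minimal sketch of what "blind to URI encoding" means in practice,
using plain Python sets rather than any real triple store. The two
subject URIs are the ones from Tom's message; the predicate, the
literal values and the helper name are invented for illustration:

```python
# Two tiny "graphs" as sets of (subject, predicate, object) triples.
graph1 = {("http://some.example/example,first",
           "http://some.example/label", "from graph1")}
graph2 = {("http://some.example/example%2Cfirst",
           "http://some.example/label", "from graph2")}

merged = graph1 | graph2

def objects_for(graph, subject):
    """Match subjects character by character, with no normalization."""
    return sorted(o for (s, _p, o) in graph if s == subject)

print(objects_for(merged, "http://some.example/example,first"))
# ['from graph1']
print(objects_for(merged, "http://some.example/example%2Cfirst"))
# ['from graph2']
```

Every implementation that compares this way gives the same answer for
the same input, at the price of treating the two spellings as two
different resources.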

Finally, some "best practice" document should formalize Richard's
statement "All URIs you generate and use should be normalized" and
make reference to an unambiguous account of what it means to
normalize a URI, recognizing, however, that to the extent that this
requires a knowledge of all URI schemes used, and that this may be an
unreasonable burden for, for example, an aggregator of information,
doing so may not be feasible.

--

An alternative, given the importance of the http URI scheme in
particular, is that the SPARQL specification specify that
implementations should normalize URIs for the http scheme
specifically and therefore avoid a class of confusion (and
unintentional errors) that might commonly occur, such as the one that
Tim identifies.
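
As a rough sketch of what such a normalization could look like, the
Python below applies only the scheme-independent %-encoding rules of
RFC 3986 section 6.2.2 (uppercase the hex digits, decode escapes of
unreserved characters). Using RFC 3986 rather than the RFC 2396 cited
by section 6.4 is an assumption, as is the function name:

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_escapes(uri):
    """Decode escapes of unreserved characters; uppercase all other escapes."""
    def fix(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()
    return ESCAPE.sub(fix, uri)

print(normalize_escapes("http://some.example/%7Euser"))
# http://some.example/~user
print(normalize_escapes("http://some.example/example%2cfirst"))
# http://some.example/example%2Cfirst
```

Note that "," is a reserved character in RFC 3986, so this generic rule
leaves "%2C" escaped and would not by itself merge the two URIs in
Tom's example; an http-specific rule would have to say explicitly
whether they are to be treated as the same.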

-Alan

On Jul 9, 2007, at 12:05 AM, Alan Ruttenberg wrote:

> I think it's a bug in the implementations.  I base this on
> http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref
>
>>  6.4 RDF URI References
>> A URI reference within an RDF graph (an RDF URI reference) is a  
>> Unicode string [UNICODE] that:
>>
>> does not contain any control characters ( #x00 - #x1F, #x7F-#x9F)
>> and would produce a valid URI character sequence (per RFC2396  
>> [URI], sections 2.1) representing an absolute URI with optional  
>> fragment identifier when subjected to the encoding described below.
>> The encoding consists of:
>>
>> encoding the Unicode string as UTF-8 [RFC-2279], giving a sequence  
>> of octet values.
>> %-escaping octets that do not correspond to permitted US-ASCII  
>> characters.
>> The disallowed octets that must be %-escaped include all those
>> that do not correspond to US-ASCII characters, and the excluded
>> characters listed in Section 2.4 of [URI], except for the number
>> sign (#), percent sign (%), and the square bracket characters
>> re-allowed in [RFC-2732].
>>
>> Disallowed octets must be escaped with the URI escaping mechanism  
>> (that is, converted to %HH, where HH is the 2-digit hexadecimal  
>> numeral corresponding to the octet value).
>>
>> Two RDF URI references are equal if and only if they compare as  
>> equal, character by character, as Unicode strings.
>
> -Alan
>
> On Jul 8, 2007, at 9:36 AM, T.Heath wrote:
>
>>
>> Hi all,
>>
>> I've come across an issue with SPARQL queries over graphs in which
>> URIs vary in their use of %-encoding, and hope members of this list
>> may be able to help out...
>>
>> Imagine you have two RDF graphs that reference the same URIs, except
>> that in one graph special characters in the URIs are %-encoded, and
>> in the second they are not. For example:
>>
>> <http://some.example/example,first> in graph1 vs.
>> <http://some.example/example%2Cfirst> in graph2
>>
>> As far as I understand it (although I may be wrong) both these URIs
>> are the "same", despite their different syntactic form. However, when
>> performing SPARQL queries over the merge of the two graphs these two
>> URIs are not treated as the same, therefore making joins of the data
>> impossible (without pre-processing). I noticed this behaviour first
>> in RAP, but we've been able to replicate the effect in Jena also.
>>
>> So, my question is: is this a bug in RAP, Jena, and presumably other
>> frameworks, or are there cases in which this is actually the desired
>> behaviour (i.e. it's a feature not a bug)? If the latter is true, does
>> this suggest that as a community we need a convention that we will
>> always mint and use URIs in which special characters are %-encoded (or
>> the other way around) in order to avoid this kind of situation?
>>
>> Any thoughts/pointers/enlightenment much appreciated,
>>
>> Cheers,
>>
>> Tom.
>>
>> P.S. FWIW the Dbpedia community has recently settled on always using
>> %-encoded URIs.
>>
>> -- 
>> Tom Heath
>> PhD Student
>> Knowledge Media Institute
>> The Open University
>> Walton Hall
>> Milton Keynes
>> MK7 6AA
>> United Kingdom
>>
>> Tel: +44 (0)1908 653565
>> Fax: +44 (0)1908 653169
>> Web/URI: http://kmi.open.ac.uk/people/tom/
>> Jabber: t.heath%open.ac.uk@buddyspace.org
>>
>
Received on Monday, 9 July 2007 16:53:02 UTC