Re: [jena-dev] RE: %-Encoded (and Non-%-Encoded) URIs in SPARQL Queries from Jeremy Carroll on 2007-07-10 (semantic-web@w3.org from July 2007)

From: Jeremy Carroll <jjc@hpl.hp.com>
Date: Tue, 10 Jul 2007 12:14:14 +0100
To: jena-dev@yahoogroups.com
CC: Tim Berners-Lee <timbl@w3.org>, "T.Heath" <T.Heath@open.ac.uk>, semantic-web@w3.org, rdfapi-php-interest@lists.sourceforge.net
Message-ID: <46936A06.3010706@hpl.hp.com>

A couple of quotes from RFC 3987 that appear relevant are:

[[
   If two IRIs, when considered as character strings, are identical,
   then it is safe to conclude that they are equivalent.  This type of
   equivalence test has very low computational cost and is in wide use
   in a variety of applications, particularly in the domain of parsing.
   It is also used when a definitive answer to the question of IRI
   equivalence is needed that is independent of the scheme used and that
   can be calculated quickly and without accessing a network.  An
   example of such a case is XML Namespaces ([XMLNamespace]).
]]

and concerning %-encoding in particular

[[
   If the IRI is to be passed to another
   application or used further in some other way, its original form MUST
   be preserved.  The conversion described here should be performed only
   for local comparison.
]]

In a semantic web context, I interpret that as suggesting that if we 
have references to resources using the three IRIs
   "http://example.org/~user",
   "http://example.org/%7euser", and
   "http://example.org/%7Euser"
then we should not treat these as (syntactically) identical, but we 
could add owl:sameAs triples to capture their semantic identity.

Jeremy
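
One way to read this in code is the minimal Python sketch below. It is
not what Jena or RAP actually do. It computes a key for local comparison
using only the percent-encoding normalisations of RFC 3986 section 6.2.2
(upper-case the hex digits, decode unreserved characters), keeps every IRI
in its original form, and emits owl:sameAs statements as N-Triples for
IRIs whose keys match. The names comparison_key and same_as_triples are
made up for this illustration, and whether to assert owl:sameAs at all
remains a modelling decision.

```python
import re
from collections import defaultdict

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
PCT = re.compile(r"%([0-9A-Fa-f]{2})")
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def comparison_key(iri: str) -> str:
    """Return a key for local comparison only; callers keep the original IRI."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                 # e.g. %7e -> ~
        return match.group(0).upper()   # e.g. %2c -> %2C, stays encoded
    return PCT.sub(fix, iri)


def same_as_triples(iris):
    """Group IRIs by comparison key; link each group with owl:sameAs."""
    groups = defaultdict(list)
    for iri in iris:
        groups[comparison_key(iri)].append(iri)
    triples = []
    for members in groups.values():
        first = members[0]
        for other in members[1:]:
            # The original spellings are written out unchanged.
            triples.append(f"<{first}> <{OWL_SAME_AS}> <{other}> .")
    return triples


iris = [
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
]
for line in same_as_triples(iris):
    print(line)
```

With the three IRIs above this prints two owl:sameAs statements, each
linking the first spelling to one of the others. A "," is not in the
unreserved set, so a %2C is left encoded and an IRI containing "," gets
a different key from one containing "%2C".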




Seaborne, Andy wrote:
> Tim,
> 
> I agree systems should do the helpful thing, especially as I was bitten
> by the "~"/%7E thing only last week.  
> 
> (I've included pointers and text so others can quickly find the places
> I'm talking about.)
> 
> rfc3986.txt/6.2.2.2 says unreserved characters can be decoded and
> specifically points to where unreserved is defined in 2.3, but does not
> go further and say that any character that did not need to be encoded
> can be decoded - there's no mention of component parts.
> 
> 6.2.2.2.
> [[
> These URIs should
> be normalized by decoding any percent-encoded octet that corresponds
> to an unreserved character, as described in Section 2.3.
> ]]
> 
> and section 2.3 says:
> [[
> unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
> ]]
> 
> which does not list ",".  So, in the absence of any other information,
> my processor concludes the URIs differ, to avoid false positives.
> 
> 2396 has a wider list of unreserved but still no ","
> 
> unreserved  = alphanum | mark
> mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
> 
> Getting scheme specific: RFC 2616, sec 3.2.3
> 
> [[
> Characters other than those in the "reserved" and "unsafe" sets (see
>    RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
> ]]
> 
> but there is no production for "unsafe" in 2396.  There is "unwise".
> 
> [[
> reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
>                     "$" | ","
> ]]
> 
> so it does include ","; hence the 2 URIs are still different.  Some
> domain specific rule might also inform the processor that, say, people's
> names can be written either way in "family,given" form.
> 
> Maybe it could analyse the structure of the URI and conclude that a
> "," at this point is safe and so decode, but it can't really conclude
> that was the intent of the URI producer - I couldn't find anything in
> the HTTP spec that would license that and did find text that spoke
> against it.
> 
> 	Andy
> 
> 
> -------- Original Message --------
>> From: Tim Berners-Lee <mailto:timbl@w3.org>
>> Date: 9 July 2007 18:08
>>
>> I take more or less the opposite view:  It is a Good Thing for systems
>> to canonicalize URIs on input, for data and query.  I know RDF does not
>> specify this.  However, the URI spec gives one the ability to conclude
>> that the URIs are equivalent.   
>>
>> There are several levels of canonicalization you can do.
>> There was a TAG issue about this
>> http://www.w3.org/2001/tag/issues.html#URIEquivalence-15
>> "When are two URI variants considered equivalent?"
>> A draft finding "How to Compare Uniform Resource Identifiers"
>> http://www.textuality.com/tag/uri-comp-4
>>   was produced by Tim Bray  about this and the results have been more
>> or less folded into the new URI spec. 
>>
>> http://www.ietf.org/rfc/rfc3986.txt
>>
>> See section 6
>>
>> In a way, as the URI spec says you can send the same URI in various
>> forms and mean the same thing, I am not doing users a favor if my
>> system does not recognize this.  
>>
>> In practice, it avoids frustrating bugs like having equivalent URIs
>> stand for different things. 
>>
>> Cwm will canonicalize  only with the  --closure=n flag set. Also it
>> does canonicalize numbers. 
>>
>> Tim BL
>>
>>
>> On 2007-07-09, at 06:55, Seaborne, Andy wrote:
>>
>>> -------- Original Message --------
>>>> From: T.Heath <>
>>>> Date: 8 July 2007 14:36
>>>>
>>>> Hi all,
>>>>
>>>> I've come across an issue with SPARQL queries over graphs in which
>>>> URIs vary in their use of %-encoding, and hope members of this list
>>>> may be able to help out...
>>>>
>>>> Imagine you have two RDF graphs that reference the same URIs, except
>>>> that in one graph special characters in the URIs are %-encoded, and
>>>> in the second they are not. For example:
>>>>
>>>> <http://some.example/example,first> in graph1 vs.
>>>> <http://some.example/example%2Cfirst> in graph2
>>>>
>>>> As far as I understand it (although I may be wrong) both these URIs
>>>> are the "same", despite their different syntactic form. However,
>>>> when performing SPARQL queries over the merge of the two graphs
>>>> these two URIs are not treated as the same, therefore making joins
>>>> of the data impossible (without pre-processing). I noticed this
>>>> behaviour first in RAP, but we've been able to replicate the effect
>>>> in Jena also.
>>>>
>>>> So, my question is: is this a bug in RAP, Jena, and presumably other
>>>> frameworks, or are there cases in which this is actually the desired
>>>> behaviour (i.e. it's a feature not a bug)? If the latter is true,
>>>> does this suggest that as a community we need a convention that we
>>>> will always mint and use URIs in which special chars are %-encoded
>>>> (or the other way around) in order to avoid this kind of situation?
>>>>
>>>> Any thoughts/pointers/enlightenment much appreciated,
>>>
>>> There is a difference between being an escape mechanism and being an
>>> encoding mechanism.  %2C is not a way to escape a comma into a URI -
>>> it's a way of encoding the information.  The difference is whether the
>>> URI really contains "," (escaping) or whether it really contains "%2C"
>>> (encoding).  In the case of URIs, it's an encoding scheme and the URI
>>> really does contain the "%2C", and not ",".
>>>
>>> For example: in a programming language that uses \n for newline, the
>>> string "\n" contains a single newline character and is of length 1,
>>> not 2.
>>>
>>> RFC3986 gives advice on when to encode (sec 2.4) which is when the URI
>>> is turned from its subcomponents into the URI character string.
>>> Reverse at the other end.  But while it's a URI character string, it
>>> is just a sequence of characters without interpretation of %-encoding.
>>> For RDF, which is not constructing URIs from sub-components, the URI
>>> is the character string. RDF should not change it (apply %-rules) to
>>> do comparisons.
>>>
>>> So my understanding of:
>>> """
>>> Two RDF URI references are equal if and only if they compare as equal,
>>> character by character, as Unicode strings.
>>> """
>>> is that it means compare strings as given, not by applying %-decoding:
>>> http://some.example/example,first is not the same as
>>> http://some.example/example%2Cfirst.
>>>
>>> 	Andy
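
The distinction drawn above can be checked directly. The following is a
small Python sketch under stated assumptions: graphs are plain sets of
(subject, predicate, object) tuples, the predicates http://some.example/p
and http://some.example/q and the helper join_on_subject are hypothetical,
and no RDF library is involved. It only shows that comparing the URI
strings as given finds no join between the two spellings.

```python
# Escaping: the two source characters \n denote ONE character in the string.
newline = "\n"
# Encoding: the URI string really contains the three characters '%', '2', 'C'.
encoded = "%2C"

# Two tiny "graphs" using the two spellings from the original question.
graph1 = {("http://some.example/example,first", "http://some.example/p", "a")}
graph2 = {("http://some.example/example%2Cfirst", "http://some.example/q", "b")}


def join_on_subject(g1, g2):
    """Pair triples whose subjects are equal character by character."""
    return [(t1, t2) for t1 in g1 for t2 in g2 if t1[0] == t2[0]]


print(len(newline), len(encoded))        # 1 3
print(join_on_subject(graph1, graph2))   # []
```

Under character-by-character equality the join is empty, which is the
behaviour reported for RAP and Jena.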
>>>
>>>
>>>> Cheers,
>>>>
>>>> Tom.
>>>>
>>>> P.S. FWIW the Dbpedia community has recently settled on always
>>>> using %-encoded URIs. 
>>>>
>>>> --
>>>> Tom Heath
>>>> PhD Student
>>>> Knowledge Media Institute
>>>> The Open University
>>>> Walton Hall
>>>> Milton Keynes
>>>> MK7 6AA
>>>> United Kingdom
>>>>
>>>> Tel: +44 (0)1908 653565
>>>> Fax: +44 (0)1908 653169
>>>> Web/URI: http://kmi.open.ac.uk/people/tom/
>>>> Jabber: t.heath%open.ac.uk@buddyspace.org

-- 
Hewlett-Packard Limited
registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England
Received on Tuesday, 10 July 2007 11:14:48 UTC