RE: %-Encoded (and Non-%-Encoded) URIs in SPARQL Queries from Seaborne, Andy on 2007-07-10 (semantic-web@w3.org from July 2007)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Tue, 10 Jul 2007 11:48:01 +0100
To: "Tim Berners-Lee" <timbl@w3.org>
Cc: "T.Heath" <T.Heath@open.ac.uk>, <semantic-web@w3.org>, <rdfapi-php-interest@lists.sourceforge.net>, <jena-dev@groups.yahoo.com>
Message-ID: <86FE9B2B91ADD04095335314BE6906E801474D5F@sdcexc04.emea.cpqcorp.net>
Tim,

I agree systems should do the helpful thing, especially as I was bitten
by the "~"/%7E thing only last week.  

(I've included pointers and text so others can quickly find the places
I'm talking about.)

rfc3986.txt/6.2.2.2 says unreserved characters can be decoded and
specifically points to where unreserved is defined in 2.3, but does not
go further and say that any character that did not need to be encoded
can be decoded - there's no mention of component parts.

6.2.2.2.
[[
These URIs should
be normalized by decoding any percent-encoded octet that corresponds
to an unreserved character, as described in Section 2.3.
]]

and section 2.3 says:
[[
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
]]

which does not list ",".  So, in the absence of any other information,
my processor concludes that the URIs differ, to avoid false positives.
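
To make the 6.2.2.2 rule concrete, here is a minimal Python sketch of that
normalization step. It is only an illustration under stated assumptions:
the function name normalize_unreserved is hypothetical, it is not taken
from Jena, RAP or any other toolkit, and it leaves every other
percent-encoded octet exactly as written.

```python
import re

# RFC 3986, section 2.3:
#   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# A percent-encoded octet: "%" followed by two hex digits.
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(uri: str) -> str:
    """Decode only the %XX triplets that encode an unreserved character.

    Hypothetical helper for illustration; all other triplets (for
    example %2C, the comma) are returned untouched.
    """
    def decode_if_unreserved(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    return PCT_TRIPLET.sub(decode_if_unreserved, uri)


tilde = normalize_unreserved("http://some.example/%7Euser")
comma = normalize_unreserved("http://some.example/example%2Cfirst")
print(tilde)   # the %7E is decoded to "~"
print(comma)   # the %2C is left alone
print(comma == "http://some.example/example,first")
```

Under this rule alone the "~"/%7E pair collapses to one form, while the
","/%2C pair still compares as different.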

RFC 2396 has a wider list of unreserved characters, but still no ",":

unreserved  = alphanum | mark
mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

Getting scheme-specific: RFC 2616, sec 3.2.3

[[
Characters other than those in the "reserved" and "unsafe" sets (see
   RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
]]

but there is no production for "unsafe" in 2396.  There is "unwise".

[[
reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","
]]

so it does include ","; hence the 2 URIs are still different.  Some
domain-specific rule might also inform the processor that, say, people's
names can be written either way in "family,given" form.
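
The RFC 2616 rule quoted above can be sketched as a simple membership
test. This is a rough illustration only: the function name
equivalent_to_encoding is hypothetical, and because RFC 2396 has no
"unsafe" production, that set is left as a caller-supplied parameter
(empty by default) rather than guessed at.

```python
# RFC 2396, section 2.2:
#   reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
RESERVED_2396 = frozenset(";/?:@&=+$,")


def equivalent_to_encoding(char: str, unsafe: frozenset = frozenset()) -> bool:
    """Return True if RFC 2616 sec 3.2.3 would treat `char` as equivalent
    to its "%" HEX HEX encoding, i.e. it is in neither the reserved set
    nor the (assumed, caller-supplied) unsafe set.
    """
    return char not in RESERVED_2396 and char not in unsafe


print(equivalent_to_encoding("~"))   # "~" is not reserved in RFC 2396
print(equivalent_to_encoding(","))   # "," is reserved in RFC 2396
```

Whatever "unsafe" is taken to mean, "," is already excluded by the
reserved set, so this rule gives no licence to equate "," with %2C.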

Maybe it could analyse the structure of the URI and conclude that a
"," at this point is safe and so decode, but it can't really conclude
that was the intent of the URI producer - I couldn't find anything in
the HTTP spec that would license this, and did find text that spoke
against it.

	Andy


-------- Original Message --------
> From: Tim Berners-Lee <mailto:timbl@w3.org>
> Date: 9 July 2007 18:08
> 
> I take more or less the opposite view:  It is a Good Thing for systems
> to canonicalize URIs on input, of data and query.  I know RDF does not
> specify this.  However, the URI spec gives one the ability to conclude
> that the URIs are equivalent.   
> 
> There are several levels of canonicalization you can do.
> There was a TAG issue about this
> http://www.w3.org/2001/tag/issues.html#URIEquivalence-15
> "When are two URI variants considered equivalent?"
> A draft finding "How to Compare Uniform Resource Identifiers"
> http://www.textuality.com/tag/uri-comp-4
>   was produced by Tim Bray  about this and the results have been more
> or less folded into the new URI spec. 
> 
> http://www.ietf.org/rfc/rfc3986.txt
> 
> See section 6
> 
> In a way, as the URI spec says you can send the same URI in various
> forms and mean the same thing, I am not doing users a favor if my
> system does not recognize this.  
> 
> In practice, it avoids frustrating bugs like having equivalent URIs
> stand for different things. 
> 
> Cwm will canonicalize  only with the  --closure=n flag set. Also it
> does canonicalize numbers. 
> 
> Tim BL
> 
> 
> On 2007-07-09, at 06:55, Seaborne, Andy wrote:
> 
> > 
> > -------- Original Message --------
> > > From: T.Heath <>
> > > Date: 8 July 2007 14:36
> > > 
> > > Hi all,
> > > 
> > > > I've come across an issue with SPARQL queries over graphs in which
> > > > URIs vary in their use of %-encoding, and hope members of this list
> > > > may be able to help out...
> > > 
> > > > Imagine you have two RDF graphs that reference the same URIs, except
> > > > that in one graph special characters in the URIs are %-encoded, and
> > > > in the second they are not. For example:
> > > 
> > > <http://some.example/example,first> in graph1 vs.
> > > <http://some.example/example%2Cfirst> in graph2
> > > 
> > > > As far as I understand it (although I may be wrong) both these URIs
> > > > are the "same", despite their different syntactic form. However,
> > > > when performing SPARQL queries over the merge of the two graphs
> > > > these two URIs are not treated as the same, therefore making joins
> > > > of the data impossible (without pre-processing). I noticed this
> > > > behaviour first in RAP, but we've been able to replicate the effect
> > > > in Jena also.
> > > 
> > > > So, my question is: is this a bug in RAP, Jena, and presumably other
> > > > frameworks, or are there cases in which this is actually the desired
> > > > behaviour (i.e. it's a feature not a bug)? If the latter is true,
> > > > does this suggest that as a community we need a convention that we
> > > > will always mint and use URIs in which special chars are %-encoded
> > > > (or the other way around) in order to avoid this kind of situation?
> > > 
> > > Any thoughts/pointers/enlightenment much appreciated,
> > 
> > 
> > There is a difference between being an escape mechanism and being an
> > encoding mechanism.  %2C is not a way to escape a comma into a URI -
> > it's a way of encoding the information.  The difference is whether the
> > URI really contains "," (escaping) or whether it really contains "%2C"
> > (encoding).  In the case of URIs, it's an encoding scheme and the URI
> > really does contain the "%2C", and not ",".
> > 
> > For example: in a programming language, using \n for newline and the
> > string "\n" then there is the single newline character in the string,
> > and it's of length 1, not 2.
> > 
> > RFC3986 gives advice on when to encode (sec 2.4) which is when the URI
> > is turned from its subcomponents into the URI character string.
> > Reverse at the other end.  But while it's a URI character string, it
> > is just a sequence of characters without interpretation of %-encoding.
> > 
> > For RDF, which is not constructing URIs from sub-components, the URI
> > is the character string. It should not change it (apply %-rules) to
> > do comparisons. 
> > 
> > So my understanding of:
> > """
> > Two RDF URI references are equal if and only if they compare as equal,
> > character by character, as Unicode strings.
> > """
> > is that it means compare strings as given, not by applying %-decoding
> > 
> > http://some.example/example,first is not the same as
> > http://some.example/example%2Cfirst.
> > 
> > 	Andy
> > 
> > 
> > > 
> > > Cheers,
> > > 
> > > Tom.
> > > 
> > > P.S. FWIW the Dbpedia community has recently settled on always
> > > using %-encoded URIs. 
> > > 
> > > --
> > > Tom Heath
> > > PhD Student
> > > Knowledge Media Institute
> > > The Open University
> > > Walton Hall
> > > Milton Keynes
> > > MK7 6AA
> > > United Kingdom
> > > 
> > > Tel: +44 (0)1908 653565
> > > Fax: +44 (0)1908 653169
> > > Web/URI: http://kmi.open.ac.uk/people/tom/
> > > Jabber: t.heath%open.ac.uk@buddyspace.org
Received on Tuesday, 10 July 2007 10:48:13 UTC