- From: Phil Archer <parcher@icra.org>
- Date: Thu, 03 May 2007 14:39:52 +0100
- To: Public POWDER <public-powderwg@w3.org>
I took an action item during the face-to-face to look into IRI -> URI mapping. This is important, of course, in supporting grouping of resources by address.

As defined at [1], conversion of an IRI is a simple enough 2-stage process:

1. Convert the character string into a sequence of bytes using the UTF-8 encoding.
2. Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte.

So we end up with Fran%c3%a7ois as the encoding of François.

So far, so easy. But we want to set out a URI canonicalization regime for our candidate resources (to see whether they are elements of the set of resources). And right now step 1 in that process, as written in the draft document [2], says:

  Percent-encoding triplets should be converted into their respective characters (e.g. %3A should be converted to :, %2F to / etc.). N.B. The hexadecimal digits are case-insensitive.

So we're into an endless loop here: IRI-to-URI conversion percent-encodes the non-ASCII characters, and canonicalization then decodes them again. I reckon there are only 2 ways out of it:

1. We must specify that all matching must be done in UTF-8 (so no percent-encoding is necessary).
2. We specify that set definitions can use any character set but that non-ASCII characters must be percent-encoded.

Actually, there is a third way. It is only the reserved characters that need to be escaped in a URI, so maybe we say that percent-encoding triplets for reserved URI characters are converted to their respective characters while other percent-encoded triplets are left encoded? But that starts to look messy to me.

Any ideas?

Phil.

[1] http://www.w3.org/International/O-URL-code.html
[2] http://www.w3.org/2007/powder/Group/powder-grouping/20070423 [member only]
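For illustration only, the two-step conversion, the loop it creates with the draft's canonicalization step, and the "third way" can be sketched in Python. This is a minimal sketch, not text from either document: the function names are hypothetical, the conversion follows the two steps literally as quoted from [1], and the reserved set is assumed to be the RFC 3986 gen-delims plus sub-delims.

```python
import re
from urllib.parse import unquote

# Assumption: "reserved" means the RFC 3986 gen-delims and sub-delims.
RESERVED = ":/?#[]@!$&'()*+,;="


def iri_to_uri_component(text: str) -> str:
    """Apply the two steps literally: UTF-8 encode, then %HH every byte
    that is not an ASCII letter or digit (hypothetical helper)."""
    out = []
    for byte in text.encode("utf-8"):          # step 1: UTF-8 bytes
        ch = chr(byte)
        if byte < 0x80 and ch.isalnum():       # ASCII letter or digit
            out.append(ch)
        else:                                  # step 2: percent-encode
            out.append("%{:02x}".format(byte))
    return "".join(out)


def decode_reserved_only(uri: str) -> str:
    """The 'third way': decode only triplets that stand for reserved
    characters and leave every other triplet encoded (hypothetical)."""
    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))      # hex digits, either case
        return ch if ch in RESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)


encoded = iri_to_uri_component("François")
print(encoded)                                 # Fran%c3%a7ois

# Decoding every triplet, as the draft's step 1 reads, undoes the
# conversion and brings back the non-ASCII character: the loop.
print(unquote(encoded))                        # François

# Decoding only reserved characters leaves the non-ASCII bytes encoded.
print(decode_reserved_only("http%3A%2F%2Fexample.org%2FFran%c3%a7ois"))
```

Applied literally, the quoted rule would also percent-encode characters such as "." and "-", which is one reason the sketch applies it to a single path component rather than to a whole address.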
Received on Thursday, 3 May 2007 13:40:14 UTC