ISSUE-5: Support for IRIs from Phil Archer on 2007-05-03 (public-powderwg@w3.org from May 2007)

From: Phil Archer <parcher@icra.org>
Date: Thu, 03 May 2007 14:39:52 +0100
To: Public POWDER <public-powderwg@w3.org>
Message-ID: <4639E628.4000008@icra.org>

I took an action item during the face to face to look into IRI -> URI 
mapping.

This is important, of course, in supporting grouping of resources by 
address. As defined at [1], converting an IRI to a URI is a simple 
enough 2-stage process:

1. Convert the character string into a sequence of bytes using the UTF-8 
encoding
2. Convert each byte that is not an ASCII letter or digit to %HH, where 
HH is the hexadecimal value of the byte

So that we end up with Fran%c3%a7ois as the encoding of François.
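
As a rough sketch of those two steps in Python (the function name is 
made up for illustration, and it is only meant for a single name or path 
segment, since applied to a whole address it would also escape ":" and 
"/"):

```python
import string

# ASCII letters and digits are the only bytes left untouched (step 2 above).
_KEEP = set((string.ascii_letters + string.digits).encode("ascii"))


def percent_encode_component(text: str) -> str:
    """Hypothetical helper: UTF-8 encode, then %HH every other byte."""
    out = []
    for byte in text.encode("utf-8"):      # step 1: characters -> UTF-8 bytes
        if byte in _KEEP:
            out.append(chr(byte))          # ASCII letter or digit, keep as is
        else:
            out.append(f"%{byte:02x}")     # step 2: anything else -> %HH
    return "".join(out)


print(percent_encode_component("Fran\u00e7ois"))  # prints Fran%c3%a7ois
```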

So far, so good.

But, we want to set out a URI canonicalization regime for our candidate 
resources (to see whether they are elements of the set of resources). 
And right now step 1 in that process as written in the draft document 
[2] says:

Percent-encoding triplets should be converted into their respective 
characters (e.g. %3A should be converted to :, %2F to / etc.). N.B. The 
hexadecimal digits are case-insensitive.
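
To illustrate what that step does to the François example, here is a 
small sketch using Python's standard urllib.parse functions as a 
stand-in for the draft's rule (my assumption, the draft does not name 
any implementation, and UTF-8 is assumed when decoding):

```python
from urllib.parse import quote, unquote

encoded = "Fran%c3%a7ois"

# Draft step 1: turn percent-encoded triplets back into characters.
decoded = unquote(encoded)

# Applying an IRI -> URI mapping again puts the triplets straight back.
re_encoded = quote(decoded)

print(decoded)      # prints François
print(re_encoded)   # prints Fran%C3%A7ois

# The hexadecimal digits are case-insensitive, as the draft notes.
print(unquote("%3A"), unquote("%3a"), unquote("%2F"))  # prints : : /
```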

So we're into an endless loop here. I reckon there are only 2 ways out 
of it:

1. We must specify that all matching must be done in UTF-8 (so no 
percent-encoding is necessary)

2. We specify that set definitions can use any character set but that 
non-ASCII characters must be percent encoded.
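
For what it's worth, option 2 might look something like this in Python. 
The function name and the example.org address are invented for 
illustration, and I am assuming UTF-8 as the byte encoding:

```python
def encode_non_ascii(text: str) -> str:
    """Hypothetical helper: percent-encode only the non-ASCII characters."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # ASCII, including ":" and "/", is left alone
        else:
            # Non-ASCII: UTF-8 bytes, each written as a %HH triplet.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


print(encode_non_ascii("http://example.org/Fran\u00e7ois"))
# prints http://example.org/Fran%C3%A7ois
```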

Actually, there is a third way. It is only the reserved characters that 
need to be escaped in a URI, so maybe we say that percent-encoded 
triplets for reserved URI characters are converted to their respective 
characters, while all other percent-encoded triplets are left encoded? 
But that starts to look messy to me.
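
A minimal sketch of that third way, assuming "reserved" means the 
gen-delims and sub-delims of RFC 3986 (the function name and the example 
address are made up):

```python
import re

# Reserved characters per RFC 3986: gen-delims plus sub-delims (assumption).
RESERVED = set(":/?#[]@!$&'()*+,;=")
_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_reserved_only(uri: str) -> str:
    """Hypothetical helper: decode only triplets that stand for reserved chars."""
    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        # Reserved character: decode it. Anything else: leave the triplet as is.
        return ch if ch in RESERVED else match.group(0)
    return _TRIPLET.sub(replace, uri)


print(decode_reserved_only("http%3A%2F%2Fexample.org%2FFran%c3%a7ois"))
# prints http://example.org/Fran%c3%a7ois
```

Note that the surviving triplets can still differ in hex case (%c3 vs 
%C3), so they would need normalizing before comparison, which is part of 
why this looks messy.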

Any ideas?

Phil.


[1] http://www.w3.org/International/O-URL-code.html
[2] http://www.w3.org/2007/powder/Group/powder-grouping/20070423 [member 
only]

Received on Thursday, 3 May 2007 13:40:14 UTC