Canonicalization implementation from Phil Archer on 2009-04-06 (public-powderwg@w3.org from April 2009)

From: Phil Archer <phil@philarcher.org>
Date: Mon, 06 Apr 2009 17:22:25 +0100
To: Thomas Roessler <tlr@w3.org>
CC: Public POWDER <public-powderwg@w3.org>
Message-ID: <49DA2C41.9070309@philarcher.org>

Thomas,

Having got the docs published on Friday and all the e-mails sent today, 
I turned to revising my implementation of the canonicalisation steps 
(http://i-sieve.com/cgi-bin/canon.cgi). My short term aim is to create a 
standalone tool that demonstrates the canonicalisation steps for 
candidate IRIs and POWDER doc data so we can play around, create some 
test data etc.

OK, so first things first. I think I now see where my confusion has been 
wrt. %decoding and +/space transliteration - it's all in the form/CGI 
stuff. The form parsing script I use includes these lines:

tr/+/ /;
s/%(..)/pack("c",hex($1))/ge;

which does the transliteration and then the %decoding, which is correct 
for form data and that's why it's in the doc and so firmly fixed in my 
head. However... I see what you're concerned about, and you're correct 
too - all that does is reverse the encoding that the browser applies, so 
if I put in

http://example.com/staff/Fran%c3%a7ois

I get out...

http://example.com/staff/Fran%c3%a7ois

Hmmm... not the desired outcome. It's necessary _after_ the initial form 
decoding to _then_ do a second round of % decoding to get to what we 
actually want, which is the c cedilla in the name.
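
A minimal sketch of the two rounds of decoding, written in Python rather 
than the Perl the tool uses. It assumes the browser percent-encodes the 
whole form field (so each % arrives as %25); the variable names are 
illustrative only.

```python
from urllib.parse import quote, unquote, unquote_plus

# The IRI as typed into the form field.
typed = "http://example.com/staff/Fran%c3%a7ois"

# Assumption: the browser percent-encodes the field value, so '%' -> '%25'.
submitted = quote(typed, safe="")

# Round 1: form decoding ('+' to space, then % decoding).
# This only undoes the browser's encoding and returns what was typed.
after_form = unquote_plus(submitted)

# Round 2: percent-decode the IRI itself (bytes interpreted as UTF-8).
decoded = unquote(after_form)

print(after_form)
print(decoded)
```

After round 1 the string is identical to what was typed; only round 2 
turns %c3%a7 into the c cedilla.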

OK, so in terms of the spec, I can see that the line in section 2.1.4.1 
that says:

Percent encoded triples are converted into the characters they 
represent... is correct and should stay. But

+ characters in the query string are converted to spaces

is not correct and should go (an IRI with me%20+%20you in the query 
string would map to one with 3 spaces which is wrong).
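
The difference can be seen in a short Python sketch (again, not the 
tool's own code), comparing decoding with and without the + rule:

```python
from urllib.parse import unquote, unquote_plus

query = "me%20+%20you"

# With the '+ becomes space' rule: the literal '+' is lost.
with_plus_rule = unquote_plus(query)

# Percent decoding only: the '+' survives between the two spaces.
percent_only = unquote(query)

print(repr(with_plus_rule))
print(repr(percent_only))
```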

Right, moving on through the canonicalisation steps.

I found the Perl module that does the LibIDN stuff and got the i-sieve 
hosting company to install it successfully. So something like this:

€ürö.example.com.?me+you=them,finally=this

is canonicalised properly to

http://xn--r-1gaq1653a.example.com/?me+you=them,finally=this
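
For comparison, the host part of that conversion can be sketched with 
Python's built-in "idna" codec (IDNA 2003) instead of the Perl LibIDN 
binding. Dropping the trailing dot is an assumption based on the output 
shown above, and the variable names are illustrative.

```python
# The host from the example, written with escapes:
# U+20AC EURO SIGN, U+00FC u-umlaut, 'r', U+00F6 o-umlaut.
host = "\u20ac\u00fcr\u00f6.example.com."

# The built-in 'idna' codec applies nameprep and Punycode per label.
ascii_host = host.encode("idna").decode("ascii")

# Assumption: the canonical form drops the trailing dot of the FQDN.
ascii_host = ascii_host.rstrip(".")

print(ascii_host)
```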

Whoopee!

Aha... but I missed something. I had to get the hosting company to 
install a library to normalise the string to Form C as well and that 
took a little longer. OK, now it is in place so these work with or 
without normalisation to Form C:

http://example.com/staff/Fran%c3%a7ois

http://example.com/my%20doc.doc

http://www.example.com/foo/his%2Fhers

See for yourself at http://i-sieve.com/cgi-bin/canon.cgi

Now try

€ürö.example.com.?me+you=them,finally=this

That works too, right? That's because I've made the normalisation to 
Form C optional. Switch it on and the ToASCII function fails (verbose 
output is switched on in case of an error).

Now, it would help enormously if I could test whether the Form C thing 
is working properly or whether I need to do something more to the code.
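
One possible check, sketched in Python: feed in a name whose c cedilla 
is decomposed (c followed by U+0327 COMBINING CEDILLA) and confirm that 
Form C turns it into the single precomposed character. The test string 
is an invented example, not one from the spec.

```python
import unicodedata
from urllib.parse import quote

decomposed = "Franc\u0327ois"   # 'c' + U+0327 COMBINING CEDILLA
precomposed = "Fran\u00e7ois"   # U+00E7 LATIN SMALL LETTER C WITH CEDILLA

# Form C composes the pair into the single precomposed character.
normalised = unicodedata.normalize("NFC", decomposed)

# Percent-encoded form of the decomposed name, usable as a test input.
test_input = "http://example.com/staff/" + quote(decomposed)

print(normalised == precomposed)
print(test_input)
```

If normalisation is working, the decomposed input and 
http://example.com/staff/Fran%c3%a7ois should canonicalise to the same 
result.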

Something you can help with perhaps please?

Phil.

-- 

Phil Archer
http://philarcher.org/www@20/

i-sieve technologies                |      W3C Mobile Web Initiative
Making Sense of the Buzz            |      www.w3.org/Mobile

Received on Monday, 6 April 2009 16:23:02 UTC