Canonicalization implementation


Having got the docs published on Friday and all the e-mails sent today, 
I turned to revising my implementation of the canonicalisation steps. 
My short-term aim is to create a 
standalone tool that demonstrates the canonicalisation steps for 
candidate IRIs and POWDER doc data so we can play around, create some 
test data, etc.

OK, so first things first. I think I now see where my confusion has been 
wrt. %decoding and +/space transliteration - it's all in the form/CGI 
stuff. The form parsing script I use includes these lines:

tr/+/ /;

which does the transliteration and then the %decoding. That is correct 
for form data, which is why it's in the doc and so firmly fixed in my head. 
However... I see why you're concerned, and you're correct too - all that 
does is make sure that the encoding the browser applies is reversed, so 
if I put in

I get out...

Hmmm... not the desired outcome. It's necessary _after_ the initial form 
decoding to _then_ do a second round of %decoding to get to what we 
actually want, which is the c cedilla in the name.
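
A minimal sketch of the two rounds of decoding, written in Python rather than the Perl of the form script. The submitted value is a made-up stand-in (not the example from this message), and `form_decode` is a hypothetical helper name. It assumes the browser has form-encoded a value that already contained a percent-encoded c cedilla.

```python
from urllib.parse import unquote, unquote_plus

def form_decode(raw: str) -> str:
    """Reverse browser form encoding: '+' becomes a space, then one round of %decoding."""
    return unquote_plus(raw)

# Hypothetical submitted value: the user typed "Fran%C3%A7ois" into the form,
# so the browser escaped each '%' as '%25' before sending it.
submitted = "Fran%25C3%25A7ois"

# Round one: undo what the browser did. This gives back what the user typed.
after_form_decoding = form_decode(submitted)

# Round two: %decode the IRI itself to reach the actual character.
after_second_round = unquote(after_form_decoding)

print(after_form_decoding)
print(after_second_round)
```

The first round only recovers the percent-encoded string as typed; it takes the second round to reach the c cedilla.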

OK, so in terms of the spec, I can see that the line in section 
that says:

Percent encoded triples are converted into the characters they 
represent... is correct and should stay. But

+ characters in the query string are converted to spaces

is not correct and should go (an IRI with me%20+%20you in the query 
string would map to one with three spaces, which is wrong).
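
To make the difference concrete, here is a small Python sketch (again a stand-in for the Perl, using only the standard library) that decodes the me%20+%20you example both ways. The variable names are illustrative only.

```python
from urllib.parse import unquote, unquote_plus

query_value = "me%20+%20you"

# %decoding only: the literal '+' survives between the two decoded spaces.
percent_only = unquote(query_value)

# '+' -> space followed by %decoding: the '+' is lost and three spaces appear.
plus_as_space = unquote_plus(query_value)

print(repr(percent_only))
print(repr(plus_as_space))
```

Only the first result preserves the '+' that was actually present in the IRI.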

Right, moving on through the canonicalisation steps.

I found the Perl module that does the LibIDN stuff and got the i-sieve 
hosting company to install it successfully, so that something like this:


is canonicalised properly to,finally=this


Aha... but I missed something. I had to get the hosting company to 
install a library to normalise the string to Form C as well and that 
took a little longer. OK, now it is in place so these work with or 
without normalisation to Form C:

See for yourself at

Now try


That works too, right? That's because I've made the normalisation to 
Form C optional. Switch it on and the ToASCII function fails (verbose 
output is switched on in case of an error).
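
For comparison, a rough Python sketch of a ToASCII step with the Form C normalisation as a switch. It uses Python's built-in "idna" codec (IDNA 2003), not LibIDN or the Perl module, so its behaviour may differ from the tool described here. The function name `host_to_ascii`, the `normalise_nfc` flag and the host name are all hypothetical.

```python
import unicodedata

def host_to_ascii(host: str, normalise_nfc: bool = True) -> str:
    """Convert a host name to its ASCII (punycode) form, optionally applying NFC first."""
    if normalise_nfc:
        # Optional step: normalise to Unicode Normalization Form C.
        host = unicodedata.normalize("NFC", host)
    # The built-in "idna" codec applies ToASCII to each label of the host name.
    # It raises UnicodeError if a label cannot be converted.
    return host.encode("idna").decode("ascii")

# Hypothetical host written with a precomposed u-umlaut (U+00FC).
print(host_to_ascii("b\u00fccher.example"))
print(host_to_ascii("b\u00fccher.example", normalise_nfc=False))
```

For an input that is already in Form C, switching the normalisation on or off should make no difference to the result.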

Now, it would help enormously if I could test whether the Form C thing 
is working properly or whether I need to do something more to the code 
to make it work properly.
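
One possible way to check a Form C step is to feed it decomposed sequences whose precomposed forms are known, and compare. The sketch below does this in Python with the standard `unicodedata` module; the helper `to_form_c` and the three test pairs are illustrative choices, and the same pairs could be run through whatever library the real tool uses.

```python
import unicodedata

def to_form_c(text: str) -> str:
    """Normalise text to Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)

# Each pair is (decomposed input, expected precomposed output).
test_pairs = [
    ("c\u0327", "\u00e7"),  # c + combining cedilla -> c with cedilla
    ("u\u0308", "\u00fc"),  # u + combining diaeresis -> u with diaeresis
    ("e\u0301", "\u00e9"),  # e + combining acute -> e with acute
]

for decomposed, expected in test_pairs:
    result = to_form_c(decomposed)
    status = "ok" if result == expected else "FAILED"
    print(f"{decomposed!r} -> {result!r}: {status}")
```

If the normalisation is working, every decomposed input collapses to its single precomposed character, and input that is already precomposed comes back unchanged.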

Something you can help with perhaps please?



Phil Archer

i-sieve technologies                |      W3C Mobile Web Initiative
Making Sense of the Buzz            |

Received on Monday, 6 April 2009 16:23:02 UTC