Re: Canonicalization implementation from Phil Archer on 2009-04-09 (public-powderwg@w3.org from April 2009)

From: Phil Archer <phil@philarcher.org>
Date: Thu, 09 Apr 2009 18:06:18 +0100
To: Thomas Roessler <tlr@w3.org>
CC: Public POWDER <public-powderwg@w3.org>
Message-ID: <49DE2B0A.5080305@philarcher.org>
OK, just before I shut down for a family break over Easter I've managed 
to make some progress with this. The problem with the normalisation to 
Form C was due to the string already being in that form. Re-applying the 
normalisation to a string that is already NFC caused a problem, at least 
with the Perl library I am using [1].

Thankfully that module includes a check routine so the normalisation 
only runs if it's needed. I really could do with finding a string that 
isn't already in NFC to test this out.
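
As an illustration of the check-then-normalise approach, here is a minimal sketch in Python rather than the Perl module referenced at [1]. The function name to_nfc and the sample string are assumptions made for illustration. The sample is "c" followed by U+0327 COMBINING CEDILLA, which is canonically equivalent to "ç" but is not in NFC, so it is one candidate for a test string of the kind wanted above.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC, normalising only when it is not already NFC."""
    # unicodedata.is_normalized needs Python 3.8 or later.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


# Hypothetical test input: "c" + U+0327 COMBINING CEDILLA (decomposed form).
decomposed = "Franc\u0327ois"
# After normalisation the pair collapses into the single character U+00E7.
composed = to_nfc(decomposed)
```

Feeding the already-composed result back into to_nfc returns it unchanged, which is the case that caused trouble when normalisation was applied unconditionally.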

With that problem fixed the standalone canonicalisation tool [2] now 
seems to be running smoothly. Following this, and with full 
acknowledgement that Thomas is well within his rights to say "I told you 
so" the changes to the canonicalisation routine need to be:

1. Forget the + to spaces (see original mail)
2. Forget the 'except the query string' - same reason
3. Switch the order of removing any trailing . characters from the host 
and applying the ToASCII function.
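
A rough sketch of step 3 in Python, under two assumptions: that the intended order is to strip trailing dots first and then apply ToASCII (the list above only says the order is switched), and that Python's built-in "idna" codec, which implements IDNA 2003, is an acceptable stand-in for the LibIDN ToASCII call. The function name canonical_host and the host names are illustrative only.

```python
def canonical_host(host: str) -> str:
    """Strip trailing dots from a host name, then apply ToASCII."""
    # Assumed order: remove every trailing "." before the IDNA conversion.
    stripped = host.rstrip(".")
    # The "idna" codec applies ToASCII label by label and returns bytes.
    return stripped.encode("idna").decode("ascii")


# Hypothetical inputs: an internationalised host and a plain ASCII one.
idn_host = canonical_host("b\u00fccher.example.com.")
ascii_host = canonical_host("example.com..")
```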

I need to document this before people spend time looking at this. 
Tuesday... (family life really is taking over here, need to stop).

Cheers

Phil.

[1] http://search.cpan.org/~sadahiro/Unicode-Normalize-1.02/Normalize.pm

Phil Archer wrote:
> Thomas,
> 
> Having got the docs published on Friday and all the e-mails sent today, 
> I turned to revising my implementation of the canonicalisation steps 
> (http://i-sieve.com/cgi-bin/canon.cgi). My short term aim is to create a 
> standalone tool that demonstrates the canonicalisation steps for 
> candidate IRIs and POWDER doc data so we can play around, create some 
> test data etc.
> 
> OK, so first things first. I think I now see where my confusion has been 
> wrt. % decoding and the +/space transliteration - it's all in the form/cgi 
> stuff. The form parsing script I use includes these lines:
> 
> tr/+/ /;
> s/%(..)/pack("c",hex($1))/ge;
> 
> which does the transliteration and then the %decoding, which is correct 
> and that's why it's in the doc and so firmly fixed in my head. 
> However... I see where you're concerned and correct too - all that does 
> is to make sure that the encoding that the browser does is reversed so 
> if I put in
> 
> http://example.com/staff/Fran%c3%a7ois
> 
> I get out...
> 
> http://example.com/staff/Fran%c3%a7ois
> 
> Hmmm... not the desired outcome. It's necessary _after_ the initial form 
> decoding to _then_ do a second round of % decoding to get to what we 
> actually want which is the c cedilla in the name.
> 
> OK, so in terms of the spec, I can see that the line in section 2.1.4.1 
> that says:
> 
> Percent encoded triples are converted into the characters they 
> represent... is correct and should stay. But
> 
> + characters in the query string are converted to spaces
> 
> is not correct and should go (an IRI with me%20+%20you in the query 
> string would map to one with 3 spaces which is wrong).
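
To make the difference concrete, here is a small sketch in Python rather than the Perl of the form parsing script. It uses the standard urllib.parse functions: unquote performs percent decoding only, while unquote_plus also turns "+" into a space, as form decoding does. The variable names are illustrative.

```python
from urllib.parse import unquote, unquote_plus

# The query-string fragment from the example above.
query = "me%20+%20you"

# Percent decoding only: the literal "+" survives.
plain = unquote(query)

# Form-style decoding: "+" also becomes a space, giving three spaces in a row.
form_style = unquote_plus(query)

# Percent decoding of UTF-8 octets yields the c cedilla in the path.
name = unquote("http://example.com/staff/Fran%c3%a7ois")
```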
> 
> Right, moving on through the canonicalisation steps.
> 
> I found the Perl module that does the LibIDN stuff and got the i-sieve 
> hosting company to install that successfully, so that something like this:
> 
> €ürö.example.com.?me+you=them,finally=this
> 
> is canonicalised properly to
> 
> http://xn--r-1gaq1653a.example.com/?me+you=them,finally=this
> 
> Whoopee!
> 
> Aha... but I missed something. I had to get the hosting company to 
> install a library to normalise the string to Form C as well and that 
> took a little longer. OK, now it is in place so these work with or 
> without normalisation to Form C:
> 
> http://example.com/staff/Fran%c3%a7ois
> 
> http://example.com/my%20doc.doc
> 
> http://www.example.com/foo/his%2Fhers
> 
> See for yourself at http://i-sieve.com/cgi-bin/canon.cgi
> 
> Now try
> 
> €ürö.example.com.?me+you=them,finally=this
> 
> That works too, right? That's because I've made the normalisation to 
> Form C optional. Switch it on and the ToASCII function fails (verbose 
> output is switched on in case of an error).
> 
> Now, it would help enormously if I could test whether the Form C thing 
> is working properly or whether I need to do something more to the code 
> to make it work properly.
> 
> Something you can help with perhaps please?
> 
> Phil.
> 

-- 

Phil Archer
http://philarcher.org/www@20/

i-sieve technologies                |      W3C Mobile Web Initiative
Making Sense of the Buzz            |      www.w3.org/Mobile
Received on Thursday, 9 April 2009 17:06:52 UTC