Results of Canonicalization implementation and proposed changes (was Re: Canonicalization implementation) from Phil Archer on 2009-04-16 (public-powderwg@w3.org from April 2009)

From: Phil Archer <phil@philarcher.org>
Date: Thu, 16 Apr 2009 10:07:46 +0100
To: Public POWDER <public-powderwg@w3.org>
CC: Thomas Roessler <tlr@w3.org>
Message-ID: <49E6F562.4060207@philarcher.org>
Hi all,

I'm maintaining the e-mail thread here but this mail is something of a 
departure as it ends up with proposed changes to the canonicalisation 
sections of the Grouping Doc.

I've been working on the various issues raised by Thomas Roessler 
regarding canonicalisation. This has been useful and does, I dare to 
hope, lead to something sensible. First some clearing up.

The current spec[1] confuses transmission of data as the output of a 
form and what you actually end up with after you've completed the form 
processing. This is the origin of the following statements:

* Percent encoded triples are converted into the characters they 
represent (e.g. %c3%a7 becomes ç etc.). Note the hexadecimal digits are 
case-insensitive. However, *except for the query string*, reserved 
characters as per Section 6.2.2.2 of RFC 3986 [URIS] must not be 
converted to literals, as that may invalidate the URI/IRI - the reason a 
URI would contain (for example) %2F instead of / would be to distinguish 
between a literal /, such as in 'his/hers', and the / which is used as a 
path separator.

* + characters in the query string are converted to spaces

This is wrong.

Processing of form data is quite separate and defined elsewhere. If you 
enter "http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg" as (GET) form 
data, what gets transmitted is 
"http%3A%2F%2Fexample.com%2FStaff%2FFran%25c3%25a7ois%2Fmugshot.jpg". 
Note that the c cedilla is actually transmitted as %25c3%25a7, i.e. the 
% character is itself percent encoded (%25). Once you've got this back 
out of a standard form processing script you are back to what you 
started with, which is http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg.

Now, we _do_ want to process this, but all the stuff about query strings 
and + characters being transliterated into spaces is misplaced.
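
The round trip described above can be sketched in Python. This is only 
an illustration: quote_plus/unquote_plus from the standard library stand 
in for the browser's form encoding and the form processing script, and 
the names as_transmitted and form_decode are made up for the example.

```python
from urllib.parse import quote_plus, unquote_plus

# The value typed into the (GET) form, taken from the example above.
entered = "http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg"


def as_transmitted(value):
    """Approximate what the browser sends: every '%' becomes '%25'."""
    return quote_plus(value)


def form_decode(value):
    """Approximate a form script: '+' to space, then one round of %-decoding."""
    return unquote_plus(value)


transmitted = as_transmitted(entered)
recovered = form_decode(transmitted)

print(transmitted)
print(recovered == entered)  # form decoding only undoes the form encoding
```

The form decoding hands back the string as it was typed, still 
containing %c3%a7, so the percent decoding in the canonicalisation steps 
is a separate, later operation.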

And here's another error: the reference to the URI spec is incorrect. 
Reserved characters are defined in section 2.2 of RFC 3986, not 6.2.2.2.

But... we also need NOT to convert %20 into spaces since to do so 
creates an invalid URI. Our current examples table has this:

http://example.com/my%20doc.doc
canonicalised to
http://example.com/my doc.doc

That can't be right, so I think we should say that spaces should also 
be left percent encoded. It's probably possible to extend this to white 
space in general but I think that's going beyond the call of duty.

I therefore propose to remove the reference to XFORMS from the intro to 
2.1.4 and change the opening lines of section 2.1.4.1 to:

* If not already so encoded, the IRI character string is converted into 
a sequence of Unicode [UNICODE] characters.

* Percent encoded triples are converted into the characters they 
represent (e.g. %c3%a7 becomes ç etc.). Note the hexadecimal digits are 
case-insensitive. However space characters (%20) and reserved characters 
as per Section 2.2 of RFC 3986 [URIS] must not be converted to literals, 
as that may invalidate the URI/IRI - the reason a URI would contain (for 
example) %2F instead of / would be to distinguish between a literal /, 
such as in 'his/hers', and the / which is used as a path separator.

Drop the line about changing + into spaces completely.
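
A minimal Python sketch of the proposed percent decoding rule follows. 
It assumes percent encoded triples represent UTF-8 bytes, and it leaves 
any sequence that is not valid UTF-8 untouched; both choices, and the 
function name decode_percent_triples, are assumptions made for the 
example. It does not treat other awkward cases such as %25 or control 
characters, which the proposed text does not address.

```python
import re

# Reserved characters per RFC 3986 section 2.2 (gen-delims and sub-delims).
RESERVED = set(":/?#[]@!$&'()*+,;=")
KEEP_ENCODED = RESERVED | {" "}

_TRIPLE = re.compile(r"%[0-9A-Fa-f]{2}")
_TRIPLE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    triples = _TRIPLE.findall(match.group(0))
    out = []
    i = 0
    while i < len(triples):
        byte = int(triples[i][1:], 16)
        if byte < 0x80:
            ch = chr(byte)
            # Spaces and reserved characters stay percent encoded.
            out.append(triples[i] if ch in KEEP_ENCODED else ch)
            i += 1
            continue
        # Gather a run of non-ASCII bytes and try to read it as UTF-8.
        j = i
        while j < len(triples) and int(triples[j][1:], 16) >= 0x80:
            j += 1
        raw = bytes(int(t[1:], 16) for t in triples[i:j])
        try:
            out.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            out.append("".join(triples[i:j]))  # not valid UTF-8: leave as is
        i = j
    return "".join(out)


def decode_percent_triples(iri):
    """Decode %XX triples except spaces and RFC 3986 reserved characters."""
    return _TRIPLE_RUN.sub(_decode_run, iri)
```

With this sketch, http://example.com/staff/Fran%c3%a7ois gains its 
c cedilla, while my%20doc.doc and his%2Fhers come back unchanged.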

Normalise to Form C is fine.
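
For testing the Form C step, a string that is not already normalised is 
easy to build from a base letter plus a combining mark. The sketch below 
uses Python's unicodedata module (is_normalized needs Python 3.8 or 
later) and only normalises when needed; the function name to_nfc is made 
up for the example.

```python
import unicodedata


def to_nfc(text):
    """Return text in Normalization Form C, normalising only if needed."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


# 'c' followed by U+0327 COMBINING CEDILLA is not in Form C ...
decomposed = "Franc\u0327ois"
# ... and normalises to the single character U+00E7.
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))
```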

2.1.4.2 Default values and case folding
Is fine, although I did the steps in a different order, so I'd rearrange 
it a little to say:

* Where the authority is present, but the scheme is absent, the scheme 
should default to http.

* If the Path is absent, a path of / is appended.

* Change the scheme *and host* string to ASCII lowercase.

* Trailing . characters in the host are removed, i.e. 
http://www.example.com./ becomes http://www.example.com/

* If the host string does not completely consist of ASCII characters, 
apply the ToASCII operation to the host string, with the 
UseSTD3ASCIIRules flag unset and the AllowUnassigned flag set [RFC 
3490]. Note that behavior if the ToASCII operation fails is undefined.

* If the port is specified, but it is the default port for the scheme, 
it is removed.
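
The steps above, in the order listed, can be sketched in Python as 
follows. Several things here are assumptions rather than part of the 
proposal: the simple regular expression used to split the IRI (it only 
recognises a scheme when followed by ://, and does not cater for IPv6 
literals), the DEFAULT_PORTS table, and the use of Python's built-in 
"idna" codec as a stand-in for the ToASCII operation (its flag settings 
have not been checked against the ones named above).

```python
import re
import string

DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumed table, extend as needed
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
# scheme (optional, only when followed by ://), authority, path, the rest
_IRI = re.compile(r"^(?:([A-Za-z][A-Za-z0-9+.-]*)://)?([^/?#]*)([^?#]*)(.*)$", re.S)
_HOSTPORT = re.compile(r"^(.*?)(?::(\d*))?$", re.S)


def apply_defaults_and_case_folding(iri):
    scheme, authority, path, rest = _IRI.match(iri).groups()
    if not authority:
        return iri  # nothing to do in this sketch without an authority
    if not scheme:
        scheme = "http"                  # default scheme
    if not path:
        path = "/"                       # default path
    scheme = scheme.translate(_ASCII_LOWER)
    userinfo, at, hostport = authority.rpartition("@")
    host, port = _HOSTPORT.match(hostport).groups()
    host = host.translate(_ASCII_LOWER)  # ASCII lower case only
    host = host.rstrip(".")              # trailing . characters removed
    if not host.isascii():
        host = host.encode("idna").decode("ascii")  # stand-in for ToASCII
    if port and port != DEFAULT_PORTS.get(scheme):
        host += ":" + port               # keep only non-default ports
    return f"{scheme}://{userinfo}{at}{host}{path}{rest}"
```

Lower-casing and removing the trailing dot before the IDN step keeps the 
input to ToASCII as simple as possible, which matches the ordering 
argued for in this mail.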

The table of examples is fine *except* that the IDN for the sigma 
example should be xn--sigma-kde.example.org not 
http://xn--sigma-g1e.example.org (unless both the VeriSign tool [2] and 
LibIDN are wrong!)

I built a standalone IRI canonicalisation tool at [3] to do all this 
(source code is available). You can just get the answer or get it to 
tell you what it's doing at each step if you select the verbose output. 
You can see an example including sigma and François at [4].

Now to put this into practice by matching against some real (really 
awkward) data. For this purpose I created some examples at [5]. The 
first two files contain François and Céline in the data both as UTF-8 
characters and their percent-encoded versions. Either should match a 
suitable IRI, which they do: [6].

Now, to achieve this in my own application I did the canonicalisation 
early and stored the data in the MySQL database already canonicalised. 
(Getting MySQL to work in UTF-8 is a pain, and I had to decode the 
%triples in the XML file *before* parsing it to stop Perl mucking it up, 
but hey, that's late night coding frustration for you!) Andrea's POWDER 
Processor works differently so I'm going to be interested to see how he 
handles it but, following the earlier logic, I end up wanting to amend 
section 2.1.5 slightly thus:

* If not already so encoded, the strings are converted into a sequence 
of Unicode characters. [No change]

* With the exception of *spaces and* the reserved characters defined in 
Section 2.2 of RFC 3986 [URIS], percent encoded triples are converted 
into the characters they represent. [slight change]

[Delete line about the query string]

* If the data relates to the host, trailing . characters are removed. 
[Same, but higher in the sequence]

* If the data relates to the scheme *or host*, it is normalized to ASCII 
lower case. [Add in mention of host, it's easier to convert ASCII to 
lower case before the IDN stuff!]

* If the data relates to the host, and does not completely consist of 
ASCII characters, the ToASCII operation is applied as described in 
Section 2.1.3 [No change]

* Any values given for the IRI constraints includepathstartswith, 
excludepathstartswith, includeexactpaths or excludeexactpaths must begin 
with the / character which is pre-pended if absent. [No change]
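
The component-specific steps for the data in a POWDER doc can be 
sketched as below. The percent decoding and Unicode steps are left out 
to keep it short. The component labels 'scheme', 'host' and 'path' are 
hypothetical stand-ins for whichever IRI constraint the value came from, 
and Python's built-in "idna" codec is again assumed as a stand-in for 
the ToASCII operation.

```python
import string

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def canonicalise_value(component, value):
    """component is a hypothetical label: 'scheme', 'host' or 'path'."""
    if component == "host":
        value = value.rstrip(".")              # trailing . characters removed
        value = value.translate(_ASCII_LOWER)  # ASCII lower case
        if not value.isascii():
            value = value.encode("idna").decode("ascii")  # stand-in for ToASCII
    elif component == "scheme":
        value = value.translate(_ASCII_LOWER)
    elif component == "path":
        if not value.startswith("/"):
            value = "/" + value                # pre-pend / if absent
    return value
```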

Feel free to play with the processor at [7].

Next steps:
1. Create updated version of the doc and hope that since we've flagged 
this so heavily we're not going to be forced back into LC again! i.e. we 
can use these relatively small changes directly in the PR version.
2. Add the canonicalisation data to the test suite
3. Check it all works for Andrea too.


[1] http://www.w3.org/TR/2009/WD-powder-grouping-20090403/#canon
[2] http://mct.verisign-grs.com/
[3] http://i-sieve.com/cgi-bin/canon.cgi
[4] http://tinyurl.com/cyerr4
[5] http://www.w3.org/2007/powder/Group/powder-test/tests/canon_tests/
[6] http://tinyurl.com/djuzfm
[7] http://www.i-sieve.com/cgi-bin/processor.cgi






Phil Archer wrote:
> OK, just before I shut down for a family break over Easter I've managed 
> to make some progress with this. The problem with the normalisation to 
> Form C was due to the string already being in that form. Re-applying the 
> normalisation to a string that is already NFC caused a problem, at least 
> with the Perl library I am using [1].
> 
> Thankfully that module includes a check routine so the normalisation 
> only runs if it's needed. I really could do with finding a string that 
> isn't already in NFC to test this out.
> 
> With that problem fixed the standalone canonicalisation tool [2] now 
> seems to be running smoothly. Following this, and with full 
> acknowledgement that Thomas is well within his rights to say "I told you 
> so" the changes to the canonicalisation routine need to be:
> 
> 1. Forget the + to spaces (see original mail)
> 2. Forget the 'except the query string' - same reason
> 3. Switch the order of removing any trailing . characters from the host 
> and applying the ToASCII function.
> 
> I need to document this before people spend time looking at this. 
> Tuesday... (family life really is taking over here, need to stop).
> 
> Cheers
> 
> Phil.
> 
> [1] http://search.cpan.org/~sadahiro/Unicode-Normalize-1.02/Normalize.pm
> 
> Phil Archer wrote:
>> Thomas,
>>
>> Having got the docs published on Friday and all the e-mails sent 
>> today, I turned to revising my implementation of the canonicalisation 
>> steps (http://i-sieve.com/cgi-bin/canon.cgi). My short term aim is to 
>> create a standalone tool that demonstrates the canonicalisation steps 
>> for candidate IRIs and POWDER doc data so we can play around, create 
>> some test data etc.
>>
>> OK, so first things first. I think I now see where my confusion has 
>> been wrt. %decoding and +/space transliteration - it's all in the 
>> form/cgi stuff. The form parsing script I use includes these lines:
>>
>> tr/+/ /;
>> s/%(..)/pack("c",hex($1))/ge;
>>
>> which does the transliteration and then the %decoding, which is 
>> correct and that's why it's in the doc and so firmly fixed in my head. 
>> However... I see where you're concerned and correct too - all that 
>> does is to make sure that the encoding that the browser does is 
>> reversed so if I put in
>>
>> http://example.com/staff/Fran%c3%a7ois
>>
>> I get out...
>>
>> http://example.com/staff/Fran%c3%a7ois
>>
>> Hmmm... not the desired outcome. It's necessary _after_ the initial 
>> form decoding to _then_ do a second round of % decoding to get to what 
>> we actually want which is the c cedilla in the name.
>>
>> OK, so in terms of the spec, I can see that the line in section 
>> 2.1.4.1 that says:
>>
>> Percent encoded triples are converted into the characters they 
>> represent... is correct and should stay. But
>>
>> + characters in the query string are converted to spaces
>>
>> is not correct and should go (an IRI with me%20+%20you in the query 
>> string would map to one with 3 spaces which is wrong).
>>
>> Right, moving on through the canonicalisation steps.
>>
>> I found the Perl module that does the LibIDN stuff and got the i-sieve 
>> hosting company to install that successfully. So that something like 
>> this:
>>
>> €ürö.example.com.?me+you=them,finally=this
>>
>> is canonicalised properly to
>>
>> http://xn--r-1gaq1653a.example.com/?me+you=them,finally=this
>>
>> Whoopee!
>>
>> Aha... but I missed something. I had to get the hosting company to 
>> install a library to normalise the string to Form C as well and that 
>> took a little longer. OK, now it is in place so these work with or 
>> without normalisation to Form C:
>>
>> http://example.com/staff/Fran%c3%a7ois
>>
>> http://example.com/my%20doc.doc
>>
>> http://www.example.com/foo/his%2Fhers
>>
>> See for yourself at http://i-sieve.com/cgi-bin/canon.cgi
>>
>> Now try
>>
>> €ürö.example.com.?me+you=them,finally=this
>>
>> That works too, right? That's because I've made the normalisation to 
>> Form C optional. Switch it on and the ToASCII function fails (verbose 
>> output is switched on in case of an error).
>>
>> Now, it would help enormously if I could test whether the Form C thing 
>> is working properly or whether I need to do something more to the code 
>> to make it work properly.
>>
>> Something you can help with perhaps please?
>>
>> Phil.
>>
> 

-- 

Phil Archer
http://philarcher.org/www@20/

i-sieve technologies                |      W3C Mobile Web Initiative
Making Sense of the Buzz            |      www.w3.org/Mobile
Received on Thursday, 16 April 2009 09:07:52 UTC