Re: Results of Canonicalization implementation and proposed changes (was Re: Canonicalization implementation) from Phil Archer on 2009-04-27 (public-powderwg@w3.org from April 2009)

From: Phil Archer <phil@philarcher.org>
Date: Mon, 27 Apr 2009 15:42:52 +0100
To: Public POWDER <public-powderwg@w3.org>
CC: Thomas Roessler <tlr@w3.org>
Message-ID: <49F5C46C.5020603@philarcher.org>

Hi all,

Thomas Roessler and I were both in Madrid last week and we spoke about 
this. I also spoke to Richard Ishida about it (he's W3C's 
internationalisation lead) and all in all, I'm pretty confident that, 
with their help, the canonicalisation section is now done.

One small caveat on that. Contrary to my earlier actions, it _does_ 
matter that case folding comes after the ToASCII step, so I've 
simply moved that down the sequence a little in both documentation and 
implementation (xn--sigma-kde and XN--sigma-KDE should, I understand, 
be treated as equivalent).

The IRI steps now are:

1. Convert to Unicode if not already done.
2. Decode percent triples excluding spaces and reserved characters.
3. Normalise to Form C.

4. default to http
5. Path of / if no path given
6. Remove trailing . characters from host
7. Perform ToASCII if necessary
8. Scheme and host to ASCII lower case
9. Default port removed.
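
For anyone who wants to see steps 4 to 9 in one place, here is a minimal 
Python sketch. It is not the code behind the canon.cgi tool, and it makes 
some assumptions of its own: the name canonicalise_iri and the 
DEFAULT_PORTS table are made up for illustration, the input is taken to 
have been through steps 1 to 3 already, userinfo and IPv6 literals are 
not handled, and Python's built-in "idna" codec stands in for ToASCII 
(its flags can't be set, so it may not behave identically to LibIDN).

```python
import re

# Assumed default-port table, for illustration only; extend as needed.
DEFAULT_PORTS = {"http": "80", "https": "443"}

# Optional "scheme://", then the authority, then everything else.
_IRI = re.compile(r"(?:([A-Za-z][A-Za-z0-9+.-]*)://)?([^/?#]*)(.*)", re.S)
# Host, optionally followed by ":port".
_AUTHORITY = re.compile(r"(.*?)(?::([0-9]*))?", re.S)


def canonicalise_iri(iri):
    """Hypothetical helper: apply steps 4-9 to an already-decoded IRI."""
    scheme, authority, rest = _IRI.fullmatch(iri).groups()
    host, port = _AUTHORITY.fullmatch(authority).groups()

    scheme = scheme or "http"                # 4. default to http
    if not rest.startswith("/"):             # 5. path of / if no path given
        rest = "/" + rest
    host = host.rstrip(".")                  # 6. remove trailing . characters
    if not host.isascii():                   # 7. ToASCII if necessary
        host = host.encode("idna").decode("ascii")
    scheme = scheme.lower()                  # 8. lower-case scheme and host,
    host = host.lower()                      #    only after ToASCII
    if port == DEFAULT_PORTS.get(scheme):    # 9. drop the default port
        port = None
    return scheme + "://" + host + (":" + port if port else "") + rest
```

With this ordering, sigmaσ.example.ORg../foo.page.html comes out as 
http://xn--sigma-kde.example.org/foo.page.html, and a host already given 
as XN--sigma-KDE.example.org ends up with the same lower-case label. 
Stripping the trailing dots before the ToASCII call also matters here, 
since the codec rejects empty labels inside a host name.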

The data steps are:

1. Convert to Unicode if not already done.
2. Decode percent triples excluding spaces and reserved characters.
3. Normalise to Form C
4. Remove trailing . characters from host
5. Perform ToASCII if necessary
6. Add / to relevant path-related constraints.
7. Port handling unchanged
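
Steps 2, 3 and 6 of the data list can be sketched in Python as follows. 
This is an illustration under stated assumptions rather than the 
processor's actual code: decode_percent_triples and 
canonicalise_path_constraint are hypothetical names, percent triples are 
assumed to encode UTF-8, and a run that is not valid UTF-8 is simply left 
alone.

```python
import re
import unicodedata

# Reserved characters (gen-delims and sub-delims), RFC 3986 section 2.2.
RESERVED = ":/?#[]@!$&'()*+,;="
KEEP_ENCODED = set(RESERVED) | {" "}

# One or more consecutive percent triples; hex digits are case-insensitive.
_TRIPLE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_percent_triples(text):
    """Step 2: decode percent triples, except spaces and reserved characters."""
    def _decode_run(match):
        run = match.group(0)
        raw = bytes.fromhex(run.replace("%", ""))
        try:
            chars = raw.decode("utf-8")      # assumption: triples are UTF-8
        except UnicodeDecodeError:
            return run                       # leave undecodable runs untouched
        # Spaces and reserved characters are written back as triples
        # (always with upper-case hex); everything else becomes a literal.
        return "".join(
            "%{:02X}".format(ord(ch)) if ch in KEEP_ENCODED else ch
            for ch in chars
        )
    return _TRIPLE_RUN.sub(_decode_run, text)


def canonicalise_path_constraint(value):
    """Steps 2, 3 and 6 for a path-related constraint value."""
    value = decode_percent_triples(value)        # 2. decode percent triples
    value = unicodedata.normalize("NFC", value)  # 3. normalise to Form C
    if not value.startswith("/"):                # 6. prepend / if absent
        value = "/" + value
    return value
```

So Staff/Fran%c3%a7ois becomes /Staff/François, while my%20doc.doc and 
his%2Fhers keep their triples. Note that this literal reading of step 2 
also turns %25 into a bare %, which the steps above don't say anything 
about.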

As promised the other day, I've added some examples to the Test Suite 
[1]. There's a bunch of stuff about this at the end of the 
implementation report [2] too.

I've also been playing with the validator. I'd got my knickers in a bit 
of a twist there too with all this encoding and decoding lark. But it's 
sorted now. And as proof, I invite you to click link [3].

This takes a POWDER file that includes a non-ASCII character in the 
domain name quoted in both the abouthosts and includehosts elements [4] 
and puts it through the POWDER to POWDER-BASE script which returns 
RDF/XML... which is then fed into the processor as data with a candidate 
IRI of sigmaσ.example.ORg../foo.page.html - i.e. one with loads of 'issues.'

Would I be pointing you to it if it didn't work?

Grouping doc with all this in is at [5].

Cheers

Phil.

N.B. Most of these links are member only.
[1] http://www.w3.org/2007/powder/Group/powder-test/20090428.html
[2] http://www.w3.org/2007/powder/Group/features.html#canonTests
[3] http://tinyurl.com/d4hkun
[4] http://www.w3.org/2007/powder/Group/powder-test/tests/canon_tests/match006a.xml
[5] http://www.w3.org/2007/powder/Group/powder-grouping/20090428.html


Phil Archer wrote:
> Hi all,
> 
> I'm maintaining the e-mail thread here but this mail is something of a 
> departure as it ends up with proposed changes to the canonicalisation 
> sections of the Grouping Doc.
> 
> I've been working on the various issues raised by Thomas Roessler 
> regarding canonicalisation. This has been useful and does, I dare to 
> hope, lead to something sensible. First some clearing up.
> 
> The current spec[1] confuses transmission of data as the output of a 
> form and what you actually end up with after you've completed the form 
> processing. This is the origin of the following statements:
> 
> * Percent encoded triples are converted into the characters they 
> represent (e.g. %c3%a7 becomes ç etc.). Note the hexadecimal digits are 
> case-insensitive. However, *except for the query string*, reserved 
> characters as per Section 6.2.2.2 of RFC 3986 [URIS] must not be 
> converted to literals, as that may invalidate the URI/IRI - the reason a 
> URI would contain (for example) %2F instead of / would be to distinguish 
> between a literal /, such as in 'his/hers', and the / which is used as a 
> path separator.
> 
> * + characters in the query string are converted to spaces
> 
> This is wrong.
> 
> Processing of form data is quite separate and defined elsewhere. If you 
> enter "http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg" as (GET) form 
> data, what gets transmitted is 
> "http%3A%2F%2Fexample.com%2FStaff%2FFran%25c3%25a7ois%2Fmugshot.jpg". 
> Note that the c cedilla is actually transmitted as %25c3%25a7, i.e. the 
> % character is itself percent encoded (%25). Once you've got this back 
> out of a standard form processing script you are, well, back to what you 
> started with, which is http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg.
> 
> Now, we _do_ want to process this, but all the stuff about query strings 
> and + characters being transliterated into spaces is misplaced.
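
The round trip described here is easy to reproduce. A minimal Python 
sketch, using the standard urllib.parse functions as a stand-in for what 
a browser and a form processing script do (that stand-in is an 
assumption; real browsers and scripts may differ in detail):

```python
from urllib.parse import quote_plus, unquote_plus

# What the user types into the (GET) form field.
typed = "http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg"

# Form encoding: every reserved character is escaped, including the %
# of the existing triples, so %c3%a7 goes over the wire as %25c3%25a7.
transmitted = quote_plus(typed)

# Form decoding (+ to space, then percent decoding) only undoes that
# layer; the original triples are still there afterwards.
received = unquote_plus(transmitted)

# Applying the same form decoding to an IRI is a different matter:
# both the %20 triples and the literal + turn into spaces.
mangled = unquote_plus("me%20+%20you")

print(transmitted)
print(received == typed)
print(repr(mangled))
```

The last line shows why the + rule doesn't belong in the 
canonicalisation steps: me%20+%20you ends up as "me", three spaces, 
"you".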
> 
> And, here's another error, the reference to the URI spec is incorrect. 
> Reserved characters are defined in section 2.2 of RFC 3986, not 6.2.2.2.
> 
> But... we also need NOT to convert %20 into spaces since to do so 
> creates an invalid URI. Our current examples table has this:
> 
> http://example.com/my%20doc.doc
> canonicalised to
> http://example.com/my doc.doc
> 
> That can't be right, so I think we should also say that spaces should 
> be left alone. It's probably possible to expand this into white 
> space in general but I think that's going beyond the call of duty.
> 
> I therefore propose to remove the reference to XFORMS from the intro to 
> 2.1.4 and change the opening lines of section 2.1.4.1 to:
> 
> * If not already so encoded, the IRI character string is converted into 
> a sequence of Unicode [UNICODE] characters.
> 
> * Percent encoded triples are converted into the characters they 
> represent (e.g. %c3%a7 becomes ç etc.). Note the hexadecimal digits are 
> case-insensitive. However space characters (%20) and reserved characters 
> as per Section 2.2 of RFC 3986 [URIS] must not be converted to literals, 
> as that may invalidate the URI/IRI - the reason a URI would contain (for 
> example) %2F instead of / would be to distinguish between a literal /, 
> such as in 'his/hers', and the / which is used as a path separator.
> 
> Drop the line about changing + into spaces completely.
> 
> Normalise to Form C is fine.
> 
> 2.1.4.2 Default values and case folding
> Is fine, although I did them in a different order, so I'd rearrange 
> it a little to say:
> 
> * Where the authority is present, but the scheme is absent, the scheme 
> should default to http.
> 
> * If the Path is absent, a path of / is appended.
> 
> * Change the scheme *and host* string to ASCII lowercase.
> 
> * Trailing . characters in the host are removed, i.e. 
> http://www.example.com./ becomes http://www.example.com/
> 
> * If the host string does not completely consist of ASCII characters, 
> apply the ToASCII operation to the host string, with the 
> UseSTD3ASCIIRules flag unset and the AllowUnassigned flag set [RFC 
> 3490]. Note that behavior if the ToASCII operation fails is undefined.
> 
> * If the port is specified, but it is the default port for the scheme, 
> it is removed.
> 
> The table of examples is fine *except* that the IDN for the sigma 
> example should be xn--sigma-kde.example.org not 
> http://xn--sigma-g1e.example.org (unless both the VeriSign tool [2] and 
> LibIDN are wrong!)
> 
> I built a standalone IRI canonicalisation tool at [3] to do all this 
> (source code is available). You can just get the answer or get it to 
> tell you what it's doing at each step if you select the verbose output. 
> You can see an example including sigma and François at [4].
> 
> And so to put this into practice matching against some real (really 
> awkward) data. For this purpose I created some examples at [5]. The 
> first two files contain François and Céline in the data both as utf8 
> characters and their %encoded versions. Either should match a suitable 
> IRI, which they do: [6].
> 
> Now to achieve this in my own application I did the canonicalisation 
> stuff early and stored the data in the MySQL database already 
> canonicalised (getting MySQL to work in UTF 8 is a pain and I had to 
> decode the %triples in the XML file *before* parsing it to stop Perl 
> mucking it up but hey, that's late night coding frustration for you!). 
> Andrea's POWDER Processor works differently so I'm going to be 
> interested to see how he handles it but following the earlier logic, I 
> end up wanting to amend section 2.1.5 slightly thus:
> 
> * If not already so encoded, the strings are converted into a sequence 
> of Unicode characters. [No change]
> 
> * With the exception of *spaces and* the reserved characters defined in 
> Section 2.2 of RFC 3986 [URIS], percent encoded triples are converted 
> into the characters they represent. [slight change]
> 
> [Delete line about the query string]
> 
> * If the data relates to the host, trailing . characters are removed. 
> [Same, but higher in the sequence]
> 
> * If the data relates to the scheme *or host*, it is normalized to ASCII 
> lower case. [Add in mention of host, it's easier to convert ASCII to 
> lower case before the IDN stuff!]
> 
> * If the data relates to the host, and does not completely consist of 
> ASCII characters, the ToASCII operation is applied as described in 
> Section 2.1.3 [No change]
> 
> * Any values given for the IRI constraints includepathstartswith, 
> excludepathstartswith, includeexactpaths or excludeexactpaths must begin 
> with the / character which is pre-pended if absent. [No change]
> 
> Feel free to play with the processor at [7].
> 
> Next steps:
> 1. Create updated version of the doc and hope that since we've flagged 
> this so heavily we're not going to be forced back into LC again! i.e. we 
> can use these relatively small changes directly in the PR version.
> 2. Add the canonicalisation data to the test suite
> 3. Check it all works for Andrea too.
> 
> 
> [1] http://www.w3.org/TR/2009/WD-powder-grouping-20090403/#canon
> [2] http://mct.verisign-grs.com/
> [3] http://i-sieve.com/cgi-bin/canon.cgi
> [4] http://tinyurl.com/cyerr4
> [5] http://www.w3.org/2007/powder/Group/powder-test/tests/canon_tests/
> [6] http://tinyurl.com/djuzfm
> [7] http://www.i-sieve.com/cgi-bin/processor.cgi
> 
> 
> 
> 
> 
> 
> Phil Archer wrote:
>> OK, just before I shut down for a family break over Easter I've 
>> managed to make some progress with this. The problem with the 
>> normalisation to Form C was due to the string already being in that 
>> form. Re-applying the normalisation to a string that is already NFC 
>> caused a problem, at least with the Perl library I am using [1].
>>
>> Thankfully that module includes a check routine so the 
>> normalisation only runs if it's needed. I really could do with finding 
>> a string that isn't already in NFC to test this out.
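
One way to get a test string that is not already in NFC is to write a 
base letter followed by a combining mark. A minimal Python sketch of the 
check-then-normalise approach (to_nfc is a hypothetical name; the 
standard unicodedata module is used here instead of the Perl library, 
and is_normalized needs Python 3.8 or later):

```python
import unicodedata


def to_nfc(text):
    """Normalise to Form C only when the string is not already NFC."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


# 'c' followed by U+0327 COMBINING CEDILLA: same appearance as the
# precomposed U+00E7, but one code point longer and not in NFC.
decomposed = "Franc\u0327ois"
composed = to_nfc(decomposed)

print(unicodedata.is_normalized("NFC", decomposed))  # not in NFC yet
print(len(decomposed), len(composed))                # 9 code points vs 8
```

After normalisation the string matches the precomposed spelling, so it 
should give the same result as Fran%c3%a7ois once that has been percent 
decoded.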
>>
>> With that problem fixed the standalone canonicalisation tool [2] now 
>> seems to be running smoothly. Following this, and with full 
>> acknowledgement that Thomas is well within his rights to say "I told 
>> you so" the changes to the canonicalisation routine need to be:
>>
>> 1. Forget the + to spaces (see original mail)
>> 2. Forget the 'except the query string' - same reason
>> 3. Switch the order of removing any trailing . characters from the 
>> host and applying the ToASCII function.
>>
>> I need to document this before people spend time looking at this. 
>> Tuesday... (family life really is taking over here, need to stop).
>>
>> Cheers
>>
>> Phil.
>>
>> [1] http://search.cpan.org/~sadahiro/Unicode-Normalize-1.02/Normalize.pm
>>
>> Phil Archer wrote:
>>> Thomas,
>>>
>>> Having got the docs published on Friday and all the e-mails sent 
>>> today, I turned to revising my implementation of the canonicalisation 
>>> steps (http://i-sieve.com/cgi-bin/canon.cgi). My short term aim is to 
>>> create a standalone tool that demonstrates the canonicalisation steps 
>>> for candidate IRIs and POWDER doc data so we can play around, create 
>>> some test data etc.
>>>
>>> OK, so first things first. I think I now see where my confusion has 
>>> been wrt. %decoding and +/space transliteration - it's all in the 
>>> form/cgi stuff. The form parsing script I use includes these lines:
>>>
>>> tr/+/ /;
>>> s/%(..)/pack("c",hex($1))/ge;
>>>
>>> which does the transliteration and then the %decoding, which is 
>>> correct and that's why it's in the doc and so firmly fixed in my 
>>> head. However... I see where you're concerned and correct too - all 
>>> that does is to make sure that the encoding that the browser does is 
>>> reversed so if I put in
>>>
>>> http://example.com/staff/Fran%c3%a7ois
>>>
>>> I get out...
>>>
>>> http://example.com/staff/Fran%c3%a7ois
>>>
>>> Hmmm... not the desired outcome. It's necessary _after_ the initial 
>>> form decoding to _then_ do a second round of % decoding to get to 
>>> what we actually want which is the c cedilla in the name.
>>>
>>> OK, so in terms of the spec, I can see that the line in section 
>>> 2.1.4.1 that says:
>>>
>>> Percent encoded triples are converted into the characters they 
>>> represent... is correct and should stay. But
>>>
>>> + characters in the query string are converted to spaces
>>>
>>> is not correct and should go (an IRI with me%20+%20you in the query 
>>> string would map to one with 3 spaces which is wrong).
>>>
>>> Right, moving on through the canonicalisation steps.
>>>
>>> I found the Perl module that does the LibIDN stuff and got the 
>>> i-sieve hosting company to install that successfully. So 
>>> something like this:
>>>
>>> €ürö.example.com.?me+you=them,finally=this
>>>
>>> is canonicalised properly to
>>>
>>> http://xn--r-1gaq1653a.example.com/?me+you=them,finally=this
>>>
>>> Whoopee!
>>>
>>> Aha... but I missed something. I had to get the hosting company to 
>>> install a library to normalise the string to Form C as well and that 
>>> took a little longer. OK, now it is in place so these work with or 
>>> without normalisation to Form C:
>>>
>>> http://example.com/staff/Fran%c3%a7ois
>>>
>>> http://example.com/my%20doc.doc
>>>
>>> http://www.example.com/foo/his%2Fhers
>>>
>>> See for yourself at http://i-sieve.com/cgi-bin/canon.cgi
>>>
>>> Now try
>>>
>>> €ürö.example.com.?me+you=them,finally=this
>>>
>>> That works too, right? That's because I've made the normalisation to 
>>> Form C optional. Switch it on and the ToASCII function fails (verbose 
>>> output is switched on in case of an error).
>>>
>>> Now, it would help enormously if I could test whether the Form C 
>>> thing is working properly or whether I need to do something more to 
>>> the code to make it work properly.
>>>
>>> Something you can help with perhaps please?
>>>
>>> Phil.
>>>
>>
> 

-- 

Phil Archer
http://philarcher.org/www@20/

i-sieve technologies                |      W3C Mobile Web Initiative
Making Sense of the Buzz            |      www.w3.org/Mobile
Received on Monday, 27 April 2009 14:44:25 UTC