Re: Results of Canonicalization implementation and proposed changes (was Re: Canonicalization implementation) from Phil Archer on 2009-04-17 (public-powderwg@w3.org from April 2009)

From: Phil Archer <phil@philarcher.org>
Date: Fri, 17 Apr 2009 08:37:01 +0100
To: "Smith, Kevin, (R&D) VF-Group" <Kevin.Smith@vodafone.com>
CC: Public POWDER <public-powderwg@w3.org>
Message-ID: <49E8319D.1040601@philarcher.org>
Hi Kev,

Sorry, I just realised I'd not acknowledged this. The correction will be 
made.

Thanks.

Phil.

Smith, Kevin, (R&D) VF-Group wrote:
> Hi Phil,
> 
> In section 2.3 of [1], just below example 2.6 it says:
> 
> "In addition, the < (less than) character MUST always be escaped since it could be mistaken for the beginning of the closing <includeregex> tag."
> 
> ...and then underneath '&' is listed as one of the characters that should be escaped. However, according to the XML spec [2], '&' must always be escaped:
> 
> "The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively."
> 
> 
> ...so maybe remove '&' from the 'SHOULD be escaped' list in [1] and have:
> 
> "In addition, the < (less than) and & (ampersand) characters MUST always be escaped to &lt; and &amp; as per the XML specification [2]."
> 
> Cheers
> Kev
> 
> [1] http://www.w3.org/TR/2009/WD-powder-grouping-20090403/#canon
> [2] http://www.w3.org/TR/2008/REC-xml-20081126/#syntax
> 
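To illustrate the escaping rule Kev proposes above, here is a minimal Python sketch. The function name includeregex_element and the sample pattern are made up for the example, and namespaces are ignored; only the standard library's xml.sax.saxutils.escape and ElementTree are used.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def includeregex_element(pattern):
    """Wrap a regular expression in an <includeregex> element.

    escape() replaces & with &amp;, < with &lt; and > with &gt;,
    so the two characters that XML forbids in literal form are covered.
    """
    return "<includeregex>" + escape(pattern) + "</includeregex>"


# Hypothetical pattern containing both problem characters.
pattern = r"^/search\?a=1&b<2$"
xml_text = includeregex_element(pattern)
print(xml_text)

# An XML parser hands back the original, unescaped pattern.
assert ET.fromstring(xml_text).text == pattern
```

Without the escaping step the same string is not well-formed XML and the parser rejects it.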
> 
> -----Original Message-----
> From: public-powderwg-request@w3.org [mailto:public-powderwg-request@w3.org] On Behalf Of Phil Archer
> Sent: 16 April 2009 10:08
> To: Public POWDER
> Cc: Thomas Roessler
> Subject: Results of Canonicalization implementation and proposed changes (was Re: Canonicalization implementation)
> 
> Hi all,
> 
> I'm maintaining the e-mail thread here but this mail is something of a departure as it ends up with proposed changes to the canonicalisation sections of the Grouping Doc.
> 
> I've been working on the various issues raised by Thomas Roessler regarding canonicalisation. This has been useful and does, I dare to hope, lead to something sensible. First some clearing up.
> 
> The current spec[1] confuses transmission of data as the output of a form and what you actually end up with after you've completed the form processing. This is the origin of the following statements:
> 
> * Percent encoded triples are converted into the characters they represent (e.g. %c3%a7 becomes ç etc.). Note the hexadecimal digits are case-insensitive. However, *except for the query string*, reserved characters as per Section 6.2.2.2 of RFC 3986 [URIS] must not be converted to literals, as that may invalidate the URI/IRI - the reason a URI would contain (for example) %2F instead of / would be to distinguish between a literal /, such as in 'his/hers', and the / which is used as a path separator.
> 
> * + characters in the query string are converted to spaces
> 
> This is wrong.
> 
> Processing of form data is quite separate and defined elsewhere. If you enter "http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg" as (GET) form data, what gets transmitted is "http%3A%2F%2Fexample.com%2FStaff%2FFran%25c3%25a7ois%2Fmugshot.jpg". Note that the c cedilla is actually transmitted as %25c3%25a7, i.e. the % character is itself percent encoded (%25). Once you've got this back out of a standard form processing script you are back to what you started with, which is http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg.
> 
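The round trip described above can be reproduced with Python's urllib.parse, used here only as a stand-in for a browser and a form processing script (that substitution is an assumption; the variable names are made up for the example).

```python
from urllib.parse import quote, unquote

entered = "http://example.com/Staff/Fran%c3%a7ois/mugshot.jpg"

# What a browser sends for a GET form field: every reserved character,
# including the % signs already present, is percent encoded.
transmitted = quote(entered, safe="")
print(transmitted)

# Standard form processing undoes exactly that one layer of encoding.
received = unquote(transmitted)
print(received)

# Only a second round of decoding turns the triples into the c cedilla.
decoded = unquote(received)
print(decoded)
```

The first unquote() gives back the string that was typed in, triples and all; the second is the step that belongs to canonicalisation rather than to form handling.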
> Now, we _do_ want to process this, but all the stuff about query strings and + characters being transliterated into spaces is misplaced.
> 
> And, here's another error, the reference to the URI spec is incorrect. 
> Reserved characters are defined in section 2.2 of RFC 3986, not 6.2.2.2.
> 
> But... we also need NOT to convert %20 into spaces since to do so creates an invalid URI. Our current examples table has this:
> 
> http://example.com/my%20doc.doc
> canonicalised to
> http://example.com/my doc.doc
> 
> That can't be right, so I think we should also say that spaces should also be left alone. It's probably possible to expand this into white space in general but I think that's going beyond the call of duty.
> 
> I therefore propose to remove the reference to XFORMS from the intro to
> 2.1.4 and change the opening lines of section 2.1.4.1 to:
> 
> * If not already so encoded, the IRI character string is converted into a sequence of Unicode [UNICODE] characters.
> 
> * Percent encoded triples are converted into the characters they represent (e.g. %c3%a7 becomes ç etc.). Note the hexadecimal digits are case-insensitive. However space characters (%20) and reserved characters as per Section 2.2 of RFC 3986 [URIS] must not be converted to literals, as that may invalidate the URI/IRI - the reason a URI would contain (for example) %2F instead of / would be to distinguish between a literal /, such as in 'his/hers', and the / which is used as a path separator.
> 
> Drop the line about changing + into spaces completely.
> 
> Normalise to Form C is fine.
> 
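A rough Python sketch of the proposed 2.1.4.1 wording follows: decode percent encoded triples except spaces and the RFC 3986 section 2.2 reserved characters, then normalise to Form C. The function names are hypothetical, and treating the triples as UTF-8 and leaving undecodable sequences untouched are assumptions of the sketch, not something the proposed text spells out.

```python
import re
import unicodedata

# gen-delims and sub-delims from RFC 3986 section 2.2, plus the space.
RESERVED = set(":/?#[]@!$&'()*+,;=")
KEEP_ENCODED = RESERVED | {" "}

TRIPLE = re.compile(r"%[0-9A-Fa-f]{2}")
RUN_OF_TRIPLES = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    """Decode one run of consecutive triples, keeping protected ones."""
    out = []
    pending = []  # triples for non-ASCII bytes, decoded together as UTF-8

    def flush():
        if pending:
            raw = bytes(int(t[1:], 16) for t in pending)
            try:
                out.append(raw.decode("utf-8"))
            except UnicodeDecodeError:
                out.append("".join(pending))  # assumption: leave as found
            pending.clear()

    for triple in TRIPLE.findall(match.group(0)):
        byte = int(triple[1:], 16)  # int() accepts upper and lower case hex
        if byte >= 0x80:
            pending.append(triple)
            continue
        flush()
        char = chr(byte)
        out.append(triple if char in KEEP_ENCODED else char)
    flush()
    return "".join(out)


def decode_percent_triples(iri):
    """Percent-decode as proposed, then normalise to Unicode Form C."""
    return unicodedata.normalize("NFC", RUN_OF_TRIPLES.sub(_decode_run, iri))


for example in (
    "http://example.com/staff/Fran%c3%a7ois",
    "http://example.com/my%20doc.doc",
    "http://www.example.com/foo/his%2Fhers",
):
    print(decode_percent_triples(example))
```

With this reading the second and third examples come out unchanged, and only the first is altered.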
> 2.1.4.2 Default values and case folding
> Is fine, although I did the steps in a different order, so I'd rearrange it a little to say:
> 
> * Where the authority is present, but the scheme is absent, the scheme should default to http.
> 
> * If the Path is absent, a path of / is appended.
> 
> * Change the scheme *and host* string to ASCII lowercase.
> 
> * Trailing . characters in the host are removed, i.e. 
> http://www.example.com./ becomes http://www.example.com/
> 
> * If the host string does not completely consist of ASCII characters, apply the ToASCII operation to the host string, with the UseSTD3ASCIIRules flag unset and the AllowUnassigned flag set [RFC 3490]. Note that behavior if the ToASCII operation fails is undefined.
> 
> * If the port is specified, but it is the default port for the scheme, it is removed.
> 
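The six steps listed above, in that order, might look like this in Python. It is a sketch under several assumptions: the function name and the DEFAULT_PORTS table are made up, userinfo and IPv6 literals are not handled, and Python's built-in "idna" codec stands in for LibIDN. That codec's documentation describes it as assuming UseSTD3ASCIIRules false and AllowUnassigned true, which appears to match the flags named in the text.

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumption: schemes of interest
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
HAS_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def canonicalise_defaults(iri):
    # 1. Authority present but scheme absent: default to http.
    if not HAS_SCHEME.match(iri):
        iri = "http://" + iri
    parts = urlsplit(iri)

    # 2. An absent path becomes "/".
    path = parts.path or "/"

    # 3. Scheme and host to ASCII lower case (non-ASCII letters untouched).
    scheme = parts.scheme.translate(ASCII_LOWER)
    host, _, port = parts.netloc.partition(":")
    host = host.translate(ASCII_LOWER)

    # 4. Trailing . characters in the host are removed.
    host = host.rstrip(".")

    # 5. ToASCII, only when the host is not already pure ASCII.
    if not host.isascii():
        host = host.encode("idna").decode("ascii")

    # 6. Keep the port only when it is not the scheme's default.
    if port and port != DEFAULT_PORTS.get(scheme):
        host = host + ":" + port

    return urlunsplit((scheme, host, path, parts.query, parts.fragment))


print(canonicalise_defaults("www.example.com./"))
print(canonicalise_defaults("HTTP://WWW.Example.COM.:80"))
print(canonicalise_defaults("http://bücher.example:8080/a?b=c"))
```

Doing the lower-casing and dot removal before ToASCII keeps the IDN step working on a clean host string, which is the ordering argued for in this thread.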
> The table of examples is fine *except* that the IDN for the sigma example should be xn--sigma-kde.example.org not http://xn--sigma-g1e.example.org (unless both the VeriSign tool [2] and LibIDN are wrong!)
> 
> I built a standalone IRI canonicalisation tool at [3] to do all this (source code is available). You can just get the answer or get it to tell you what it's doing at each step if you select the verbose output. 
> You can see an example including sigma and François at [4].
> 
> And so to put this into practice, matching against some real (really awkward) data. For this purpose I created some examples at [5]. The first two files contain François and Céline in the data both as UTF-8 characters and their %encoded versions. Either should match a suitable IRI, which they do: [6].
> 
> Now to achieve this in my own application I did the canonicalisation stuff early and stored the data in the MySQL database already canonicalised (getting MySQL to work in UTF-8 is a pain and I had to decode the %triples in the XML file *before* parsing it to stop Perl mucking it up but hey, that's late night coding frustration for you!).
> Andrea's POWDER Processor works differently so I'm going to be interested to see how he handles it but following the earlier logic, I end up wanting to amend section 2.1.5 slightly thus:
> 
> * If not already so encoded, the strings are converted into a sequence of Unicode characters. [No change]
> 
> * With the exception of *spaces and* the reserved characters defined in Section 2.2 of RFC 3986 [URIS], percent encoded triples are converted into the characters they represent. [slight change]
> 
> [Delete line about the query string]
> 
> * If the data relates to the host, trailing . characters are removed. 
> [Same, but higher in the sequence]
> 
> * If the data relates to the scheme *or host*, it is normalized to ASCII lower case. [Add in mention of host, it's easier to convert ASCII to lower case before the IDN stuff!]
> 
> * If the data relates to the host, and does not completely consist of ASCII characters, the ToASCII operation is applied as described in Section 2.1.3 [No change]
> 
> * Any values given for the IRI constraints includepathstartswith, excludepathstartswith, includeexactpaths or excludeexactpaths must begin with the / character which is pre-pended if absent. [No change]
> 
> Feel free to play with the processor at [7].
> 
> Next steps:
> 1. Create updated version of the doc and hope that since we've flagged this so heavily we're not going to be forced back into LC again! i.e. we can use these relatively small changes directly in the PR version.
> 2. Add the canonicalisation data to the test suite.
> 3. Check it all works for Andrea too.
> 
> 
> [1] http://www.w3.org/TR/2009/WD-powder-grouping-20090403/#canon
> [2] http://mct.verisign-grs.com/
> [3] http://i-sieve.com/cgi-bin/canon.cgi
> [4] http://tinyurl.com/cyerr4
> [5] http://www.w3.org/2007/powder/Group/powder-test/tests/canon_tests/
> [6] http://tinyurl.com/djuzfm
> [7] http://www.i-sieve.com/cgi-bin/processor.cgi
> 
> 
> Phil Archer wrote:
>> OK, just before I shut down for a family break over Easter I've 
>> managed to make some progress with this. The problem with the 
>> normalisation to Form C was due to the string already being in that 
>> form. Re-applying the normalisation to a string that is already NFC 
>> caused a problem, at least with the Perl library I am using [1].
>>
>> Thankfully that module includes a check routine so the 
>> normalisation only runs if it's needed. I really could do with finding 
>> a string that isn't already in NFC to test this out.
>>
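For a test string that is not already in NFC, a decomposed spelling (a plain c followed by U+0327 COMBINING CEDILLA) should do. The Python sketch below builds one and applies the check-then-normalise pattern described above; the function name is made up, and unicodedata.is_normalized needs Python 3.8 or later.

```python
import unicodedata
from urllib.parse import quote


def nfc_if_needed(text):
    """Normalise to Form C only when the text is not already NFC."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


# "c" + combining cedilla: renders like the precomposed form but is not NFC.
decomposed = "Franc\u0327ois"
composed = nfc_if_needed(decomposed)

print(unicodedata.is_normalized("NFC", decomposed))
print(len(decomposed), len(composed))

# Percent encoded form of the decomposed string, usable as IRI test data.
print(quote(decomposed))
```

The percent encoded form printed at the end can be pasted into a path, e.g. after http://example.com/staff/, to exercise the Form C step.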
>> With that problem fixed the standalone canonicalisation tool [2] now 
>> seems to be running smoothly. Following this, and with full 
>> acknowledgement that Thomas is well within his rights to say "I told 
>> you so" the changes to the canonicalisation routine need to be:
>>
>> 1. Forget the + to spaces (see original mail).
>> 2. Forget the 'except the query string' - same reason.
>> 3. Switch the order of removing any trailing . characters from the host 
>> and applying the ToASCII function.
>>
>> I need to document this before people spend time looking at this. 
>> Tuesday... (family life really is taking over here, need to stop).
>>
>> Cheers
>>
>> Phil.
>>
>> [1] 
>> http://search.cpan.org/~sadahiro/Unicode-Normalize-1.02/Normalize.pm
>>
>> Phil Archer wrote:
>>> Thomas,
>>>
>>> Having got the docs published on Friday and all the e-mails sent 
>>> today, I turned to revising my implementation of the canonicalisation 
>>> steps (http://i-sieve.com/cgi-bin/canon.cgi). My short term aim is to 
>>> create a standalone tool that demonstrates the canonicalisation steps 
>>> for candidate IRIs and POWDER doc data so we can play around, create 
>>> some test data etc.
>>>
>>> OK, so first things first. I think I now see where my confusion has 
>>> been wrt. %decoding and +/space transliteration - it's all in the 
>>> form/cgi stuff. The form parsing script I use includes these lines:
>>>
>>> tr/+/ /;
>>> s/%(..)/pack("c",hex($1))/ge;
>>>
>>> which does the transliteration and then the %decoding, which is 
>>> correct and that's why it's in the doc and so firmly fixed in my head.
>>> However... I see where you're concerned and correct too - all that 
>>> does is to make sure that the encoding that the browser does is 
>>> reversed so if I put in
>>>
>>> http://example.com/staff/Fran%c3%a7ois
>>>
>>> I get out...
>>>
>>> http://example.com/staff/Fran%c3%a7ois
>>>
>>> Hmmm... not the desired outcome. It's necessary _after_ the initial 
>>> form decoding to _then_ do a second round of % decoding to get to 
>>> what we actually want which is the c cedilla in the name.
>>>
>>> OK, so in terms of the spec, I can see that the line in section
>>> 2.1.4.1 that says:
>>>
>>> Percent encoded triples are converted into the characters they 
>>> represent... is correct and should stay. But
>>>
>>> + characters in the query string are converted to spaces
>>>
>>> is not correct and should go (an IRI with me%20+%20you in the query 
>>> string would map to one with 3 spaces which is wrong).
>>>
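The three-spaces problem is easy to see with Python's urllib.parse, where unquote_plus() behaves like the two Perl lines quoted earlier (transliterate + to space, then %decode) and unquote() does the %decoding alone. Using these two functions as stand-ins is an assumption of the example.

```python
from urllib.parse import unquote, unquote_plus

query_value = "me%20+%20you"

# Form-style decoding: + becomes a space, then the triples are decoded.
form_style = unquote_plus(query_value)

# Plain percent decoding leaves the literal + alone.
plain = unquote(query_value)

print(repr(form_style))
print(repr(plain))
```

The form-style result has lost the + entirely, so two different IRIs would collapse into one.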
>>> Right, moving on through the canonicalisation steps.
>>>
>>> I found the Perl module that does the LibIDN stuff and got the 
>>> i-sieve hosting company to install that successfully, so that 
>>> something like this:
>>>
>>> €ürö.example.com.?me+you=them,finally=this
>>>
>>> is canonicalised properly to
>>>
>>> http://xn--r-1gaq1653a.example.com/?me+you=them,finally=this
>>>
>>> Whoopee!
>>>
>>> Aha... but I missed something. I had to get the hosting company to 
>>> install a library to normalise the string to Form C as well and that 
>>> took a little longer. OK, now it is in place so these work with or 
>>> without normalisation to Form C:
>>>
>>> http://example.com/staff/Fran%c3%a7ois
>>>
>>> http://example.com/my%20doc.doc
>>>
>>> http://www.example.com/foo/his%2Fhers
>>>
>>> See for yourself at http://i-sieve.com/cgi-bin/canon.cgi
>>>
>>> Now try
>>>
>>> €ürö.example.com.?me+you=them,finally=this
>>>
>>> That works too, right? That's because I've made the normalisation to 
>>> Form C optional. Switch it on and the ToASCII function fails (verbose 
>>> output is switched on in case of an error).
>>>
>>> Now, it would help enormously if I could test whether the Form C 
>>> thing is working properly or whether I need to do something more to 
>>> the code to make it work properly.
>>>
>>> Something you can help with perhaps please?
>>>
>>> Phil.
>>>
> 

-- 

Phil Archer
http://philarcher.org/www@20/

i-sieve technologies                |      W3C Mobile Web Initiative
Making Sense of the Buzz            |      www.w3.org/Mobile
Received on Friday, 17 April 2009 07:37:06 UTC