Re: comment: powder grouping handling of IRIs... from Phil Archer on 2009-04-06 (public-i18n-core@w3.org from April to June 2009)

From: Phil Archer <phil@philarcher.org>
Date: Mon, 06 Apr 2009 17:37:29 +0100
To: "Phillips, Addison" <addison@amazon.com>
CC: Thomas Roessler <tlr@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Public POWDER <public-powderwg@w3.org>
Message-ID: <49DA2FC9.40701@philarcher.org>
Addison, and the wider i18n community,

As you may have seen, the POWDER WG has gone back to LC following 
further review of the Canonicalisation section of the Grouping of 
Resources document [1]. The steps we discussed in this thread previously 
have been included but now we're saying that IRIs should be compared 
after running the ToASCII function (following input from Thomas 
Roessler, hence he's on cc).

I've begun building a standalone tool to implement this, rather than 
keeping it inside the POWDER Processor where it tends to get buried in 
all the nice test results ;-). End result: I need your help again, 
please.

The tool at http://i-sieve.com/cgi-bin/canon.cgi uses a couple of key 
Perl libraries: Net::LibIDN [2] and Unicode::Normalize 1.02 [3] (I know 
there's a 1.03, but 1.02 is the only version available on my company 
hosting package).

My basic problem is that if I include the Normalize to Form C operation, 
the ToASCII function fails on this input string (and similar ones):

€ürö.example.com.?me+you=them,finally=this

If I omit that stage, it works as it should. Now... this doesn't mean I 
think it's wrong, just that fixing this is beyond my ability. The test 
tool makes the application of normalisation optional so it's easy to 
play with this a little.
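
For anyone who wants to compare behaviour outside the Perl tool, here is 
a minimal sketch of the same two stages (Normalize to Form C, then 
ToASCII) in Python. It is an illustration only, not the code behind 
canon.cgi: the function name canonical_host and the simple split at "?" 
are assumptions made for the example, and Python's built-in "idna" codec 
(IDNA 2003) stands in for Net::LibIDN, so results may differ from 
libidn's.

```python
import unicodedata


def canonical_host(iri_without_scheme: str) -> str:
    """Normalise the host to Form C, then apply ToASCII to the host only.

    Hypothetical helper: the input is assumed to be "host?query" with no
    scheme or path, as in the test string from this message. A real tool
    would use a full IRI parser instead of a split at "?".
    """
    # Keep the query out of ToASCII; only the host is an IDN.
    host, sep, rest = iri_without_scheme.partition("?")

    # Stage 1: Unicode Normalization Form C on the host.
    host = unicodedata.normalize("NFC", host)

    # Stage 2: ToASCII, label by label, via the standard-library codec.
    # The codec also applies nameprep (case folding, NFKC) to each label.
    ascii_host = host.encode("idna").decode("ascii")

    return ascii_host + sep + rest


# Precomposed and decomposed spellings of the same host should agree.
print(canonical_host("bücher.example.com"))
print(canonical_host("bu\u0308cher.example.com"))
print(canonical_host("€ürö.example.com.?me+you=them,finally=this"))
```

Running the two stages separately like this makes it easier to see 
which stage rejects a given input, and whether the string handed to 
ToASCII is still a character string rather than raw bytes.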

I would be most grateful for any further help you're able to offer.

Thank you.

Phil.

[1] http://www.w3.org/TR/2009/WD-powder-grouping-20090403/#idnCanon to 
section 2.1.5
[2] http://search.cpan.org/~thor/Net-LibIDN-0.12/_LibIDN.pm
[3] http://search.cpan.org/~sadahiro/Unicode-Normalize-1.02/Normalize.pm

Phillips, Addison wrote:
> Hi Phil,
> 
> Thanks for the response. Some personal comments follow.
> 
> Addison
> 
> Addison Phillips
> Globalization Architect -- Lab126
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
>> Many thanks to you and the i18n WG for taking the time and trouble to
>> look at our document. The problem of IRI canonicalisation was raised
>> by Thomas Roessler [1] and Eric P [2] following earlier drafts with
>> further comments from others - all very welcome. If one thing was
>> clear after those discussions it's that IRI canonicalisation is a
>> difficult thing and touches on issues well beyond the scope and
>> expertise of the POWDER WG.
> 
> It may not be quite *that* difficult, but it needs to be better documented. Individual specs like POWDER shouldn't have to invent it each time.
> 
>> Well, we do say just above the bullet point you quote that "If not
>> already so encoded, the IRI/URI character string is converted into
>> a sequence of bytes using the UTF-8 encoding." 
> 
> I... saw that and thought "oh, good, UTF-8", but on second thought... Are you sure you mean "sequence of bytes" here? Maybe you should say "sequence of Unicode characters" ([sic] code points) instead. The particular Unicode encoding used to encode the characters is a matter for the implementation, and the regular expression stuff works just as well, if not better, with characters.
> 
>>> The document also fails to mention a normalization step to ensure
>>> that the IRI is in some Unicode normalization form. If
>>> percent-escapes are decoded, we theorize that the proper thing to do
>>> would be to normalize to Form C before parsing into tokens. This
>>> would help ensure that tokens are 'include-normalized' (although it
>>> would not guarantee that fact).
>>
>> OK, tokenisation refers to the data not the IRI - I'll come to that.
> 
> Yes, but you extract IRI components in this section. That's really what I meant.
> 
>>> We also note that there are several mentions in this section of
>>> mapping host parts to lowercase. Casefolding is applied to IDNA
>>> names, but it is not as simple an operation as for ASCII domain
>>> names.
>>
>> OK, I'm obviously trying to make sure that we don't say anything that
>> is incorrect or ambiguous so we need to do more here. At present, the
>> whole section begins thus:
>>
>> "Before any IRI or URI matching can take place the following
>> canonicalization steps should be applied to the candidate resource's
>> IRI or URI. These steps are consistent with RFC3986 [URIS], RFC3987
>> [IRIS], URISpace [URISpace] and XForms [XFORMS]."
>>
>> Would this be more appropriate:
>>
>> Before any IRI matching can take place the candidate resource's IRI
>> should be Fully Normalized to Form C, as defined in Character Model
> 
> s/Fully//
> 
>> for
>> the World Wide Web 1.0: Normalization [CHARMOD] using utf-8. The
> 
> s/using utf-8//
> 
>> following further steps should then be carried out which are
>> consistent with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace]
>> and XForms [XFORMS].
>>
>> AND modify the line about schemes and hosts to say
>>
>> The scheme and host are case insensitive but the canonical form of
>> both *(for ascii characters)* is lower case. Therefore *ascii
>> characters* in these components in the candidate URI/IRI are
>> normalized to lower case.
> 
> The non-ASCII characters are also normalized to lowercase (where applicable) during STRINGPREP before Punycode is applied. However, Punycode can encode *any* Unicode character sequence, not just ones that have been stringprepped. 
> 
>>
>> Later, in section 2.1.4 which deals with data encoding, we begin by
>> saying
>>
>> "If not already so encoded, the strings are converted into a sequence
>> of bytes using the UTF-8 encoding."
>>
>> Again, we can extend this a little to say that the data should be
>> Fully Normalized to Form C.
> 
> Again, don't say "fully" and probably not UTF-8.
> 
>>> There are other issues related to working with IRIs. As a result of
>>> examining this, we propose to write as soon as practical a guideline
>>> document that will be incorporated into Character Model (in [4])
>>> that will help your group and others to act as a reference for this
>>> sort of complex IRI parsing in the future. We would like to know if
>>> this will help you and how best to coordinate our actions with your
>>> needs in this area.
>>
>> That would certainly be most helpful - passing detail off to the
>> experts is generally a good idea! My worry is one of process. The
>> CHARMOD doc is a working draft dated 2005 - and we're heading for CR
>> this month with Rec expected by year end (when our charter runs out).
> 
> Yes, understood. We don't want to be a blocker. We are proposing to prepare a document that is later incorporated into CHARMOD-NORM so that you have a reference sooner and which I expect we would publish to Note status. In the meantime, we will help you get the text you need in place.
> 
>> In view of the stages along the Rec Track that the documents are
>> currently at, and likely to be at, we may have to refer to the CHARMOD
>> doc and the guideline you're working on as an extra source of useful
>> information?
> 
> Referring to CHARMOD is always good :-). Note that the Fundamentals part of CHARMOD is a REC and has valuable information in it. http://www.w3.org/TR/charmod
> 
> 

-- 

Phil Archer
http://philarcher.org/www@20/

i-sieve technologies                |      W3C Mobile Web Initiative
Making Sense of the Buzz            |      www.w3.org/Mobile
Received on Monday, 6 April 2009 16:38:09 UTC