FW: comment: powder grouping handling of IRIs... from Phillips, Addison on 2008-10-01 (public-i18n-core@w3.org from October to December 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 30 Sep 2008 17:56:05 -0700
To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
CC: "public-powderwg@w3.org" <public-powderwg@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA014C59864A@EX-SEA5-D.ant.amazon.com>

Argh... Outlook destroyed the mail address for the i18n wg list. I apologize for double-posting to the POWDER list, but want to ensure that the message we all refer to appears in both archives.

Addison

-----Original Message-----
From: Phillips, Addison 
Sent: Tuesday, September 30, 2008 5:53 PM
To: 'public-powderwg@w3.org'
Cc: 'pub'
Subject: comment: powder grouping handling of IRIs...

Dear POWDER-WG,

(this is on behalf of the Internationalization Core WG)

During our most recent teleconference, we reviewed your Last Call document [1] on POWDER Grouping of Resources. I am writing as a result of our discussion [2]. We recognize that your last call ended on the 14th and apologize for sending you comments after that date.

The Internationalization WG is particularly concerned about Section 2.1.3, in which IRI canonicalization is handled. This section concerns us, in part, because some corner cases are not handled and because it touches on issues that we may not have documented very well in the past. In particular, it isn't clear whether the IRI text is normalized to any particular Unicode Normalization Form [cf. 3] or at what point in the tokenization process this conversion occurs.

Also, there is a step dealing with percent-encoded values that reads:

--
Percent encoded triples are converted into the characters they represent (e.g. %c3%a7 becomes ç etc.).
--

This presupposes that every percent-encoded sequence represents a UTF-8 byte sequence, which may not be correct. It also does not mention non-shortest form UTF-8 or the encoding of pure byte values (the former is illegal and is a security risk; the latter is permitted by RFC 3987 and exists as a corner case).
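
To make the corner cases concrete, here is a minimal sketch in Python. It is an illustration only, not text from the POWDER draft. The helper name decode_percent_runs is hypothetical, and the policy of leaving undecodable runs percent-encoded is an assumption about what a careful processor might do.

```python
import re

# Matches one maximal run of percent-encoded octets, e.g. "%c3%a7".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_percent_runs(iri):
    """Decode percent-encoded runs only when they are well-formed UTF-8.

    Hypothetical helper for illustration. A run that is not valid UTF-8
    (a bare byte value, or a non-shortest form) is left untouched.
    """
    def replace(match):
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            # The strict UTF-8 decoder rejects overlong (non-shortest
            # form) sequences as well as stray bytes.
            return raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return match.group(0)

    return _PCT_RUN.sub(replace, iri)


print(decode_percent_runs("fran%c3%a7ais"))  # valid UTF-8, decoded
print(decode_percent_runs("%c0%af"))         # overlong "/", left as-is
print(decode_percent_runs("%E7"))            # bare byte value, left as-is
```

This sketch treats each run as all-or-nothing and would also decode percent-encoded reserved characters such as %2F, which a real processor would probably need to leave alone.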

The document also fails to mention a normalization step to ensure that the IRI is in some Unicode normalization form. If percent-escapes are decoded, we theorize that the proper thing to do would be to normalize to Form C before parsing into tokens. This would help ensure that tokens are 'include-normalized' (although it would not guarantee that fact).
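
As a rough sketch of that ordering (Python, for illustration only): the text is normalized to Form C first and split into tokens afterwards. The function names and the use of "/" as the separator are assumptions, not taken from the POWDER draft.

```python
import unicodedata


def nfc_tokens(path, sep="/"):
    """Normalize to NFC, then split into non-empty tokens (hypothetical)."""
    normalized = unicodedata.normalize("NFC", path)
    return [token for token in normalized.split(sep) if token]


def starts_with_combining_mark(token):
    """True if the token begins with a combining character."""
    return bool(token) and unicodedata.combining(token[0]) != 0


# 'c' followed by COMBINING CEDILLA composes to U+00E7 before tokenizing.
print(nfc_tokens("/franc\u0327ais/index"))

# NFC alone does not prevent a token from starting with a combining mark.
tokens = nfc_tokens("/a/\u0301b")
print([starts_with_combining_mark(t) for t in tokens])
```

The second call shows why normalizing before tokenizing helps but does not guarantee 'include-normalized' tokens: a token can still begin with a combining mark.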

We also note that there are several mentions in this section of mapping host parts to lowercase. Casefolding is applied to IDNA names, but it is not as simple an operation as for ASCII domain names. 
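
A small Python illustration of the difference follows. It is a sketch only. The helper fold_host is hypothetical, and Python's built-in "idna" codec implements IDNA 2003 (RFC 3490), which may not be the behavior a given specification intends.

```python
def fold_host(host):
    """Hypothetical helper: apply ToASCII per label, then ASCII-lowercase."""
    return host.encode("idna").decode("ascii").lower()


# Simple lowercasing and full case folding disagree for some characters.
print("Stra\u00dfe".lower())     # keeps U+00DF (sharp s)
print("Stra\u00dfe".casefold())  # folds U+00DF to "ss"

# A non-ASCII label is mapped and Punycode-encoded, not merely lowercased.
print(fold_host("B\u00dcCHER.Example"))
```

The point is only that lowercasing the host part is a single, well-defined step for ASCII names, whereas internationalized names go through a mapping and encoding step whose results depend on which IDNA rules are applied.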

There are other issues related to working with IRIs. As a result of examining this, we propose to write, as soon as practical, a guideline document to be incorporated into Character Model (in [4]). It would serve as a reference for your group and others who need to do this sort of complex IRI parsing in the future. We would like to know whether this would help you and how best to coordinate our actions with your needs in this area.

Best Regards (for I18N Core),

Addison

[1] http://www.w3.org/TR/2008/WD-powder-grouping-20080815/#byIRIcomp

[2] http://www.w3.org/2008/09/24-core-minutes.html#action02

[3] http://www.w3.org/TR/charmod-norm

[4] http://www.w3.org/TR/charmod-resid

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization Core WG

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 1 October 2008 00:56:43 UTC