Re: comment: powder grouping handling of IRIs... from Phil Archer on 2009-01-19 (public-i18n-core@w3.org from January to March 2009)

From: Phil Archer <phil@philarcher.org>
Date: Mon, 19 Jan 2009 14:07:20 +0000
To: "Phillips, Addison" <addison@amazon.com>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Public POWDER <public-powderwg@w3.org>
Message-ID: <49748918.4000708@philarcher.org>

Addison,

This is an old thread but I need to pick it up again.

I'm in the middle of preparing for a transition request to PR and, as 
part of my checks, I see that although we did incorporate your changes in 
full, I've not had the decency to let you know this and to check that 
the text is now in accordance with i18n recommendations. So, first of 
all, I apologise sincerely for this oversight and downright rudeness!

Secondly, may I ask you please to take a quick peek at an unofficial, 
not published by the W3C, editors' draft of the relevant section at [1]. 
The aim was to use exactly your words and suggestions. I hope we got it 
right?

Thank you. And, again, sincere apologies for not sending this weeks ago.

Phil.

[1] http://philarcher.org/powder/grouping/20090107.html#canon


Phillips, Addison wrote:
> Hi Phil,
> 
> Thanks for the response. Some personal comments follow.
> 
> Addison
> 
> Addison Phillips
> Globalization Architect -- Lab126
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
>> Many thanks to you and the i18n WG for taking the time and trouble to
>> look at our document. The problem of IRI canonicalisation was raised by
>> Thomas Roessler [1] and Eric P [2] following earlier drafts with further
>> comments from others - all very welcome. If one thing was clear after
>> those discussions it's that IRI canonicalisation is a difficult thing
>> and touches on issues well beyond the scope and expertise of the
>> POWDER WG.
> 
> It may not be quite *that* difficult, but it needs to be better documented. Individual specs like POWDER shouldn't have to invent it each time.
> 
>> Well, we do say just above the bullet point you quote that "If not
>> already so encoded, the IRI/URI character string is converted into
>> a sequence of bytes using the UTF-8 encoding." 
> 
> I saw that and thought "oh, good, UTF-8", but on second thought... Are you sure you mean "sequence of bytes" here? Maybe you should say "sequence of Unicode characters" (that is, code points) instead. The particular Unicode encoding used to encode the characters is a matter for the implementation, and the regular expression stuff works just as well, if not better, with characters.
> 
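To illustrate the bytes-versus-characters point above, here is a minimal Python sketch. It is an illustration only, not text from the POWDER draft; the IRI and the pattern are hypothetical.

```python
import re

# Hypothetical IRI containing one non-ASCII character (U+00E9).
iri = "http://example.org/caf\u00e9"

code_points = list(iri)            # the string as Unicode code points
utf8_bytes = iri.encode("utf-8")   # one possible byte serialization

# "." means "one character" on str, but "one byte" on bytes.
char_match = re.search(r"/caf.$", iri)
byte_match = re.search(rb"/caf.$", utf8_bytes)

print(len(code_points), len(utf8_bytes))
print(char_match is not None, byte_match is not None)
```

The character-level pattern matches, while the same pattern over UTF-8 bytes does not, because U+00E9 occupies two bytes in UTF-8.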
>>> The document also fails to mention a normalization step to ensure
>>> that the IRI is in some Unicode normalization form. If percent-escapes
>>> are decoded, we theorize that the proper thing to do would be to
>>> normalize to Form C before parsing into tokens. This would help ensure
>>> that tokens are 'include-normalized' (although it would not guarantee
>>> that fact).
>>
>> OK, tokenisation refers to the data not the IRI - I'll come to that.
> 
> Yes, but you extract IRI components in this section. That's really what I meant.
> 
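As a rough sketch of the order of operations discussed above (decode percent-escapes, normalize to Form C, then extract components), assuming Python's standard library; the function name and example IRIs are hypothetical, and a real implementation would need to avoid decoding escapes for reserved characters such as %2F:

```python
import unicodedata
from urllib.parse import unquote, urlsplit


def nfc_components(iri: str):
    """Decode percent-escapes, normalize to NFC, then split into components."""
    decoded = unquote(iri)  # interprets %XX sequences as UTF-8 by default
    normalized = unicodedata.normalize("NFC", decoded)
    parts = urlsplit(normalized)
    return parts.scheme, parts.hostname, parts.path


# The same path written two ways: "e" + combining acute (U+0301)
# versus the precomposed character U+00E9.
decomposed = "http://example.org/cafe%CC%81"
precomposed = "http://example.org/caf%C3%A9"

print(nfc_components(decomposed) == nfc_components(precomposed))
```

Without the normalization step the two paths would differ code point by code point, even though they display identically.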
>>> We also note that there are several mentions in this section of
>>> mapping host parts to lowercase. Casefolding is applied to IDNA names,
>>> but it is not as simple an operation as for ASCII domain names.
>>
>> OK, I'm obviously trying to make sure that we don't say anything that is
>> incorrect or ambiguous so we need to do more here. At present, the whole
>> section begins thus:
>>
>> "Before any IRI or URI matching can take place the following
>> canonicalization steps should be applied to the candidate resource's IRI
>> or URI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS],
>> URISpace [URISpace] and XForms [XFORMS]."
>>
>> Would this be more appropriate:
>>
>> Before any IRI matching can take place the candidate resource's IRI
>> should be Fully Normalized to Form C, as defined in Character Model
> 
> s/Fully//
> 
>> for the World Wide Web 1.0: Normalization [CHARMOD] using utf-8. The
> 
> s/using utf-8//
> 
>> following further steps should then be carried out which are consistent
>> with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms
>> [XFORMS].
>>
>> AND modify the line about schemes and hosts to say
>>
>> The scheme and host are case insensitive but the canonical form of both
>> *(for ascii characters)* is lower case. Therefore *ascii characters* in
>> these components in the candidate URI/IRI are normalized to lower case.
> 
> The non-ASCII characters are also normalized to lowercase (where applicable) during STRINGPREP before Punycode is applied. However, Punycode can encode *any* Unicode character sequence, not just ones that have been stringprepped. 
> 
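A small sketch of the distinction drawn above, assuming Python's built-in "idna" codec (which implements IDNA 2003, including the nameprep step) and its raw "punycode" codec; the helper name and host names are hypothetical:

```python
def canonical_host(host: str) -> str:
    """Hypothetical helper: IDNA-encode a host, then lowercase the result.

    The "idna" codec applies nameprep (including case mapping) to non-ASCII
    labels but passes pure-ASCII labels through unchanged, so a final
    lower() is still needed for those.
    """
    return host.encode("idna").decode("ascii").lower()


upper = "B\u00dcCHER.Example"   # non-ASCII label in upper case
lower = "b\u00fccher.example"   # the same name in lower case

# After nameprep and lowercasing, both spellings give the same ASCII host.
print(canonical_host(upper) == canonical_host(lower))

# Raw Punycode does no preparation at all: it will encode either
# spelling, and the two encodings differ.
raw_upper = "B\u00dcCHER".encode("punycode")
raw_lower = "b\u00fccher".encode("punycode")
print(raw_upper != raw_lower)
```

This is why lowercasing only the ASCII characters is not enough for internationalized host names: the case mapping has to happen before Punycode is applied.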
>>
>> Later, in section 2.1.4 which deals with data encoding, we begin by saying
>>
>> "If not already so encoded, the strings are converted into a sequence of
>> bytes using the UTF-8 encoding."
>>
>> Again, we can extend this a little to say that the data should be Fully
>> Normalized to Form C.
> 
> Again, don't say "fully" and probably not UTF-8.
> 
>>> There are other issues related to working with IRIs. As a result of
>>> examining this, we propose to write as soon as practical a guideline
>>> document that will be incorporated into Character Model (in [4]) that
>>> will help your group and others to act as a reference for this sort of
>>> complex IRI parsing in the future. We would like to know if this will
>>> help you and how best to coordinate our actions with your needs in this
>>> area.
>>
>> That would certainly be most helpful - passing detail off to the experts
>> is generally a good idea! My worry is one of process. The CHARMOD doc is
>> a working draft dated 2005 - and we're heading for CR this month with
>> Rec expected by year end (when our charter runs out).
> 
> Yes, understood. We don't want to be a blocker. We are proposing to prepare a document that is later incorporated into CHARMOD-NORM so that you have a reference sooner and which I expect we would publish to Note status. In the meantime, we will help you get the text you need in place.
> 
>> In view of the stages along the Rec Track that the documents are
>> currently at, and likely to be at, we may have to refer to the CHARMOD
>> doc and the guideline you're working on as an extra source of useful
>> information?
> 
> Referring to CHARMOD is always good :-). Note that the Fundamentals part of CHARMOD is a REC and has valuable information in it. http://www.w3.org/TR/charmod
> 
> 

-- 
Phil Archer
w. http://philarcher.org/
Received on Monday, 19 January 2009 14:08:08 UTC