Re: comment: powder grouping handling of IRIs... from Phil Archer on 2009-01-20 (public-i18n-core@w3.org from January to March 2009)

From: Phil Archer <phil@philarcher.org>
Date: Tue, 20 Jan 2009 09:52:14 +0000
To: "Phillips, Addison" <addison@amazon.com>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Public POWDER <public-powderwg@w3.org>
Message-ID: <49759ECE.1090704@philarcher.org>
Thanks very much indeed, Addison.

I've switched things around and tried to reduce the number of 
superfluous mentions of URIs. The results are now at [1].

I have one further question, born of my unfamiliarity with the issues 
under discussion. The opening sentences of the section currently say:

Before any IRI matching can take place the candidate resource's IRI 
should be normalized to Form C, as defined in Character Model for the 
World Wide Web 1.0: Normalization [CHARMOD-NORM]. The following further 
steps should then be carried out...

Which of these is true please:

1. One normalises the string to Form C and then carries out the further 
steps as described (in which case the current text is correct).

2. By carrying out the steps one normalises the string to Form C (in 
which case the current text needs a slight amendment).

I /think/ 2 is correct, but I'm just not sure enough to make the change.
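
To make the two readings concrete, here is a minimal Python sketch. It is 
an illustration only, not text from the draft and not a ruling on which 
reading is right. The function names and the sample path are made up for 
the example; it assumes percent escapes are decoded as UTF-8 and that 
"Form C" means Unicode NFC. It suggests that removing escapes does not by 
itself yield Form C, so normalization looks like a distinct operation.

```python
import unicodedata
from urllib.parse import unquote


def decode_escapes(iri_part: str) -> str:
    """Remove percent escapes, reading the escaped bytes as UTF-8."""
    return unquote(iri_part, encoding="utf-8", errors="strict")


def to_form_c(text: str) -> str:
    """Apply Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


# Hypothetical path: "e" (%65) followed by COMBINING ACUTE ACCENT (%CC%81).
decomposed = decode_escapes("/caf%65%CC%81")

# After decoding alone the string is still decomposed: "e" + U+0301.
# Only the explicit NFC step composes the pair into U+00E9.
composed = to_form_c(decomposed)
```

Here decomposed and composed are different code point sequences that 
render identically, which is why the relative order of the steps matters 
for matching.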

Incidentally, I have added your name to the acknowledgements (so you get 
some of the blame if it's wrong ;-) )

Thanks again

Phil.


[1] http://philarcher.org/powder/grouping/20090120.html#canon
Disclaimer: please note that this is a temporary URI and this is not an 
official W3C publication.

Phillips, Addison wrote:
> Hi Phil, (this is a personal reply not--or at least not-yet--endorsed by the I18N WG)
> 
> Thanks for the note. This looks pretty good. However, I do have some comments.
> 
> 1. The case and path normalization steps occur before the IRI is converted to Unicode, Unicode-normalized, and percent escapes removed. This should be reversed. For example, both %c3%80 and %C3%80 represent the uppercase letter 'À'. I think the intention is to normalize the unescaped characters rather than the escapes. Further, not removing escapes and converting to Unicode may expose security flaws in processing (where escaped values should have been normalized and produce false matches, additional trailing dots/path elements, etc.). You do cover the %2F case, which is good.
> 
> 2. You have a step for case normalizing portions of the IRI (particularly the host). Case normalization is locale-sensitive and is not limited to non-ASCII characters. See [1]. So where it says:
> 
> --
> Therefore characters in these components in the candidate URI/IRI are normalized to lower case where applicable.
> --
> 
> I would recommend that you say:
> 
> --
> Therefore, where applicable, characters in these components in the candidate URI/IRI are normalized to lower case using the default Unicode case mapping.
> --
> 
> 3. Observation: although the text talks about "URI/IRI" consistently as a pairing, the net result of your algorithm is conversion of all URIs to IRIs and the processing is as an IRI after that. It might be useful to acknowledge this. There is no reason to flip back-and-forth between RFCs 3986 and 3987.
> 
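A minimal Python sketch of point 1 above, offered as an illustration 
rather than as the POWDER algorithm. The function names and the sample 
host are hypothetical. It assumes escapes decode as UTF-8 and uses 
Python's str.lower(), which is locale-independent, as a stand-in for the 
default Unicode case mapping mentioned in point 2. Applying NFC before 
the case step is also an assumption of the sketch.

```python
import unicodedata
from urllib.parse import unquote


def case_then_decode(host: str) -> str:
    """Lower-case first, then remove escapes (the order questioned above)."""
    return unquote(host.lower(), encoding="utf-8")


def decode_then_case(host: str) -> str:
    """Remove escapes, normalize to NFC, then lower-case."""
    decoded = unquote(host, encoding="utf-8")
    normalized = unicodedata.normalize("NFC", decoded)
    return normalized.lower()


# %c3%80 and %C3%80 are the same escaped character, U+00C0.
upper_escapes = "%C3%80.EXAMPLE"
lower_escapes = "%c3%80.EXAMPLE"

# Casing the escapes only changes the hex digits, so the decoded
# character stays upper case.
early = case_then_decode(upper_escapes)

# Decoding first lets the case step see the real character.
late = decode_then_case(upper_escapes)
```

With the first order the escaped letter comes out as upper-case U+00C0; 
with the second it comes out as lower-case U+00E0, and both spellings of 
the escape give the same result.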
> 
> This is just a peek on my part. Hope this helps.
> 
> Addison
> 
> [1] http://www.w3.org/International/wiki/Case_folding
> 
> Addison Phillips
> Globalization Architect -- Lab126
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
>> -----Original Message-----
>> From: Phil Archer [mailto:phil@philarcher.org]
>> Sent: Monday, January 19, 2009 6:07 AM
>> To: Phillips, Addison
>> Cc: public-i18n-core@w3.org; Public POWDER
>> Subject: Re: comment: powder grouping handling of IRIs...
>>
>> Addison,
>>
>> This is an old thread but I need to pick it up again.
>>
>> I'm in the middle of preparing for a transition request to PR and, as
>> part of my checks, I see that although we did incorporate your changes
>> in full, I've not had the decency to let you know this and to check
>> that the text is now in accordance with i18n recommendations. So, first
>> of all, I apologise sincerely for this oversight and downright
>> rudeness!
>>
>> Secondly, may I ask you please to take a quick peek at an unofficial,
>> not published by the W3C, editors' draft of the relevant section at
>> [1]. The aim was to use exactly your words and suggestions. I hope we
>> got it right?
>>
>> Thank you. And, again, sincere apologies for not sending this weeks
>> ago.
>> Phil.
>>
>> [1] http://philarcher.org/powder/grouping/20090107.html#canon
>>
>>
>> Phillips, Addison wrote:
>>> Hi Phil,
>>>
>>> Thanks for the response. Some personal comments follow.
>>>
>>> Addison
>>>
>>> Addison Phillips
>>> Globalization Architect -- Lab126
>>>
>>> Internationalization is not a feature.
>>> It is an architecture.
>>>
>>>
>>>> Many thanks to you and the i18n WG for taking the time and trouble
>>>> to look at our document. The problem of IRI canonicalisation was
>>>> raised by Thomas Roessler [1] and Eric P [2] following earlier
>>>> drafts with further comments from others - all very welcome. If one
>>>> thing was clear after those discussions it's that IRI
>>>> canonicalisation is a difficult thing and touches on issues well
>>>> beyond the scope and expertise of the POWDER WG.
>>> It may not be quite *that* difficult, but it needs to be better
>>> documented. Individual specs like POWDER shouldn't have to invent
>>> it each time.
>>>> Well, we do say just above the bullet point you quote that "If not
>>>> already so encoded, the IRI/URI character string is converted into
>>>> a sequence of bytes using the UTF-8 encoding."
>>> I... saw that and thought "oh, good, UTF-8", but on second
>>> thought... Are you sure you mean "sequence of bytes" here? Maybe
>>> you should say "sequence of Unicode characters" ([sic] code points)
>>> instead. The particular Unicode encoding used to encode the
>>> characters is a matter for the implementation and the regular
>>> expression stuff works just as well if not better with characters.
>>>>> The document also fails to mention a normalization step to ensure
>>>>> that the IRI is in some Unicode normalization form. If
>>>>> percent-escapes are decoded, we theorize that the proper thing to
>>>>> do would be to normalize to Form C before parsing into tokens.
>>>>> This would help ensure that tokens are 'include-normalized'
>>>>> (although it would not guarantee that fact).
>>>>
>>>> OK, tokenisation refers to the data not the IRI - I'll come to that.
>>> Yes, but you extract IRI components in this section. That's really
>>> what I meant.
>>>>> We also note that there are several mentions in this section of
>>>>> mapping host parts to lowercase. Casefolding is applied to IDNA
>>>>> names, but it is not as simple an operation as for ASCII domain
>>>>> names.
>>>>
>>>> OK, I'm obviously trying to make sure that we don't say anything
>>>> that is incorrect or ambiguous so we need to do more here. At
>>>> present, the whole section begins thus:
>>>>
>>>> "Before any IRI or URI matching can take place the following
>>>> canonicalization steps should be applied to the candidate
>>>> resource's IRI or URI. These steps are consistent with RFC3986
>>>> [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms [XFORMS]."
>>>>
>>>> Would this be more appropriate:
>>>>
>>>> Before any IRI matching can take place the candidate resource's IRI
>>>> should be Fully Normalized to Form C, as defined in Character Model
>>> s/Fully//
>>>
>>>> for the World Wide Web 1.0: Normalization [CHARMOD] using utf-8. The
>>> s/using utf-8//
>>>
>>>> following further steps should then be carried out which are
>>>> consistent with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace]
>>>> and XForms [XFORMS].
>>>>
>>>> AND modify the line about schemes and hosts to say
>>>>
>>>> The scheme and host are case insensitive but the canonical form of
>>>> both *(for ascii characters)* is lower case. Therefore *ascii
>>>> characters* in these components in the candidate URI/IRI are
>>>> normalized to lower case.
>>> The non-ASCII characters are also normalized to lowercase (where
>>> applicable) during STRINGPREP before Punycode is applied. However,
>>> Punycode can encode *any* Unicode character sequence, not just ones
>>> that have been stringprepped.
>>>> Later, in section 2.1.4 which deals with data encoding, we begin by
>>>> saying
>>>>
>>>> "If not already so encoded, the strings are converted into a
>>>> sequence of bytes using the UTF-8 encoding."
>>>>
>>>> Again, we can extend this a little to say that the data should be
>>>> Fully Normalized to Form C.
>>> Again, don't say "fully" and probably not UTF-8.
>>>
>>>>> There are other issues related to working with IRIs. As a result
>>>>> of examining this, we propose to write as soon as practical a
>>>>> guideline document that will be incorporated into Character Model
>>>>> (in [4]) that will help your group and others to act as a
>>>>> reference for this sort of complex IRI parsing in the future.
>>>>> We would like to know if this will help you and how best to
>>>>> coordinate our actions with your needs in this area.
>>>>
>>>> That would certainly be most helpful - passing detail off to the
>>>> experts is generally a good idea! My worry is one of process. The
>>>> CHARMOD doc is a working draft dated 2005 - and we're heading for
>>>> CR this month with Rec expected by year end (when our charter runs
>>>> out).
>>> Yes, understood. We don't want to be a blocker. We are proposing to
>>> prepare a document that is later incorporated into CHARMOD-NORM so
>>> that you have a reference sooner and which I expect we would
>>> publish to Note status. In the meantime, we will help you get the
>>> text you need in place.
>>>> In view of the stages along the Rec Track that the documents are
>>>> currently at, and likely to be at, we may have to refer to the
>>>> CHARMOD doc and the guideline you're working on as an extra source
>>>> of useful information?
>>> Referring to CHARMOD is always good :-). Note that the Fundamentals
>>> part of CHARMOD is a REC and has valuable information in it.
>>> http://www.w3.org/TR/charmod
>>>
>> --
>> Phil Archer
>> w. http://philarcher.org/

-- 
Phil Archer
w. http://philarcher.org/
Received on Tuesday, 20 January 2009 09:52:58 UTC