Re: comment: powder grouping handling of IRIs... from Phil Archer on 2009-01-21 (public-i18n-core@w3.org from January to March 2009)

From: Phil Archer <phil@philarcher.org>
Date: Wed, 21 Jan 2009 09:50:03 +0000
To: "Phillips, Addison" <addison@amazon.com>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Public POWDER <public-powderwg@w3.org>
Message-ID: <4976EFCB.6090106@philarcher.org>
Thanks once again. Sorry I misunderstood one of your suggestions in the 
previous mail.

I rewrote the introductory sentence, put things in the right order and, 
assuming there are no further misunderstandings on my part, I dare to 
hope that all is now well at 
http://philarcher.org/powder/grouping/20090121.html#canon?

One of my near future tasks is to move the processor I've built to a new 
home. During that manoeuvre I'll try and add in one of the Perl modules 
that Charmod points to.

Your help in this regard is very much appreciated.

Phil.

Phillips, Addison wrote:
> Hi Phil,
> 
> Looking better, but still not quite there.
> 
> Normalization to Form C is an operation that has to take into account the other operations you're doing. I think that the proper sequence of steps should probably be:
> 
> 1. If not already so encoded, convert the IRI to a sequence of Unicode characters.
> 2. Unescape any percent-encoded triples.
> 3. Normalize the string to Unicode Normalization Form C (NFC).
> // the remaining steps remain
> 
> It's important to do it in this order because each of the preceding two steps might produce a non-normalized result. For example, the sequence "%cc%80" encodes the character U+0300 (a combining mark in Unicode). If you unescape this *after* applying Form C, you might end up with a non-normalized character sequence: "E%cc%80" would end up being U+0045 U+0300 instead of the normalized U+00C8 (È).
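
The effect of the ordering can be sketched in Python. This is only an illustration under stated assumptions: both function names are hypothetical, the standard-library `urllib.parse.unquote` and `unicodedata.normalize` stand in for whatever a POWDER processor would actually use, and every escape is decoded, without the special handling of cases such as %2F mentioned elsewhere in this thread.

```python
import unicodedata
from urllib.parse import unquote


def unescape_then_nfc(iri: str) -> str:
    """Recommended order: unescape percent-encoded bytes, then apply NFC."""
    # errors="strict" makes malformed UTF-8 escapes raise instead of
    # being silently replaced (an assumption, not something the draft states).
    unescaped = unquote(iri, encoding="utf-8", errors="strict")
    return unicodedata.normalize("NFC", unescaped)


def nfc_then_unescape(iri: str) -> str:
    """Wrong order, shown only for comparison: NFC first, unescape second."""
    normalized = unicodedata.normalize("NFC", iri)
    return unquote(normalized, encoding="utf-8", errors="strict")


# "E%cc%80" is U+0045 followed by the escaped combining grave accent U+0300.
print([hex(ord(c)) for c in unescape_then_nfc("E%cc%80")])
print([hex(ord(c)) for c in nfc_then_unescape("E%cc%80")])
```

With the recommended order the result is the single precomposed character U+00C8; with the reversed order the combining mark only appears after normalization has already run, so the string stays as U+0045 U+0300.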
> 
>> Incidentally, I have added your name to the acknowledgements (so
>> you get some of the blame if it's wrong ;-) )
> 
> Many thanks: I live for blame.
> 
> Addison
> 
> Addison Phillips
> Globalization Architect -- Lab126
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
>> -----Original Message-----
>> From: Phil Archer [mailto:phil@philarcher.org]
>> Sent: Tuesday, January 20, 2009 1:52 AM
>> To: Phillips, Addison
>> Cc: public-i18n-core@w3.org; Public POWDER
>> Subject: Re: comment: powder grouping handling of IRIs...
>>
>> Thanks very much indeed, Addison.
>>
>> I've switched things around and tried to reduce the number of
>> superfluous mentions of URIs. The results are now at [1].
>>
>> I have one further question, born of my unfamiliarity with the
>> issues
>> under discussion. The opening sentences of the section currently
>> say:
>>
>> Before any IRI matching can take place the candidate resource's IRI
>> should be normalized to Form C, as defined in Character Model for
>> the
>> World Wide Web 1.0: Normalization [CHARMOD-NORM]. The following
>> further
>> steps should then be carried out...
>>
>> Which of these is true please:
>>
>> 1. One normalises the string to Form C and then carries out the
>> further
>> steps as described (in which case the current text is correct).
>>
>> 2. By carrying out the steps one normalises the string to Form C
>> (in
>> which case the current text needs a slight amendment)
>>
>> I /think/ 2 is correct? but I'm just not sure enough to make the
>> change.
>>
>> Incidentally, I have added your name to the acknowledgements (so
>> you get
>> some of the blame if it's wrong ;-) )
>>
>> Thanks again
>>
>> Phil.
>>
>>
>> [1] http://philarcher.org/powder/grouping/20090120.html#canon
>> Disclaimer: please note that this is a temporary URI and this is
>> not an
>> official W3C publication.
>>
>> Phillips, Addison wrote:
>>> Hi Phil, (this is a personal reply not--or at least not-yet--
>> endorsed by the I18N WG)
>>> Thanks for the note. This looks pretty good. However, I do have
>> some comments.
>>> 1. The case and path normalization steps occur before the IRI is
>> converted to Unicode, Unicode-normalized, and percent escapes
>> removed. This should be reversed. For example, both %c3%80 and
>> %C3%80 represent the uppercase letter 'À'. I think the intention is
>> to normalize the unescaped characters rather than the escapes.
>> Further, failing to remove escapes and convert to Unicode first may
>> expose security flaws in processing (where escaped values that should
>> have been normalized produce false matches, additional trailing
>> dots/path elements, etc.). You do cover the %2F case, which is good.
>>> 2. You have a step for case normalizing portions of the IRI
>> (particularly the host). Case normalization is locale-sensitive and
>> is not limited to non-ASCII characters. See [1]. So where it says:
>>> --
>>> Therefore characters in these components in the candidate URI/IRI
>> are normalized to lower case where applicable.
>>> --
>>>
>>> I would recommend that you say:
>>>
>>> --
>>> Therefore, where applicable, characters in these components in
>> the candidate URI/IRI are normalized to lower case using the
>> default Unicode case mapping.
>>> --
>>>
>>> 3. Observation: although the text talks about "URI/IRI"
>> consistently as a pairing, the net result of your algorithm is
>> conversion of all URIs to IRIs and the processing is as an IRI
>> after that. It might be useful to acknowledge this. There is no
>> reason to flip back-and-forth between RFCs 3986 and 3987.
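
Points 1 and 2 above can be illustrated with a short Python sketch. The function name is hypothetical, and Python's `str.lower()` is used here as a stand-in for the default (locale-independent) Unicode case mapping; a real processor would apply this only to the scheme and host components.

```python
from urllib.parse import unquote


def lowercase_after_unescape(component: str) -> str:
    """Unescape first, then lower-case the resulting characters."""
    decoded = unquote(component, encoding="utf-8", errors="strict")
    return decoded.lower()


# Both spellings of the escape decode to the same character, U+00C0.
print(unquote("%c3%80") == unquote("%C3%80"))

# Lower-casing the escaped text only changes the hex digits; the
# character it encodes is still the upper-case U+00C0.
print(hex(ord(unquote("%C3%80".lower()))))

# Lower-casing after unescaping yields U+00E0, the lower-case letter.
print(hex(ord(lowercase_after_unescape("%C3%80"))))
```

This is why the case step belongs after unescaping: applied to the escapes themselves, it never reaches the character they represent.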
>>>
>>> This is just a peek on my part. Hope this helps.
>>>
>>> Addison
>>>
>>> [1] http://www.w3.org/International/wiki/Case_folding
>>>
>>> Addison Phillips
>>> Globalization Architect -- Lab126
>>>
>>> Internationalization is not a feature.
>>> It is an architecture.
>>>
>>>
>>>> -----Original Message-----
>>>> From: Phil Archer [mailto:phil@philarcher.org]
>>>> Sent: Monday, January 19, 2009 6:07 AM
>>>> To: Phillips, Addison
>>>> Cc: public-i18n-core@w3.org; Public POWDER
>>>> Subject: Re: comment: powder grouping handling of IRIs...
>>>>
>>>> Addison,
>>>>
>>>> This is an old thread but I need to pick it up again.
>>>>
>>>> I'm in the middle of preparing for a transition request to PR
>> and,
>>>> as
>>>> part of my checks I see that although we did incorporate your
>>>> changes in
>>>> full, I've not had the decency to let you know this and to check
>>>> that
>>>> the text is now in accordance with i18n recommendations. So,
>> first
>>>> of
>>>> all, I apologise sincerely for this oversight and downright
>>>> rudeness!
>>>>
>>>> Secondly, may I ask you please to take a quick peek at an
>>>> unofficial,
>>>> not published by the W3C, editors' draft of the relevant section
>> at
>>>> [1].
>>>> The aim was to use exactly your words and suggestions. I hope we
>>>> got it
>>>> right?
>>>>
>>>> Thank you. And, again, sincere apologies for not sending this
>> weeks
>>>> ago
>>>>
>>>> Phil.
>>>>
>>>> [1] http://philarcher.org/powder/grouping/20090107.html#canon
>>>>
>>>>
>>>> Phillips, Addison wrote:
>>>>> Hi Phil,
>>>>>
>>>>> Thanks for the response. Some personal comments follow.
>>>>>
>>>>> Addison
>>>>>
>>>>> Addison Phillips
>>>>> Globalization Architect -- Lab126
>>>>>
>>>>> Internationalization is not a feature.
>>>>> It is an architecture.
>>>>>
>>>>>
>>>>>> Many thanks to you and the i18n WG for taking the time and
>>>> trouble
>>>>>> to
>>>>>> look at our document. The problem of IRI canonicalisation was
>>>>>> raised by
>>>>>> Thomas Roessler [1] and Eric P [2] following earlier drafts
>> with
>>>>>> further
>>>>>> comments from others - all very welcome. If one thing was
>> clear
>>>>>> after
>>>>>> those discussions it's that IRI canonicalisation is a
>> difficult
>>>>>> thing
>>>>>> and touches on issues well beyond the scope and expertise of
>> the
>>>>>> POWDER WG.
>>>>> It may not be quite *that* difficult, but it needs to be better
>>>> documented. Individual specs like POWDER shouldn't have to
>> invent
>>>> it each time.
>>>>>> Well, we do say just above the bullet point you quote that "If
>>>> not
>>>>>> already so encoded, the IRI/URI character string is converted
>>>> into
>>>>>> a sequence of bytes using the UTF-8 encoding."
>>>>> I... saw that and thought "oh, good, UTF-8", but on second
>>>> thought...... Are you sure you mean "sequence of bytes" here?
>> Maybe
>>>> you should say "sequence of Unicode characters" ([sic] code
>> points)
>>>> instead. The particular Unicode encoding used to encode the
>>>> characters is a matter for the implementation and the regular
>>>> expression stuff works just as well if not better with
>> characters.
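
A small Python sketch of why regular expressions are easier to reason about over characters than over UTF-8 bytes. The helper names are hypothetical and the pattern is deliberately trivial; it is not taken from the POWDER draft.

```python
import re


def matches_one_char(text: str) -> bool:
    """True if the string is exactly one Unicode code point (not a newline)."""
    return re.fullmatch(r".", text) is not None


def matches_one_byte(raw: bytes) -> bool:
    """True if the byte string is exactly one byte (not a newline)."""
    return re.fullmatch(rb".", raw) is not None


a_grave = "\u00c0"                 # one character
a_grave_utf8 = a_grave.encode("utf-8")  # two bytes: 0xC3 0x80

print(matches_one_char(a_grave))       # the pattern sees one character
print(matches_one_byte(a_grave_utf8))  # the same pattern sees two bytes
```

Over bytes, "." matches half of the encoded character, so a pattern written with characters in mind silently changes meaning.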
>>>>>>> The document also fails to mention a normalization step to
>>>> ensure
>>>>>> that the IRI is in
>>>>>> some Unicode normalization form. If percent-escapes are
>> decoded,
>>>> we
>>>>>> theorize that the proper
>>>>>> thing to do would be to normalize to Form C before parsing
>> into
>>>>>> tokens.
>>>>>> This would help ensure
>>>>>> that tokens are 'include-normalized' (although it would not
>>>>>> guarantee
>>>>>> that fact).
>>>>>>
>>>>>> OK, tokenisation refers to the data not the IRI - I'll come to
>>>> that.
>>>>> Yes, but you extract IRI components in this section. That's
>>>> really what I meant.
>>>>>>> We also note that there are several mentions in this section
>> of
>>>>>> mapping host parts to lowercase.
>>>>>> Casefolding is applied to IDNA names, but it is not as simple
>> an
>>>>>> operation as for ASCII domain names.
>>>>>>
>>>>>> OK, I'm obviously trying to make sure that we don't say
>> anything
>>>>>> that is
>>>>>> incorrect or ambiguous so we need to do more here. At present,
>>>> the
>>>>>> whole
>>>>>> section begins thus:
>>>>>>
>>>>>> "Before any IRI or URI matching can take place the following
>>>>>> canonicalization steps should be applied to the candidate
>>>>>> resource's IRI
>>>>>> or URI. These steps are consistent with RFC3986 [URIS],
>> RFC3987
>>>>>> [IRIS],
>>>>>> URISpace [URISpace] and XForms [XFORMS]."
>>>>>>
>>>>>> Would this be more appropriate:
>>>>>>
>>>>>> Before any IRI matching can take place the candidate
>> resource's
>>>> IRI
>>>>>> should be Fully Normalized to Form C, as defined in Character
>>>> Model
>>>>> s/Fully//
>>>>>
>>>>>> for
>>>>>> the World Wide Web 1.0: Normalization [CHARMOD] using utf-8.
>> The
>>>>> s/using utf-8//
>>>>>
>>>>>> following further steps should then be carried out which are
>>>>>> consistent
>>>>>> with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and
>>>> XForms
>>>>>> [XFORMS].
>>>>>>
>>>>>> AND modify the line about schemes and hosts to say
>>>>>>
>>>>>> The scheme and host are case insensitive but the canonical
>> form
>>>> of
>>>>>> both
>>>>>> *(for ascii characters)* is lower case. Therefore *ascii
>>>>>> characters* in
>>>>>> these components in the candidate URI/IRI are normalized to
>>>> lower
>>>>>> case.
>>>>> The non-ASCII characters are also normalized to lowercase
>> (where
>>>> applicable) during STRINGPREP before Punycode is applied.
>> However,
>>>> Punycode can encode *any* Unicode character sequence, not just
>> ones
>>>> that have been stringprepped.
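
The distinction can be seen with Python's built-in codecs, used here purely as an illustration. The `idna` codec implements IDNA 2003 (nameprep followed by Punycode), while the `punycode` codec applies the raw encoding with no preparation step. The function names are hypothetical.

```python
def idna_host(host: str) -> bytes:
    """Encode a host name with nameprep applied to each non-ASCII label."""
    return host.encode("idna")


def raw_punycode(label: str) -> bytes:
    """Encode a single label with Punycode only, no nameprep."""
    return label.encode("punycode")


upper = "B\u00dcCHER"   # upper-case, with U+00DC
lower = "b\u00fccher"   # lower-case, with U+00FC

# With nameprep, both spellings produce the same ASCII form.
print(idna_host(upper + ".example"), idna_host(lower + ".example"))

# Raw Punycode encodes whatever it is given, so the two labels differ
# and the upper-case form round-trips unchanged.
print(raw_punycode(upper), raw_punycode(lower))
print(raw_punycode(upper).decode("punycode") == upper)
```

So lower-casing of non-ASCII host characters comes from the preparation step, not from Punycode itself.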
>>>>>> Later, in section 2.1.4 which deals with data encoding, we
>> begin
>>>> by
>>>>>> saying
>>>>>>
>>>>>> "If not already so encoded, the strings are converted into a
>>>>>> sequence of
>>>>>> bytes using the UTF-8 encoding."
>>>>>>
>>>>>> Again, we can extend this a little to say that the data should
>>>> be
>>>>>> Fully
>>>>>> Normalized to Form C.
>>>>> Again, don't say "fully" and probably not UTF-8.
>>>>>
>>>>>>> There are other issues related to working with IRIs. As a
>>>> result
>>>>>> of examining
>>>>>> this, we propose to write as soon as practical a guideline
>>>> document
>>>>>> that
>>>>>> will
>>>>>> be incorporated into Character Model (in [4]) that will help
>>>> your
>>>>>> group and
>>>>>> others to act as a reference for this sort of complex IRI
>>>> parsing
>>>>>> in the
>>>>>> future.
>>>>>> We would like to know if this will help you and how best to
>>>>>> coordinate our
>>>>>> actions with your needs in this area.
>>>>>>
>>>>>> That would certainly be most helpful - passing detail off to
>> the
>>>>>> experts
>>>>>> is generally a good idea! My worry is one of process. The
>>>> CHARMOD
>>>>>> doc is
>>>>>>   a working draft dated 2005 - and we're heading for CR this
>>>> month
>>>>>> with
>>>>>> Rec expected by year end (when our charter runs out).
>>>>> Yes, understood. We don't want to be a blocker. We are
>> proposing
>>>> to prepare a document that is later incorporated into CHARMOD-
>> NORM
>>>> so that you have a reference sooner and which I expect we would
>>>> publish to Note status. In the meantime, we will help you get
>> the
>>>> text you need in place.
>>>>>> In view of the stages along the Rec Track that the documents
>> are
>>>>>> currently at, and likely to be at, we may have to refer to the
>>>>>> CHARMOD
>>>>>> doc and the guideline you're working on as an extra source of
>>>>>> useful
>>>>>> information?
>>>>> Referring to CHARMOD is always good :-). Note that the
>>>> Fundamentals part of CHARMOD is a REC and has valuable
>> information
>>>> in it. http://www.w3.org/TR/charmod
>>>> --
>>>> Phil Archer
>>>> w. http://philarcher.org/

-- 
Phil Archer
w. http://philarcher.org/
Received on Wednesday, 21 January 2009 09:50:52 UTC