- From: Phillips, Addison <addison@amazon.com>
- Date: Tue, 20 Jan 2009 15:37:34 -0800
- To: Phil Archer <phil@philarcher.org>
- CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Public POWDER <public-powderwg@w3.org>
Hi Phil,

Looking better, but still not quite there. Normalization to Form C is an
operation that has to take into account the other operations you're doing.
I think that the proper sequence of steps should probably be:

1. If not already so encoded, convert the IRI to a sequence of Unicode
   characters.
2. Unescape any percent-encoded triples.
3. Normalize the string to Unicode Normalization Form C (NFC).
// the remaining steps remain

It's important to do it in this order because each of the preceding two
steps might produce a non-normalized result. For example, the sequence
"%cc%80" encodes the character U+0300 (a combining mark in Unicode). If you
unescape this *after* applying Form C, you might end up with a
non-normalized character sequence. For example, "E%cc%80" would end up
being U+0045 U+0300 instead of the normalized U+00C8 (È).

> Incidentally, I have added your name to the acknowledgements (so
> you get some of the blame if it's wrong ;-) )

Many thanks: I live for blame.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: Phil Archer [mailto:phil@philarcher.org]
> Sent: Tuesday, January 20, 2009 1:52 AM
> To: Phillips, Addison
> Cc: public-i18n-core@w3.org; Public POWDER
> Subject: Re: comment: powder grouping handling of IRIs...
>
> Thanks very much indeed, Addison.
>
> I've switched things around and tried to reduce the number of
> superfluous mentions of URIs. The results are now at [1].
>
> I have one further question, born of my unfamiliarity with the issues
> under discussion. The opening sentences of the section currently say:
>
> Before any IRI matching can take place the candidate resource's IRI
> should be normalized to Form C, as defined in Character Model for the
> World Wide Web 1.0: Normalization [CHARMOD-NORM]. The following further
> steps should then be carried out...
>
> Which of these is true please:
>
> 1. One normalises the string to Form C and then carries out the further
> steps as described (in which case the current text is correct).
>
> 2. By carrying out the steps one normalises the string to Form C (in
> which case the current text needs a slight amendment)
>
> I /think/ 2 is correct? but I'm just not sure enough to make the change.
>
> Incidentally, I have added your name to the acknowledgements (so you get
> some of the blame if it's wrong ;-) )
>
> Thanks again
>
> Phil.
>
>
> [1] http://philarcher.org/powder/grouping/20090120.html#canon
> Disclaimer: please note that this is a temporary URI and this is not an
> official W3C publication.
>
> Phillips, Addison wrote:
> > Hi Phil, (this is a personal reply not--or at least not-yet--endorsed
> > by the I18N WG)
> >
> > Thanks for the note. This looks pretty good. However, I do have some
> > comments.
> >
> > 1. The case and path normalization steps occur before the IRI is
> > converted to Unicode, Unicode-normalized, and percent escapes
> > removed. This should be reversed. For example, both %c3%80 and
> > %C3%80 represent the uppercase letter 'À'. I think the intention is
> > to normalize the unescaped characters rather than the escapes.
> > Further, not removing escapes and converting to Unicode may expose
> > security flaws in processing (where escaped values should have been
> > normalized and produce false matches, additional trailing dots/path
> > elements, etc.). You do cover the %2F case, which is good.
> >
> > 2. You have a step for case normalizing portions of the IRI
> > (particularly the host). Case normalization is locale-sensitive and
> > is not limited to non-ASCII characters. See [1]. So where it says:
> >
> > --
> > Therefore characters in these components in the candidate URI/IRI
> > are normalized to lower case where applicable.
> > --
> >
> > I would recommend that you say:
> >
> > --
> > Therefore, where applicable, characters in these components in
> > the candidate URI/IRI are normalized to lower case using the
> > default Unicode case mapping.
> > --
> >
> > 3. Observation: although the text talks about "URI/IRI"
> > consistently as a pairing, the net result of your algorithm is
> > conversion of all URIs to IRIs and the processing is as an IRI
> > after that. It might be useful to acknowledge this. There is no
> > reason to flip back-and-forth between RFCs 3986 and 3987.
> >
> >
> > This is just a peek on my part. Hope this helps.
> >
> > Addison
> >
> > [1] http://www.w3.org/International/wiki/Case_folding
> >
> > Addison Phillips
> > Globalization Architect -- Lab126
> >
> > Internationalization is not a feature.
> > It is an architecture.
> >
> >
> >> -----Original Message-----
> >> From: Phil Archer [mailto:phil@philarcher.org]
> >> Sent: Monday, January 19, 2009 6:07 AM
> >> To: Phillips, Addison
> >> Cc: public-i18n-core@w3.org; Public POWDER
> >> Subject: Re: comment: powder grouping handling of IRIs...
> >>
> >> Addison,
> >>
> >> This is an old thread but I need to pick it up again.
> >>
> >> I'm in the middle of preparing for a transition request to PR and, as
> >> part of my checks I see that although we did incorporate your changes
> >> in full, I've not had the decency to let you know this and to check
> >> that the text is now in accordance with i18n recommendations. So,
> >> first of all, I apologise sincerely for this oversight and downright
> >> rudeness!
> >>
> >> Secondly, may I ask you please to take a quick peek at an unofficial,
> >> not published by the W3C, editors' draft of the relevant section at
> >> [1]. The aim was to use exactly your words and suggestions. I hope we
> >> got it right?
> >>
> >> Thank you. And, again, sincere apologies for not sending this weeks
> >> ago
> >>
> >> Phil.
> >>
> >> [1] http://philarcher.org/powder/grouping/20090107.html#canon
> >>
> >>
> >> Phillips, Addison wrote:
> >>> Hi Phil,
> >>>
> >>> Thanks for the response. Some personal comments follow.
> >>>
> >>> Addison
> >>>
> >>> Addison Phillips
> >>> Globalization Architect -- Lab126
> >>>
> >>> Internationalization is not a feature.
> >>> It is an architecture.
> >>>
> >>>
> >>>> Many thanks to you and the i18n WG for taking the time and trouble
> >>>> to look at our document. The problem of IRI canonicalisation was
> >>>> raised by Thomas Roessler [1] and Eric P [2] following earlier
> >>>> drafts with further comments from others - all very welcome. If one
> >>>> thing was clear after those discussions it's that IRI
> >>>> canonicalisation is a difficult thing and touches on issues well
> >>>> beyond the scope and expertise of the POWDER WG.
> >>> It may not be quite *that* difficult, but it needs to be better
> >>> documented. Individual specs like POWDER shouldn't have to invent
> >>> it each time.
> >>>> Well, we do say just above the bullet point you quote that "If not
> >>>> already so encoded, the IRI/URI character string is converted into
> >>>> a sequence of bytes using the UTF-8 encoding."
> >>> I... saw that and thought "oh, good, UTF-8", but on second
> >>> thought...... Are you sure you mean "sequence of bytes" here? Maybe
> >>> you should say "sequence of Unicode characters" ([sic] code points)
> >>> instead. The particular Unicode encoding used to encode the
> >>> characters is a matter for the implementation and the regular
> >>> expression stuff works just as well if not better with characters.
> >>>>> The document also fails to mention a normalization step to ensure
> >>>>> that the IRI is in some Unicode normalization form. If
> >>>>> percent-escapes are decoded, we theorize that the proper thing to
> >>>>> do would be to normalize to Form C before parsing into tokens.
> >>>>> This would help ensure that tokens are 'include-normalized'
> >>>>> (although it would not guarantee that fact).
> >>>>
> >>>> OK, tokenisation refers to the data not the IRI - I'll come to that.
> >>> Yes, but you extract IRI components in this section. That's
> >>> really what I meant.
> >>>>> We also note that there are several mentions in this section of
> >>>>> mapping host parts to lowercase. Casefolding is applied to IDNA
> >>>>> names, but it is not as simple an operation as for ASCII domain
> >>>>> names.
> >>>>
> >>>> OK, I'm obviously trying to make sure that we don't say anything
> >>>> that is incorrect or ambiguous so we need to do more here. At
> >>>> present, the whole section begins thus:
> >>>>
> >>>> "Before any IRI or URI matching can take place the following
> >>>> canonicalization steps should be applied to the candidate
> >>>> resource's IRI or URI. These steps are consistent with RFC3986
> >>>> [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms [XFORMS]."
> >>>>
> >>>> Would this be more appropriate:
> >>>>
> >>>> Before any IRI matching can take place the candidate resource's IRI
> >>>> should be Fully Normalized to Form C, as defined in Character Model
> >>> s/Fully//
> >>>
> >>>> for
> >>>> the World Wide Web 1.0: Normalization [CHARMOD] using utf-8. The
> >>> s/using utf-8//
> >>>
> >>>> following further steps should then be carried out which are
> >>>> consistent with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace]
> >>>> and XForms [XFORMS].
> >>>>
> >>>> AND modify the line about schemes and hosts to say
> >>>>
> >>>> The scheme and host are case insensitive but the canonical form of
> >>>> both *(for ascii characters)* is lower case. Therefore *ascii
> >>>> characters* in these components in the candidate URI/IRI are
> >>>> normalized to lower case.
> >>> The non-ASCII characters are also normalized to lowercase (where
> >>> applicable) during STRINGPREP before Punycode is applied. However,
> >>> Punycode can encode *any* Unicode character sequence, not just ones
> >>> that have been stringprepped.
> >>>> Later, in section 2.1.4 which deals with data encoding, we begin by
> >>>> saying
> >>>>
> >>>> "If not already so encoded, the strings are converted into a
> >>>> sequence of bytes using the UTF-8 encoding."
> >>>>
> >>>> Again, we can extend this a little to say that the data should be
> >>>> Fully Normalized to Form C.
> >>> Again, don't say "fully" and probably not UTF-8.
> >>>
> >>>>> There are other issues related to working with IRIs. As a result
> >>>>> of examining this, we propose to write as soon as practical a
> >>>>> guideline document that will be incorporated into Character Model
> >>>>> (in [4]) that will help your group and others to act as a
> >>>>> reference for this sort of complex IRI parsing in the future.
> >>>>> We would like to know if this will help you and how best to
> >>>>> coordinate our actions with your needs in this area.
> >>>>
> >>>> That would certainly be most helpful - passing detail off to the
> >>>> experts is generally a good idea! My worry is one of process. The
> >>>> CHARMOD doc is a working draft dated 2005 - and we're heading for
> >>>> CR this month with Rec expected by year end (when our charter runs
> >>>> out).
> >>> Yes, understood. We don't want to be a blocker. We are proposing
> >>> to prepare a document that is later incorporated into CHARMOD-NORM
> >>> so that you have a reference sooner and which I expect we would
> >>> publish to Note status. In the meantime, we will help you get the
> >>> text you need in place.
> >>>> In view of the stages along the Rec Track that the documents are
> >>>> currently at, and likely to be at, we may have to refer to the
> >>>> CHARMOD doc and the guideline you're working on as an extra source
> >>>> of useful information?
> >>> Referring to CHARMOD is always good :-). Note that the
> >>> Fundamentals part of CHARMOD is a REC and has valuable information
> >>> in it. http://www.w3.org/TR/charmod
> >>>
> >> --
> >> Phil Archer
> >> w. http://philarcher.org/
>
> --
> Phil Archer
> w. http://philarcher.org/
Received on Tuesday, 20 January 2009 23:38:14 UTC
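
Illustrative note on the step ordering discussed at the top of this message: the sketch below is not part of the POWDER text or of any W3C algorithm. It only replays the "E%cc%80" example using Python's standard library (`urllib.parse.unquote` and `unicodedata.normalize`). The function names `unescape_then_nfc`, `nfc_then_unescape`, and `code_points` are hypothetical, and the sketch assumes the percent-escapes are UTF-8 and ignores reserved characters such as %2F, which the thread treats as a separate case.

```python
import unicodedata
from urllib.parse import unquote


def unescape_then_nfc(iri: str) -> str:
    """Order suggested in the message: unescape first, then apply NFC."""
    # Decode percent-encoded triples, interpreting the bytes as UTF-8.
    # Assumption: a real matcher would leave reserved escapes (e.g. %2F) alone.
    unescaped = unquote(iri, encoding="utf-8", errors="strict")
    return unicodedata.normalize("NFC", unescaped)


def nfc_then_unescape(iri: str) -> str:
    """Reversed order, shown only to reproduce the problem described."""
    normalized = unicodedata.normalize("NFC", iri)
    return unquote(normalized, encoding="utf-8", errors="strict")


def code_points(text: str) -> list[str]:
    """Render a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]


sample = "E%cc%80"  # "E" followed by the escaped combining grave accent U+0300

print(code_points(unescape_then_nfc(sample)))  # ['U+00C8']
print(code_points(nfc_then_unescape(sample)))  # ['U+0045', 'U+0300']
```

In the reversed order, NFC sees only the ASCII text "E%cc%80" and has nothing to compose, so the combining mark that appears after unescaping is left uncomposed.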