- From: Phil Archer <parcher@fosi.org>
- Date: Wed, 01 Oct 2008 11:18:10 +0100
- To: public-i18n-core@w3.org
- CC: Public POWDER <public-powderwg@w3.org>, addison@amazon.com
Agh! Now I see what the problem was with the first e-mail Addison sent. Here is my reply again for the record (with apologies for multiple postings).

Addison,

Many thanks to you and the i18n WG for taking the time and trouble to look at our document.

The problem of IRI canonicalisation was raised by Thomas Roessler [1] and Eric P [2] following earlier drafts with further comments from others - all very welcome. If one thing was clear after those discussions it's that IRI canonicalisation is a difficult thing and touches on issues well beyond the scope and expertise of the POWDER WG. As a result of that, we did our best to say in the LC draft that a) this is a complicated issue and b) that precisely how an IRI should be matched is context-dependent (network layer, browser layer etc.). Therefore, section 2.1.3.3 (Further Steps, [3]) is very much more fuzzy than the previous lines.

Let me take your comments line by line and see where we get to.

> During our most recent teleconference, we reviewed your Last Call document [1] on POWDER Grouping of Resources. I am writing as a result of our discussion [2]. We recognize that your last call ended on the 14th and apologize for sending you comments after that date.

No worries, we're still working through comments and yours are very welcome.

> The Internationalization WG is particularly concerned about Section 2.1.3, in which IRI canonicalization is handled. This section concerns us, in part, because there are some corner cases not handled and some issues we hadn't maybe documented very well in the past.

(more likely we've not done as much homework as we should ;-))

> In particular, it isn't clear if the IRI text is normalized to any particular Unicode Normalization Form [cf. 3] and when this conversion occurs in the tokenization process.
>
> Also, there is a step dealing with percent-encoded values that reads:
>
> --
> Percent encoded triples are converted into the characters they represent (e.g. %c3%a7 becomes ç etc.).
> --
>
> This presupposes that all percent-encoded sequences represent a UTF-8 byte sequence, which may not be correct. It also omits mention of non-shortest form UTF-8 or the encoding of pure byte values (the former is illegal and is a security risk, the latter is permitted by RFC 3987 and exists as a corner case).

Well, we do say just above the bullet point you quote that "If not already so encoded, the IRI/URI character string is converted into a sequence of bytes using the UTF-8 encoding." But looking at the documents you've referred us to, it looks as if we need to do more.

> The document also fails to mention a normalization step to ensure that the IRI is in some Unicode normalization form. If percent-escapes are decoded, we theorize that the proper thing to do would be to normalize to Form C before parsing into tokens. This would help ensure that tokens are 'include-normalized' (although it would not guarantee that fact).

OK, tokenisation refers to the data not the IRI - I'll come to that.

> We also note that there are several mentions in this section of mapping host parts to lowercase. Casefolding is applied to IDNA names, but it is not as simple an operation as for ASCII domain names.

OK, I'm obviously trying to make sure that we don't say anything that is incorrect or ambiguous so we need to do more here. At present, the whole section begins thus:

"Before any IRI or URI matching can take place the following canonicalization steps should be applied to the candidate resource's IRI or URI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms [XFORMS]."

Would this be more appropriate:

Before any IRI matching can take place the candidate resource's IRI should be Fully Normalized to Form C, as defined in Character Model for the World Wide Web 1.0: Normalization [CHARMOD] using utf-8.
The following further steps should then be carried out which are consistent with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms [XFORMS].

AND modify the line about schemes and hosts to say

The scheme and host are case insensitive but the canonical form of both *(for ascii characters)* is lower case. Therefore *ascii characters* in these components in the candidate URI/IRI are normalized to lower case.

Later, in section 2.1.4 which deals with data encoding, we begin by saying "If not already so encoded, the strings are converted into a sequence of bytes using the UTF-8 encoding." Again, we can extend this a little to say that the data should be Fully Normalized to Form C.

> There are other issues related to working with IRIs. As a result of examining this, we propose to write as soon as practical a guideline document that will be incorporated into Character Model (in [4]) that will help your group and others to act as a reference for this sort of complex IRI parsing in the future. We would like to know if this will help you and how best to coordinate our actions with your needs in this area.

That would certainly be most helpful - passing detail off to the experts is generally a good idea! My worry is one of process. The CHARMOD doc is a working draft dated 2005 - and we're heading for CR this month with Rec expected by year end (when our charter runs out). In view of the stages along the Rec Track that the documents are currently at, and likely to be at, we may have to refer to the CHARMOD doc and the guideline you're working on as an extra source of useful information?

Phil.

[1] http://lists.w3.org/Archives/Public/public-powderwg/2007Nov/0012.html
[2] http://lists.w3.org/Archives/Public/public-powderwg/2008Feb/0003.html
[3] http://www.w3.org/TR/2008/WD-powder-grouping-20080815/#more-canon

--
Phil Archer
Chief Technical Officer, Family Online Safety Institute
w.
http://www.fosi.org/people/philarcher/

Register now for the annual Family Online Safety Institute Conference and Exhibition, December 11th, 2008, Washington, DC. See http://www.fosi.org/conference2008/
Received on Wednesday, 1 October 2008 10:18:44 UTC
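
Illustrative note (not part of the original message): the steps discussed in this thread can be sketched roughly as follows. This is a minimal sketch and not the POWDER algorithm. The function names (ascii_lower, decode_percent_utf8, canonicalise_iri) are hypothetical. It assumes that percent-encoded triples carry UTF-8, it applies plain NFC rather than CHARMOD's stricter "Fully Normalized" form, and it normalizes after decoding, as the i18n WG theorized. Strict decoding means non-shortest-form UTF-8 and pure byte values such as %ff are rejected here, although RFC 3987 permits the latter, so a real implementation might prefer to leave them encoded.

```python
import unicodedata
from urllib.parse import urlsplit, urlunsplit, unquote_to_bytes


def ascii_lower(text):
    """Lower-case only A-Z; non-ASCII characters are left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def decode_percent_utf8(text):
    """Decode percent-encoded triples, ASSUMING they encode UTF-8.

    The strict decoder raises UnicodeDecodeError for byte sequences that
    are not valid UTF-8, including non-shortest (overlong) forms.
    """
    return unquote_to_bytes(text).decode("utf-8")


def canonicalise_iri(iri):
    """Hypothetical helper: lower-case ASCII in scheme and host, decode
    percent-encoded triples in path and query, then apply NFC."""
    parts = urlsplit(iri)
    # Keep any userinfo as-is; only the host (and port) is lower-cased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + ascii_lower(hostport)
    path = unicodedata.normalize("NFC", decode_percent_utf8(parts.path))
    query = unicodedata.normalize("NFC", decode_percent_utf8(parts.query))
    return urlunsplit(
        (ascii_lower(parts.scheme), netloc, path, query, parts.fragment)
    )


if __name__ == "__main__":
    print(canonicalise_iri("HTTP://Example.ORG/fran%c3%a7ais"))
```

Decoding every triple can also turn an escaped reserved character (for example %2F) into a real delimiter, which is one reason the matching context matters; the sketch does not guard against that.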