- From: Phillips, Addison <addison@amazon.com>
- Date: Wed, 1 Oct 2008 09:06:01 -0700
- To: Phil Archer <parcher@fosi.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
- CC: Public POWDER <public-powderwg@w3.org>
Hi Phil,

Thanks for the response. Some personal comments follow.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

> Many thanks to you and the i18n WG for taking the time and trouble to
> look at our document. The problem of IRI canonicalisation was raised by
> Thomas Roessler [1] and Eric P [2] following earlier drafts with further
> comments from others - all very welcome. If one thing was clear after
> those discussions it's that IRI canonicalisation is a difficult thing
> and touches on issues well beyond the scope and expertise of the
> POWDER WG.

It may not be quite *that* difficult, but it needs to be better documented. Individual specs like POWDER shouldn't have to invent it each time.

> Well, we do say just above the bullet point you quote that "If not
> already so encoded, the IRI/URI character string is converted into
> a sequence of bytes using the UTF-8 encoding."

I... saw that and thought "oh, good, UTF-8", but on second thought...... Are you sure you mean "sequence of bytes" here? Maybe you should say "sequence of Unicode characters" ([sic] code points) instead. The particular Unicode encoding used to encode the characters is a matter for the implementation and the regular expression stuff works just as well if not better with characters.

> > The document also fails to mention a normalization step to ensure
> > that the IRI is in some Unicode normalization form. If percent-escapes
> > are decoded, we theorize that the proper thing to do would be to
> > normalize to Form C before parsing into tokens. This would help ensure
> > that tokens are 'include-normalized' (although it would not guarantee
> > that fact).
>
> OK, tokenisation refers to the data not the IRI - I'll come to that.

Yes, but you extract IRI components in this section. That's really what I meant.
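The two points above (characters rather than bytes, and Form C before extracting components) can be sketched in Python. This is only an illustration under assumptions: the sample IRI and the helper name `decode_and_normalize` are invented here, and decoding every percent-escape in one pass is a simplification, since a real implementation would have to leave escaped reserved characters alone.

```python
import re
import unicodedata
from urllib.parse import unquote, urlsplit


def decode_and_normalize(iri: str) -> str:
    """Hypothetical helper: decode percent-escapes, then normalize to NFC."""
    # unquote() interprets the escaped bytes as UTF-8 by default.
    decoded = unquote(iri)
    return unicodedata.normalize("NFC", decoded)


# "e" followed by an escaped U+0301 COMBINING ACUTE ACCENT (UTF-8: CC 81).
raw = "http://example.org/cafe%CC%81"
normalized = decode_and_normalize(raw)

# After NFC the path ends in the single code point U+00E9.
path = urlsplit(normalized).path

# On characters, "." consumes the whole accented letter...
char_match = re.fullmatch(r"/caf.", path)
# ...but in UTF-8 the same letter is two bytes, so one "." is not enough.
byte_match = re.fullmatch(rb"/caf.", path.encode("utf-8"))
```

The same pattern succeeds on the character string and fails on the UTF-8 bytes, which is why matching is easier to specify in terms of code points.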
> > We also note that there are several mentions in this section of
> > mapping host parts to lowercase. Casefolding is applied to IDNA names,
> > but it is not as simple an operation as for ASCII domain names.
>
> OK, I'm obviously trying to make sure that we don't say anything that is
> incorrect or ambiguous so we need to do more here. At present, the whole
> section begins thus:
>
> "Before any IRI or URI matching can take place the following
> canonicalization steps should be applied to the candidate resource's IRI
> or URI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS],
> URISpace [URISpace] and XForms [XFORMS]."
>
> Would this be more appropriate:
>
> Before any IRI matching can take place the candidate resource's IRI
> should be Fully Normalized to Form C, as defined in Character Model

s/Fully//

> for the World Wide Web 1.0: Normalization [CHARMOD] using utf-8. The

s/using utf-8//

> following further steps should then be carried out which are consistent
> with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms
> [XFORMS].
>
> AND modify the line about schemes and hosts to say
>
> The scheme and host are case insensitive but the canonical form of both
> *(for ascii characters)* is lower case. Therefore *ascii characters* in
> these components in the candidate URI/IRI are normalized to lower
> case.

The non-ASCII characters are also normalized to lowercase (where applicable) during STRINGPREP before Punycode is applied. However, Punycode can encode *any* Unicode character sequence, not just ones that have been stringprepped.

> Later, in section 2.1.4 which deals with data encoding, we begin by
> saying
>
> "If not already so encoded, the strings are converted into a sequence of
> bytes using the UTF-8 encoding."
>
> Again, we can extend this a little to say that the data should be Fully
> Normalized to Form C.

Again, don't say "fully" and probably not UTF-8.
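The stringprep and Punycode distinction can be sketched with Python's built-in `idna` and `punycode` codecs. Assumptions: the built-in `idna` codec follows IDNA 2003 (nameprep, which is a stringprep profile), the host name is an invented example, and `canonical_host` is a hypothetical helper rather than anything from the POWDER text.

```python
def canonical_host(host: str) -> str:
    """Hypothetical helper: IDNA ToASCII per label, then ASCII lowercase."""
    # The codec runs nameprep (which case-folds) on non-ASCII labels only;
    # all-ASCII labels pass through unchanged, so lowercase them explicitly.
    return host.encode("idna").decode("ascii").lower()


mixed = canonical_host("B\u00dcCHER.Example")
lower = canonical_host("b\u00fccher.example")

# Raw Punycode applies no stringprep step, so case differences survive.
raw_upper = "B\u00dcCHER".encode("punycode")
raw_lower = "b\u00fccher".encode("punycode")
```

Both spellings of the host end up as the same ASCII form once nameprep has run, while the two raw Punycode encodings differ, so lowercasing "ascii characters" alone does not describe the whole operation.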
> > There are other issues related to working with IRIs. As a result of
> > examining this, we propose to write as soon as practical a guideline
> > document that will be incorporated into Character Model (in [4]) that
> > will help your group and others to act as a reference for this sort of
> > complex IRI parsing in the future.
> > We would like to know if this will help you and how best to coordinate
> > our actions with your needs in this area.
>
> That would certainly be most helpful - passing detail off to the experts
> is generally a good idea! My worry is one of process. The CHARMOD doc is
> a working draft dated 2005 - and we're heading for CR this month with
> Rec expected by year end (when our charter runs out).

Yes, understood. We don't want to be a blocker. We are proposing to prepare a document that is later incorporated into CHARMOD-NORM so that you have a reference sooner and which I expect we would publish to Note status. In the meantime, we will help you get the text you need in place.

> In view of the stages along the Rec Track that the documents are
> currently at, and likely to be at, we may have to refer to the CHARMOD
> doc and the guideline you're working on as an extra source of useful
> information?

Referring to CHARMOD is always good :-). Note that the Fundamentals part of CHARMOD is a REC and has valuable information in it.

http://www.w3.org/TR/charmod
Received on Wednesday, 1 October 2008 16:06:45 UTC