Re: comment: powder grouping handling of IRIs...

Addison,

Many thanks to you and the i18n WG for taking the time and trouble to 
look at our document. The problem of IRI canonicalisation was raised by 
Thomas Roessler [1] and Eric P [2] following earlier drafts with further 
comments from others - all very welcome. If one thing was clear after 
those discussions it's that IRI canonicalisation is a difficult thing 
and touches on issues well beyond the scope and expertise of the POWDER WG.

As a result, we did our best to say in the LC draft that a) this is a 
complicated issue and b) precisely how an IRI should be matched is 
context-dependent (network layer, browser layer etc.). Section 2.1.3.3 
(Further Steps, [3]) is therefore much less precise than the steps that 
precede it.

Let me take your comments line by line and see where we get to.

> 
> During our most recent teleconference, we reviewed your Last Call document [1] on POWDER Grouping of Resources. I am writing as a result of our discussion [2]. We recognize that your last call ended on the 14th and apologize for sending you comments after that date.

No worries, we're still working through comments and yours are very welcome.
> 
> The Internationalization WG is particularly concerned about Section 2.1.3, in which IRI canonicalization is handled. This section concerns us, in part, because there are some corner cases not handled and some issues we hadn't maybe documented very well in the past.

(more likely we've not done as much homework as we should ;-))

> In particular, it isn't clear if the IRI text is normalized to any particular Unicode Normalization Form [cf. 3] and when this conversion occurs in the tokenization process.

> 
> Also, there is a step dealing with percent-encoded values that reads:
> 
> --
> Percent encoded triples are converted into the characters they represent (e.g. %c3%a7 becomes ç etc.).
> --
> 
> This presupposes that all percent-encoded sequences represent a UTF-8 byte sequence, which may not be correct. It also omits mention of non-shortest form UTF-8 or the encoding of pure byte values (the former is illegal and is a security risk, the latter is permitted by RFC 3987 and exists as a corner case).


Well, we do say just above the bullet point you quote that "If not 
already so encoded, the IRI/URI character string is converted into a 
sequence of bytes using the UTF-8 encoding." But looking at the 
documents you've referred us to, it looks as if we need to do more.
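
To make the corner cases concrete, here is a minimal Python sketch (an illustration only, not text from the draft). The function name decode_utf8_escapes is hypothetical. It decodes percent-escapes only where the bytes form valid UTF-8 and leaves everything else escaped, so non-shortest-form sequences and pure byte values are never turned into characters.

```python
import re

# A run of one or more consecutive percent-encoded triples, e.g. "%c3%a7".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_utf8_escapes(iri: str) -> str:
    """Decode percent-escapes that are valid UTF-8; keep the rest escaped.

    Assumption: this sketch decodes ASCII escapes too. A real implementation
    would probably keep reserved delimiters such as %2F escaped.
    """
    def replace(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        # Python's UTF-8 decoder rejects non-shortest forms; with
        # "surrogateescape" each undecodable byte becomes U+DC80..U+DCFF.
        text = raw.decode("utf-8", errors="surrogateescape")
        return "".join(
            "%{:02X}".format(ord(ch) - 0xDC00)  # put the raw byte back as an escape
            if 0xDC80 <= ord(ch) <= 0xDCFF
            else ch
            for ch in text
        )

    return _PCT_RUN.sub(replace, iri)
```

With this approach "%c3%a7" becomes "ç", while the overlong sequence "%c0%af" and a lone "%FF" remain percent-encoded (in upper-case hex).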
> 
> The document also fails to mention a normalization step to ensure that the IRI is in some Unicode normalization form. If percent-escapes are decoded, we theorize that the proper thing to do would be to normalize to Form C before parsing into tokens. This would help ensure that tokens are 'include-normalized' (although it would not guarantee that fact).

OK, tokenisation refers to the data not the IRI - I'll come to that.
> 
> We also note that there are several mentions in this section of mapping host parts to lowercase. Casefolding is applied to IDNA names, but it is not as simple an operation as for ASCII domain names.

OK, I'm obviously trying to make sure that we don't say anything that is 
incorrect or ambiguous so we need to do more here. At present, the whole 
section begins thus:

"Before any IRI or URI matching can take place the following 
canonicalization steps should be applied to the candidate resource's IRI 
or URI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS], 
URISpace [URISpace] and XForms [XFORMS]."

Would this be more appropriate:

Before any IRI matching can take place the candidate resource's IRI 
should be Fully Normalized to Form C, as defined in Character Model for 
the World Wide Web 1.0: Normalization [CHARMOD] using UTF-8. The 
following further steps should then be carried out which are consistent 
with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms 
[XFORMS].

AND modify the line about schemes and hosts to say

The scheme and host are case insensitive but the canonical form of both 
*(for ASCII characters)* is lower case. Therefore *ASCII characters* in 
these components in the candidate URI/IRI are normalized to lower case.
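
As a rough illustration of the ASCII-only wording proposed above, here is a minimal Python sketch. The helper names lower_ascii and lowercase_scheme_and_host are hypothetical, and splitting with urllib.parse is my assumption rather than anything the draft prescribes.

```python
from urllib.parse import urlsplit, urlunsplit

# Translation table that maps only A-Z to a-z; every other character is untouched.
_ASCII_LOWER = {code: code + 32 for code in range(ord("A"), ord("Z") + 1)}


def lower_ascii(text: str) -> str:
    """Lower-case ASCII letters only, unlike str.lower(), which is Unicode-aware."""
    return text.translate(_ASCII_LOWER)


def lowercase_scheme_and_host(iri: str) -> str:
    """Normalize ASCII letters in the scheme and host to lower case."""
    parts = urlsplit(iri)
    # The userinfo before "@" is case sensitive, so only the host:port part changes.
    userinfo, at_sign, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at_sign + lower_ascii(hostport)
    return urlunsplit(
        (lower_ascii(parts.scheme), netloc, parts.path, parts.query, parts.fragment)
    )
```

A host such as "BÜCHER.Example" comes out as "bÜcher.example": the ASCII letters are lowered but "Ü" is not, which shows why internationalized host names need more than this step.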


Later, in section 2.1.4 which deals with data encoding, we begin by saying

"If not already so encoded, the strings are converted into a sequence of 
bytes using the UTF-8 encoding."

Again, we can extend this a little to say that the data should be Fully 
Normalized to Form C.
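
A minimal Python sketch of that extension, for illustration only. The function name to_nfc_utf8 is hypothetical, and it covers plain NFC only, not the additional conditions that CHARMOD attaches to "fully normalized" text.

```python
import unicodedata


def to_nfc_utf8(text: str) -> bytes:
    """Normalize a string to Unicode Normalization Form C, then encode it as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

With this step, "c" followed by U+0327 (combining cedilla) and the precomposed U+00E7 yield the same bytes, C3 A7, so the two spellings of "ç" would match.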

> 
> There are other issues related to working with IRIs. As a result of examining this, we propose to write as soon as practical a guideline document that will be incorporated into Character Model (in [4]) that will help your group and others to act as a reference for this sort of complex IRI parsing in the future. We would like to know if this will help you and how best to coordinate our actions with your needs in this area.

That would certainly be most helpful - passing detail off to the experts 
is generally a good idea! My worry is one of process. The CHARMOD doc is 
a working draft dated 2005 - and we're heading for CR this month with 
Rec expected by year end (when our charter runs out).

Given where the documents currently are on the Rec Track, and where they 
are likely to be, we may have to refer to the CHARMOD doc and the 
guideline you're working on as extra sources of useful information. 
Would that work for you?

Phil.


[1] http://lists.w3.org/Archives/Public/public-powderwg/2007Nov/0012.html
[2] http://lists.w3.org/Archives/Public/public-powderwg/2008Feb/0003.html
[3] http://www.w3.org/TR/2008/WD-powder-grouping-20080815/#more-canon


-- 
Phil Archer
Chief Technical Officer,
Family Online Safety Institute
w. http://www.fosi.org/people/philarcher/

Received on Wednesday, 1 October 2008 10:13:03 UTC