Re: comment: powder grouping handling of IRIs...

Agh! Now I see what the problem was with the first e-mail Addison sent.
Here is my reply again for the record (with apologies for multiple
postings).

Addison,

Many thanks to you and the i18n WG for taking the time and trouble to
look at our document. The problem of IRI canonicalisation was raised by
Thomas Roessler [1] and Eric P [2] following earlier drafts with further
comments from others - all very welcome. If one thing was clear after
those discussions it's that IRI canonicalisation is a difficult thing
and touches on issues well beyond the scope and expertise of the POWDER WG.

As a result, we did our best to say in the LC draft that a) this is a
complicated issue and b) precisely how an IRI should be matched is
context-dependent (network layer, browser layer etc.). Section 2.1.3.3
(Further Steps, [3]) is therefore deliberately much less precise than
the steps that precede it.

Let me take your comments line by line and see where we get to.

> 
> During our most recent teleconference, we reviewed your Last Call document [1] on POWDER Grouping of Resources. I am writing as a result of our discussion [2]. We recognize that your last call ended on the 14th and apologize for sending you comments after that date.

No worries, we're still working through comments and yours are very welcome.
> 
> The Internationalization WG is particularly concerned about Section 2.1.3, in which IRI canonicalization is handled. This section concerns us, in part, because there are some corner cases not handled and some issues we hadn't maybe documented very well in the past.

(more likely we've not done as much homework as we should ;-))

> In particular, it isn't clear if the IRI text is normalized to any particular Unicode Normalization Form [cf. 3] and when this conversion occurs in the tokenization process.

> 
> Also, there is a step dealing with percent-encoded values that reads:
> 
> --
> Percent encoded triples are converted into the characters they represent (e.g. %c3%a7 becomes ç etc.).
> --
> 
> This presupposes that all percent-encoded sequences represent a UTF-8 byte sequence, which may not be correct. It also omits mention of non-shortest form UTF-8 or the encoding of pure byte values (the former is illegal and is a security risk, the latter is permitted by RFC 3987 and exists as a corner case).


Well, we do say just above the bullet point you quote that "If not
already so encoded, the IRI/URI character string is converted into a
sequence of bytes using the UTF-8 encoding." But looking at the
documents you've referred us to, it looks as if we need to do more.
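
To make the corner cases concrete, here is a rough sketch in Python
(standard library only) of a decoding step that accepts percent-escapes
only when the resulting bytes are well-formed UTF-8. The function name
decode_percent_escapes and the choice to return None for anything else
are assumptions for illustration; nothing in the LC draft specifies this
behaviour.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_escapes(component):
    """Decode %XX escapes, accepting only well-formed UTF-8.

    Returns the decoded text, or None when the escapes do not form a
    valid UTF-8 sequence (hypothetical policy: the caller would then
    leave the escapes as they are).
    """
    raw = unquote_to_bytes(component)
    try:
        # The strict decoder rejects non-shortest-form sequences and
        # stray bytes that are not part of any UTF-8 character.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```

With this sketch, "%c3%a7" decodes to "ç", while the non-shortest form
"%c0%af" and the pure byte value "%ff" are both reported as undecodable
rather than being silently turned into characters.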
> 
> The document also fails to mention a normalization step to ensure that the IRI is in some Unicode normalization form. If percent-escapes are decoded, we theorize that the proper thing to do would be to normalize to Form C before parsing into tokens. This would help ensure that tokens are 'include-normalized' (although it would not guarantee that fact).

OK, tokenisation refers to the data not the IRI - I'll come to that.
> 
> We also note that there are several mentions in this section of mapping host parts to lowercase. Casefolding is applied to IDNA names, but it is not as simple an operation as for ASCII domain names.

OK, I'm obviously trying to make sure that we don't say anything that is
incorrect or ambiguous so we need to do more here. At present, the whole
section begins thus:

"Before any IRI or URI matching can take place the following
canonicalization steps should be applied to the candidate resource's IRI
or URI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS],
URISpace [URISpace] and XForms [XFORMS]."

Would this be more appropriate?

Before any IRI matching can take place the candidate resource's IRI
should be Fully Normalized to Form C, as defined in Character Model for
the World Wide Web 1.0: Normalization [CHARMOD] using UTF-8. The
following further steps should then be carried out which are consistent
with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms
[XFORMS].

AND modify the line about schemes and hosts to say:

The scheme and host are case insensitive but the canonical form of both
*(for ASCII characters)* is lower case. Therefore *ASCII characters* in
these components in the candidate URI/IRI are normalized to lower case.
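
A minimal sketch of what "ASCII characters only" could mean in practice,
again in Python. The helper names ascii_lower and
lowercase_scheme_and_host are hypothetical, and the sketch deliberately
does nothing about IDNA casefolding, which as noted above is not as
simple.

```python
from urllib.parse import urlsplit, urlunsplit

# Translation table that maps only A-Z to a-z; every other character,
# including non-ASCII letters, is left untouched.
_ASCII_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}


def ascii_lower(text):
    """Lower-case the ASCII letters in text and nothing else."""
    return text.translate(_ASCII_LOWER)


def lowercase_scheme_and_host(iri):
    """Lower-case ASCII letters in the scheme and host of an IRI."""
    parts = urlsplit(iri)
    # Keep any userinfo as-is; only the host (and port) is lower-cased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + ascii_lower(hostport)
    return urlunsplit(
        (ascii_lower(parts.scheme), netloc, parts.path, parts.query, parts.fragment)
    )
```

So "HTTP://Www.Example.ORG/Path?Q=1" would become
"http://www.example.org/Path?Q=1", with the path and query left alone,
whereas a plain str.lower() call would also alter non-ASCII letters in
the host.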


Later, in section 2.1.4 which deals with data encoding, we begin by saying

"If not already so encoded, the strings are converted into a sequence of
bytes using the UTF-8 encoding."

Again, we can extend this a little to say that the data should be Fully
Normalized to Form C.
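
For what it's worth, the combined step (normalize, then encode) might
look something like the sketch below. It applies Unicode Normalization
Form C only; whether that is sufficient for "Fully Normalized" in the
CHARMOD sense is exactly the sort of detail I would want checked, and the
function name to_nfc_utf8 is made up for the example.

```python
import unicodedata


def to_nfc_utf8(text):
    """Normalize text to Unicode Form C, then encode it as UTF-8 bytes."""
    return unicodedata.normalize("NFC", text).encode("utf-8")
```

The point of normalizing first is that "c" followed by a combining
cedilla (U+0327) and the precomposed "ç" (U+00E7) end up as the same
byte sequence, so a byte-wise comparison treats them as equal.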

> 
> There are other issues related to working with IRIs. As a result of examining this, we propose to write as soon as practical a guideline document that will be incorporated into Character Model (in [4]) that will help your group and others to act as a reference for this sort of complex IRI parsing in the future. We would like to know if this will help you and how best to coordinate our actions with your needs in this area.

That would certainly be most helpful - passing detail off to the experts
is generally a good idea! My worry is one of process. The CHARMOD doc is
a working draft dated 2005 - and we're heading for CR this month with
Rec expected by year end (when our charter runs out).

Given where the documents currently are on the Rec Track, and where they
are likely to be, might we have to refer to the CHARMOD doc and the
guideline you're working on as an extra source of useful information
rather than as a normative reference?

Phil.


[1] http://lists.w3.org/Archives/Public/public-powderwg/2007Nov/0012.html
[2] http://lists.w3.org/Archives/Public/public-powderwg/2008Feb/0003.html
[3] http://www.w3.org/TR/2008/WD-powder-grouping-20080815/#more-canon


-- 
Phil Archer
Chief Technical Officer,
Family Online Safety Institute
w. http://www.fosi.org/people/philarcher/


Received on Wednesday, 1 October 2008 10:18:44 UTC