Re: comment: powder grouping handling of IRIs... from Phil Archer on 2008-10-03 (public-powderwg@w3.org from October 2008)

From: Phil Archer <parcher@fosi.org>
Date: Fri, 03 Oct 2008 12:37:27 +0100
To: "Phillips, Addison" <addison@amazon.com>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Public POWDER <public-powderwg@w3.org>
Message-ID: <48E603F7.3050904@fosi.org>
Phillips, Addison wrote:
> Hi Phil,
> 
> Thanks for the response. Some personal comments follow.
> 
> Addison
> 
> Addison Phillips
> Globalization Architect -- Lab126
> 
> Internationalization is not a feature.
> It is an architecture.
> 
> 
>> Many thanks to you and the i18n WG for taking the time and trouble to
>> look at our document. The problem of IRI canonicalisation was raised by
>> Thomas Roessler [1] and Eric P [2] following earlier drafts with further
>> comments from others - all very welcome. If one thing was clear after
>> those discussions it's that IRI canonicalisation is a difficult thing
>> and touches on issues well beyond the scope and expertise of the
>> POWDER WG.
> 
> It may not be quite *that* difficult, but it needs to be better documented. Individual specs like POWDER shouldn't have to invent it each time.
> 
>> Well, we do say just above the bullet point you quote that "If not
>> already so encoded, the IRI/URI character string is converted into
>> a sequence of bytes using the UTF-8 encoding." 
> 
> I... saw that and thought "oh, good, UTF-8", but on second thought... Are you sure you mean "sequence of bytes" here? Maybe you should say "sequence of Unicode characters" ([sic] code points) instead. The particular Unicode encoding used to encode the characters is a matter for the implementation and the regular expression stuff works just as well if not better with characters.

I got that phrase 'sequence of bytes' from somewhere else but I'm blowed 
if I can find it now. OK, sequence of characters seems perfectly 
reasonable. That's the phrase used in RFC 3987...
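
A minimal sketch (Python, illustrative only and not taken from the POWDER draft) of why regular expressions behave more predictably on a sequence of characters than on UTF-8 bytes. The sample string is made up for the example:

```python
import re

iri_path = "caf\u00e9"               # four characters; the last one is é
as_bytes = iri_path.encode("utf-8")  # five bytes: é takes two bytes in UTF-8

# In a str pattern, '.' matches exactly one character, so this matches.
char_match = re.fullmatch(r"caf.", iri_path)

# In a bytes pattern, '.' matches exactly one byte, so the two-byte é
# does not fit and the same expression fails.
byte_match = re.fullmatch(rb"caf.", as_bytes)

print(len(iri_path), len(as_bytes))
print(char_match is not None, byte_match is not None)
```

Working on characters leaves the choice of in-memory encoding to the implementation, which is the point made in the quoted comment.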

> 
>>> The document also fails to mention a normalization step to ensure
>>> that the IRI is in some Unicode normalization form. If percent-escapes
>>> are decoded, we theorize that the proper thing to do would be to
>>> normalize to Form C before parsing into tokens. This would help ensure
>>> that tokens are 'include-normalized' (although it would not guarantee
>>> that fact).
>>
>> OK, tokenisation refers to the data not the IRI - I'll come to that.
> 
> Yes, but you extract IRI components in this section. That's really what I meant.

I see, yes, you're right.
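
A minimal sketch (Python, illustrative only) of the step the i18n comment describes: decode percent-escapes as UTF-8, then normalize to Form C before extracting components. The function name `decode_and_normalize` and the sample strings are assumptions made for this example, and the sketch ignores the question of which escapes are safe to decode:

```python
import unicodedata
from urllib.parse import unquote

def decode_and_normalize(component: str) -> str:
    """Decode percent-escapes as UTF-8, then apply Unicode Normalization Form C."""
    decoded = unquote(component, encoding="utf-8", errors="strict")
    return unicodedata.normalize("NFC", decoded)

# The same visible string, once with precomposed é (U+00E9) and once as
# 'e' followed by a combining acute accent (U+0301).
composed = decode_and_normalize("caf%C3%A9")
decomposed = decode_and_normalize("cafe%CC%81")

print(composed == decomposed)
```

Without the normalization step the two decoded strings differ, so a pattern written against one form would not match the other.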

> 
>>> We also note that there are several mentions in this section of
>>> mapping host parts to lowercase. Casefolding is applied to IDNA names,
>>> but it is not as simple an operation as for ASCII domain names.
>>
>> OK, I'm obviously trying to make sure that we don't say anything that is
>> incorrect or ambiguous so we need to do more here. At present, the whole
>> section begins thus:
>>
>> "Before any IRI or URI matching can take place the following
>> canonicalization steps should be applied to the candidate resource's IRI
>> or URI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS],
>> URISpace [URISpace] and XForms [XFORMS]."
>>
>> Would this be more appropriate:
>>
>> Before any IRI matching can take place the candidate resource's IRI
>> should be Fully Normalized to Form C, as defined in Character Model
> 
> s/Fully//

Noted

> 
>> for the World Wide Web 1.0: Normalization [CHARMOD] using utf-8. The
> 
> s/using utf-8//

OK

> 
>> following further steps should then be carried out which are consistent
>> with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms
>> [XFORMS].
>>
>> AND modify the line about schemes and hosts to say
>>
>> The scheme and host are case insensitive but the canonical form of both
>> *(for ascii characters)* is lower case. Therefore *ascii characters* in
>> these components in the candidate URI/IRI are normalized to lower case.
> 
> The non-ASCII characters are also normalized to lowercase (where applicable) during STRINGPREP before Punycode is applied. However, Punycode can encode *any* Unicode character sequence, not just ones that have been stringprepped. 

So how's this:

The scheme and host are case insensitive but the canonical form of both
is lower case. Therefore characters in these components in the candidate 
URI/IRI are normalized to lower case where applicable.
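
A minimal sketch (Python, illustrative only) of lower-casing the scheme and host while leaving the rest of the IRI alone. The function name `lowercase_scheme_and_host` is made up for this example, and the sketch assumes the authority carries no userinfo, since that part can be case-sensitive:

```python
from urllib.parse import urlsplit, urlunsplit

def lowercase_scheme_and_host(iri: str) -> str:
    """Lower-case scheme and authority; path, query and fragment are untouched."""
    parts = urlsplit(iri)
    # Assumption: netloc is host[:port] only (no user:password@ prefix).
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        parts.query,
        parts.fragment,
    ))

ascii_result = lowercase_scheme_and_host("HTTP://WWW.Example.ORG/Some/Path")
non_ascii_result = lowercase_scheme_and_host("http://B\u00dcCHER.example/")

print(ascii_result)
print(non_ascii_result)
```

Python's `str.lower()` also maps non-ASCII letters, which corresponds to the "where applicable" wording; it is simple lower-casing and not the full STRINGPREP mapping mentioned in the quoted comment.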

But are IDNs case-sensitive? Is http://www.xn--exmple-jua.org/ different 
from http://www.xn--exmple-JUA.org/?

Sorry, I'm being lazy and asking you rather than boning up on Punycode.
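
A minimal sketch (Python, illustrative only) that checks the two labels above with the standard library's `punycode` codec. RFC 3492 requires decoders to accept the encoded digits in either case, so the part after the last hyphen should decode identically; the helper name `decode_ace_label` is an assumption made for this example:

```python
def decode_ace_label(label: str) -> str:
    """Strip the 'xn--' ACE prefix (any case) and Punycode-decode the rest."""
    if label[:4].lower() != "xn--":
        raise ValueError("not an ACE label: " + label)
    return label[4:].encode("ascii").decode("punycode")

lower_label = "xn--exmple-jua"
upper_label = "xn--exmple-JUA"

# Both spellings of the encoded digits decode to the same Unicode label.
same_label = decode_ace_label(lower_label) == decode_ace_label(upper_label)
print(same_label)
```

Note that the basic (ASCII) characters before the last hyphen are copied through literally by Punycode, so case differences there are not removed by decoding and would still need the lower-casing step.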

> 
>>
>> Later, in section 2.1.4 which deals with data encoding, we begin by
>> saying
>>
>> "If not already so encoded, the strings are converted into a sequence of
>> bytes using the UTF-8 encoding."
>>
>> Again, we can extend this a little to say that the data should be Fully
>> Normalized to Form C.
> 
> Again, don't say "fully" and probably not UTF-8.

OK

> 
>>> There are other issues related to working with IRIs. As a result of
>>> examining this, we propose to write as soon as practical a guideline
>>> document that will be incorporated into Character Model (in [4]) that
>>> will help your group and others to act as a reference for this sort of
>>> complex IRI parsing in the future. We would like to know if this will
>>> help you and how best to coordinate our actions with your needs in
>>> this area.
>>
>> That would certainly be most helpful - passing detail off to the experts
>> is generally a good idea! My worry is one of process. The CHARMOD doc is
>> a working draft dated 2005 - and we're heading for CR this month with
>> Rec expected by year end (when our charter runs out).
> 
> Yes, understood. We don't want to be a blocker. We are proposing to prepare a document that is later incorporated into CHARMOD-NORM so that you have a reference sooner and which I expect we would publish to Note status. In the meantime, we will help you get the text you need in place.

Thank you. I'll do some work to incorporate your comments as soon as I 
can and get at least that section to you.

> 
>> In view of the stages along the Rec Track that the documents are
>> currently at, and likely to be at, we may have to refer to the CHARMOD
>> doc and the guideline you're working on as an extra source of useful
>> information?
> 
> Referring to CHARMOD is always good :-). Note that the Fundamentals part of CHARMOD is a REC and has valuable information in it. http://www.w3.org/TR/charmod

I'll read that properly before editing anything!

Thanks again

Phil.
Received on Friday, 3 October 2008 11:38:02 UTC