RE: comment: powder grouping handling of IRIs... from Phillips, Addison on 2008-10-01 (public-i18n-core@w3.org from October to December 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Wed, 1 Oct 2008 09:06:01 -0700
To: Phil Archer <parcher@fosi.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
CC: Public POWDER <public-powderwg@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA014C65BB08@EX-SEA5-D.ant.amazon.com>

Hi Phil,

Thanks for the response. Some personal comments follow.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.


> 
> Many thanks to you and the i18n WG for taking the time and trouble to
> look at our document. The problem of IRI canonicalisation was raised by
> Thomas Roessler [1] and Eric P [2] following earlier drafts with further
> comments from others - all very welcome. If one thing was clear after
> those discussions it's that IRI canonicalisation is a difficult thing
> and touches on issues well beyond the scope and expertise of the
> POWDER WG.

It may not be quite *that* difficult, but it needs to be better documented. Individual specs like POWDER shouldn't have to invent it each time.

> 
> Well, we do say just above the bullet point you quote that "If not
> already so encoded, the IRI/URI character string is converted into
> a sequence of bytes using the UTF-8 encoding." 

I saw that and thought "oh, good, UTF-8", but on second thought... are you sure you mean "sequence of bytes" here? Maybe you should say "sequence of Unicode characters" (i.e., code points) instead. The particular Unicode encoding used to encode the characters is a matter for the implementation, and the regular expression stuff works just as well, if not better, with characters.
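To illustrate the difference, here is a minimal Python sketch. The IRI is a made-up example, and Python's `re` module only stands in for whatever regular expression engine an implementation actually uses. The same pattern is applied once to the code point sequence and once to its UTF-8 bytes:

```python
import re

# Hypothetical IRI for illustration; "\u00e9" is e-acute as one code point.
iri = "http://example.org/caf\u00e9"

as_chars = iri                   # sequence of Unicode code points
as_bytes = iri.encode("utf-8")   # one possible byte serialization

# In a str pattern "." matches one code point; in a bytes pattern it
# matches one byte, so the two-byte UTF-8 form of e-acute defeats it.
char_match = re.search(r"caf.$", as_chars)
byte_match = re.search(rb"caf.$", as_bytes)

print(char_match is not None)   # the character-level pattern matches
print(byte_match is not None)   # the byte-level pattern does not
```

An implementation is of course free to store the characters as UTF-8 internally; the point is only that matching is defined over characters.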

> >
> > The document also fails to mention a normalization step to ensure
> > that the IRI is in some Unicode normalization form. If percent-escapes
> > are decoded, we theorize that the proper thing to do would be to
> > normalize to Form C before parsing into tokens. This would help ensure
> > that tokens are 'include-normalized' (although it would not guarantee
> > that fact).
> 
> OK, tokenisation refers to the data not the IRI - I'll come to that.

Yes, but you extract IRI components in this section. That's really what I meant.
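As a rough sketch of what I mean (Python; the function name and sample components are made up for illustration, and decoding every percent-escape indiscriminately is a simplification, since real IRI processing has to be careful with reserved characters), decode first and then normalize to Form C before extracting or comparing components:

```python
import unicodedata
from urllib.parse import unquote

def decode_and_normalize(component: str) -> str:
    """Percent-decode as UTF-8, then apply Unicode Normalization Form C."""
    decoded = unquote(component, encoding="utf-8", errors="strict")
    return unicodedata.normalize("NFC", decoded)

# Two spellings of the same path segment:
precomposed = decode_and_normalize("caf%C3%A9")    # U+00E9
decomposed = decode_and_normalize("cafe%CC%81")    # "e" + U+0301 combining acute

print(precomposed == decomposed)   # both normalize to the same string
```

Without the normalization step the two segments would compare as different strings even though they display identically.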

> >
> > We also note that there are several mentions in this section of
> > mapping host parts to lowercase. Casefolding is applied to IDNA names,
> > but it is not as simple an operation as for ASCII domain names.
> 
> OK, I'm obviously trying to make sure that we don't say anything that is
> incorrect or ambiguous so we need to do more here. At present, the whole
> section begins thus:
> 
> "Before any IRI or URI matching can take place the following
> canonicalization steps should be applied to the candidate resource's IRI
> or URI. These steps are consistent with RFC3986 [URIS], RFC3987 [IRIS],
> URISpace [URISpace] and XForms [XFORMS]."
> 
> Would this be more appropriate:
> 
> Before any IRI matching can take place the candidate resource's IRI
> should be Fully Normalized to Form C, as defined in Character Model

s/Fully//

> for
> the World Wide Web 1.0: Normalization [CHARMOD] using utf-8. The

s/using utf-8//

> following further steps should then be carried out which are consistent
> with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and XForms
> [XFORMS].
> 
> AND modify the line about schemes and hosts to say
> 
> The scheme and host are case insensitive but the canonical form of both
> *(for ascii characters)* is lower case. Therefore *ascii characters* in
> these components in the candidate URI/IRI are normalized to lower case.

The non-ASCII characters are also normalized to lowercase (where applicable) during STRINGPREP before Punycode is applied. However, Punycode can encode *any* Unicode character sequence, not just ones that have been stringprepped. 
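A small Python sketch of that distinction, using the standard library's IDNA 2003 support (the label is just an example; `nameprep` is the helper in Python's `encodings.idna` module, not anything POWDER defines):

```python
from encodings.idna import nameprep

label = "B\u00fccher"   # "Bücher", example host label with a capital "B"

# Raw Punycode will happily encode the label without any preparation,
# so the uppercase "B" survives into the encoded form.
raw = label.encode("punycode")

# Nameprep (the stringprep profile used by IDNA 2003) case-folds first.
prepped = nameprep(label)
prepped_puny = prepped.encode("punycode")

# The "idna" codec runs nameprep and adds the ACE prefix in one step.
ace = label.encode("idna")

print(raw, prepped, prepped_puny, ace)
```

So lowercasing "for ASCII characters" only is not really what happens to an internationalized host name: the folding is part of the preparation step, and skipping that step still yields something that looks like valid Punycode.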

> 
> 
> Later, in section 2.1.4 which deals with data encoding, we begin by
> saying
> 
> "If not already so encoded, the strings are converted into a sequence of
> bytes using the UTF-8 encoding."
> 
> Again, we can extend this a little to say that the data should be Fully
> Normalized to Form C.

Again, don't say "fully" and probably not UTF-8.

> 
> >
> > There are other issues related to working with IRIs. As a result of
> > examining this, we propose to write as soon as practical a guideline
> > document that will be incorporated into Character Model (in [4]) that
> > will help your group and others to act as a reference for this sort of
> > complex IRI parsing in the future. We would like to know if this will
> > help you and how best to coordinate our actions with your needs in
> > this area.
> 
> That would certainly be most helpful - passing detail off to the experts
> is generally a good idea! My worry is one of process. The CHARMOD doc is
> a working draft dated 2005 - and we're heading for CR this month with
> Rec expected by year end (when our charter runs out).

Yes, understood. We don't want to be a blocker. We are proposing to prepare a separate document, which I expect we would publish as a Note so that you have a reference sooner, and which would later be incorporated into CHARMOD-NORM. In the meantime, we will help you get the text you need in place.

> 
> In view of the stages along the Rec Track that the documents are
> currently at, and likely to be at, we may have to refer to the CHARMOD
> doc and the guideline you're working on as an extra source of useful
> information?

Referring to CHARMOD is always good :-). Note that the Fundamentals part of CHARMOD is a REC and has valuable information in it. http://www.w3.org/TR/charmod
Received on Wednesday, 1 October 2008 16:06:41 UTC