RE: comment: powder grouping handling of IRIs...

Hi Phil,

Looking better, but still not quite there.

Normalization to Form C is an operation that has to take into account the other operations you're doing. I think that the proper sequence of steps should probably be:

1. If not already so encoded, convert the IRI to a sequence of Unicode characters.
2. Unescape any percent-encoded triples.
3. Normalize the string to Unicode Normalization Form C (NFC).
// the remaining steps are unchanged

It's important to do it in this order because each of the preceding two steps might produce a non-normalized result. For example, the sequence "%cc%80" encodes the character U+0300 (a combining grave accent). If you unescape this *after* applying Form C, you might end up with a non-normalized character sequence: "E%cc%80" would end up being U+0045 U+0300 instead of the normalized U+00C8 (È).
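
The ordering problem can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names are made up for the example and do not come from the POWDER draft, and `unquote` decodes every percent-escape as UTF-8, whereas a real implementation would probably need to treat reserved characters such as %2F separately.

```python
import unicodedata
from urllib.parse import unquote


def unescape_then_normalize(iri: str) -> str:
    """Suggested order: unescape percent-encoded triples, then apply NFC."""
    unescaped = unquote(iri, encoding="utf-8")
    return unicodedata.normalize("NFC", unescaped)


def normalize_then_unescape(iri: str) -> str:
    """Problematic order: NFC first, so escaped combining marks slip through."""
    normalized = unicodedata.normalize("NFC", iri)
    return unquote(normalized, encoding="utf-8")


# "E%cc%80" is "E" followed by the escaped UTF-8 bytes of U+0300.
good = unescape_then_normalize("E%cc%80")
bad = normalize_then_unescape("E%cc%80")

# Print the code points of each result for comparison.
print([f"U+{ord(c):04X}" for c in good])
print([f"U+{ord(c):04X}" for c in bad])
```

With the suggested order the result is the single precomposed character U+00C8. With the reversed order the string is still U+0045 U+0300, because Form C had nothing to compose while the combining mark was still hidden inside the escape.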

> Incidentally, I have added your name to the acknowledgements (so
> you get some of the blame if it's wrong ;-) )

Many thanks: I live for blame.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: Phil Archer [mailto:phil@philarcher.org]
> Sent: Tuesday, January 20, 2009 1:52 AM
> To: Phillips, Addison
> Cc: public-i18n-core@w3.org; Public POWDER
> Subject: Re: comment: powder grouping handling of IRIs...
> 
> Thanks very much indeed, Addison.
> 
> I've switched things around and tried to reduce the number of
> superfluous mentions of URIs. The results are now at [1].
> 
> I have one further question, born of my unfamiliarity with the
> issues
> under discussion. The opening sentences of the section currently
> say:
> 
> Before any IRI matching can take place the candidate resource's IRI
> should be normalized to Form C, as defined in Character Model for
> the
> World Wide Web 1.0: Normalization [CHARMOD-NORM]. The following
> further
> steps should then be carried out...
> 
> Which of these is true please:
> 
> 1. One normalises the string to Form C and then carries out the
> further
> steps as described (in which case the current text is correct).
> 
> 2. By carrying out the steps one normalises the string to Form C
> (in
> which case the current text needs a slight amendment)
> 
> I /think/ 2 is correct? but I'm just not sure enough to make the
> change.
> 
> Incidentally, I have added your name to the acknowledgements (so
> you get
> some of the blame if it's wrong ;-) )
> 
> Thanks again
> 
> Phil.
> 
> 
> [1] http://philarcher.org/powder/grouping/20090120.html#canon
> Disclaimer: please note that this is a temporary URI and this is
> not an
> official W3C publication.
> 
> Phillips, Addison wrote:
> > Hi Phil, (this is a personal reply not--or at least not-yet--
> endorsed by the I18N WG)
> >
> > Thanks for the note. This looks pretty good. However, I do have
> some comments.
> >
> > 1. The case and path normalization steps occur before the IRI is
> converted to Unicode, Unicode-normalized, and percent escapes
> removed. This should be reversed. For example, both %c3%80 and
> %C3%80 represent the uppercase letter 'À'. I think the intention is
> to normalize the unescaped characters rather than the escapes.
> Further, not removing escapes and converting to Unicode may expose
> security flaws in processing (where escaped values should have been
> normalized and produce false matches, additional trailing dots/path
> elements, etc.). You do cover the %2F case, which is good.
> >
> > 2. You have a step for case normalizing portions of the IRI
> (particularly the host). Case normalization is locale-sensitive and
> is not limited to non-ASCII characters. See [1]. So where it says:
> >
> > --
> > Therefore characters in these components in the candidate URI/IRI
> are normalized to lower case where applicable.
> > --
> >
> > I would recommend that you say:
> >
> > --
> > Therefore, where applicable, characters in these components in
> the candidate URI/IRI are normalized to lower case using the
> default Unicode case mapping.
> > --
> >
> > 3. Observation: although the text talks about "URI/IRI"
> consistently as a pairing, the net result of your algorithm is
> conversion of all URIs to IRIs and the processing is as an IRI
> after that. It might be useful to acknowledge this. There is no
> reason to flip back-and-forth between RFCs 3986 and 3987.
> >
> >
> > This is just a peek on my part. Hope this helps.
> >
> > Addison
> >
> > [1] http://www.w3.org/International/wiki/Case_folding
> >
> > Addison Phillips
> > Globalization Architect -- Lab126
> >
> > Internationalization is not a feature.
> > It is an architecture.
> >
> >
> >> -----Original Message-----
> >> From: Phil Archer [mailto:phil@philarcher.org]
> >> Sent: Monday, January 19, 2009 6:07 AM
> >> To: Phillips, Addison
> >> Cc: public-i18n-core@w3.org; Public POWDER
> >> Subject: Re: comment: powder grouping handling of IRIs...
> >>
> >> Addison,
> >>
> >> This is an old thread but I need to pick it up again.
> >>
> >> I'm in the middle of preparing for a transition request to PR
> and,
> >> as
> >> part of my checks I see that although we did incorporate your
> >> changes in
> >> full, I've not had the decency to let you know this and to check
> >> that
> >> the text is now in accordance with i18n recommendations. So,
> first
> >> of
> >> all, I apologise sincerely for this oversight and downright
> >> rudeness!
> >>
> >> Secondly, may I ask you please to take a quick peek at an
> >> unofficial,
> >> not published by the W3C, editors' draft of the relevant section
> at
> >> [1].
> >> The aim was to use exactly your words and suggestions. I hope we
> >> got it
> >> right?
> >>
> >> Thank you. And, again, sincere apologies for not sending this
> weeks
> >> ago
> >>
> >> Phil.
> >>
> >> [1] http://philarcher.org/powder/grouping/20090107.html#canon
> >>
> >>
> >> Phillips, Addison wrote:
> >>> Hi Phil,
> >>>
> >>> Thanks for the response. Some personal comments follow.
> >>>
> >>> Addison
> >>>
> >>> Addison Phillips
> >>> Globalization Architect -- Lab126
> >>>
> >>> Internationalization is not a feature.
> >>> It is an architecture.
> >>>
> >>>
> >>>> Many thanks to you and the i18n WG for taking the time and
> >> trouble
> >>>> to
> >>>> look at our document. The problem of IRI canonicalisation was
> >>>> raised by
> >>>> Thomas Roessler [1] and Eric P [2] following earlier drafts
> with
> >>>> further
> >>>> comments from others - all very welcome. If one thing was
> clear
> >>>> after
> >>>> those discussions it's that IRI canonicalisation is a
> difficult
> >>>> thing
> >>>> and touches on issues well beyond the scope and expertise of
> the
> >>>> POWDER WG.
> >>> It may not be quite *that* difficult, but it needs to be better
> >> documented. Individual specs like POWDER shouldn't have to
> invent
> >> it each time.
> >>>> Well, we do say just above the bullet point you quote that "If
> >> not
> >>>> already so encoded, the IRI/URI character string is converted
> >> into
> >>>> a sequence of bytes using the UTF-8 encoding."
> >>> I... saw that and thought "oh, good, UTF-8", but on second
> >> thought...... Are you sure you mean "sequence of bytes" here?
> Maybe
> >> you should say "sequence of Unicode characters" ([sic] code
> points)
> >> instead. The particular Unicode encoding used to encode the
> >> characters is a matter for the implementation and the regular
> >> expression stuff works just as well if not better with
> characters.
> >>>>> The document also fails to mention a normalization step to
> >> ensure
> >>>> that the IRI is in
> >>>> some Unicode normalization form. If percent-escapes are
> decoded,
> >> we
> >>>> theorize that the proper
> >>>> thing to do would be to normalize to Form C before parsing
> into
> >>>> tokens.
> >>>> This would help ensure
> >>>> that tokens are 'include-normalized' (although it would not
> >>>> guarantee
> >>>> that fact).
> >>>>
> >>>> OK, tokenisation refers to the data not the IRI - I'll come to
> >> that.
> >>> Yes, but you extract IRI components in this section. That's
> >> really what I meant.
> >>>>> We also note that there are several mentions in this section
> of
> >>>> mapping host parts to lowercase.
> >>>> Casefolding is applied to IDNA names, but it is not as simple
> an
> >>>> operation as for ASCII domain names.
> >>>>
> >>>> OK, I'm obviously trying to make sure that we don't say
> anything
> >>>> that is
> >>>> incorrect or ambiguous so we need to do more here. At present,
> >> the
> >>>> whole
> >>>> section begins thus:
> >>>>
> >>>> "Before any IRI or URI matching can take place the following
> >>>> canonicalization steps should be applied to the candidate
> >>>> resource's IRI
> >>>> or URI. These steps are consistent with RFC3986 [URIS],
> RFC3987
> >>>> [IRIS],
> >>>> URISpace [URISpace] and XForms [XFORMS]."
> >>>>
> >>>> Would this be more appropriate:
> >>>>
> >>>> Before any IRI matching can take place the candidate
> resource's
> >> IRI
> >>>> should be Fully Normalized to Form C, as defined in Character
> >> Model
> >>> s/Fully//
> >>>
> >>>> for
> >>>> the World Wide Web 1.0: Normalization [CHARMOD] using utf-8.
> The
> >>> s/using utf-8//
> >>>
> >>>> following further steps should then be carried out which are
> >>>> consistent
> >>>> with RFC3986 [URIS], RFC3987 [IRIS], URISpace [URISpace] and
> >> XForms
> >>>> [XFORMS].
> >>>>
> >>>> AND modify the line about schemes and hosts to say
> >>>>
> >>>> The scheme and host are case insensitive but the canonical
> form
> >> of
> >>>> both
> >>>> *(for ascii characters)* is lower case. Therefore *ascii
> >>>> characters* in
> >>>> these components in the candidate URI/IRI are normalized to
> >> lower
> >>>> case.
> >>> The non-ASCII characters are also normalized to lowercase
> (where
> >> applicable) during STRINGPREP before Punycode is applied.
> However,
> >> Punycode can encode *any* Unicode character sequence, not just
> ones
> >> that have been stringprepped.
> >>>> Later, in section 2.1.4 which deals with data encoding, we
> begin
> >> by
> >>>> saying
> >>>>
> >>>> "If not already so encoded, the strings are converted into a
> >>>> sequence of
> >>>> bytes using the UTF-8 encoding."
> >>>>
> >>>> Again, we can extend this a little to say that the data should
> >> be
> >>>> Fully
> >>>> Normalized to Form C.
> >>> Again, don't say "fully" and probably not UTF-8.
> >>>
> >>>>> There are other issues related to working with IRIs. As a
> >> result
> >>>> of examining
> >>>> this, we propose to write as soon as practical a guideline
> >> document
> >>>> that
> >>>> will
> >>>> be incorporated into Character Model (in [4]) that will help
> >> your
> >>>> group and
> >>>> others to act as a reference for this sort of complex IRI
> >> parsing
> >>>> in the
> >>>> future.
> >>>> We would like to know if this will help you and how best to
> >>>> coordinate our
> >>>> actions with your needs in this area.
> >>>>
> >>>> That would certainly be most helpful - passing detail off to
> the
> >>>> experts
> >>>> is generally a good idea! My worry is one of process. The
> >> CHARMOD
> >>>> doc is
> >>>> a working draft dated 2005 - and we're heading for CR this
> >> month
> >>>> with
> >>>> Rec expected by year end (when our charter runs out).
> >>> Yes, understood. We don't want to be a blocker. We are
> proposing
> >> to prepare a document that is later incorporated into CHARMOD-
> NORM
> >> so that you have a reference sooner and which I expect we would
> >> publish to Note status. In the meantime, we will help you get
> the
> >> text you need in place.
> >>>> In view of the stages along the Rec Track that the documents
> are
> >>>> currently at, and likely to be at, we may have to refer to the
> >>>> CHARMOD
> >>>> doc and the guideline you're working on as an extra source of
> >>>> useful
> >>>> information?
> >>> Referring to CHARMOD is always good :-). Note that the
> >> Fundamentals part of CHARMOD is a REC and has valuable
> information
> >> in it. http://www.w3.org/TR/charmod
> >>>
> >> --
> >> Phil Archer
> >> w. http://philarcher.org/
> 
> --
> Phil Archer
> w. http://philarcher.org/

Received on Tuesday, 20 January 2009 23:38:14 UTC