RE: Some issues with the IRI document [nfcnfkc-04] from Michel Suignard on 2003-04-16 (public-iri@w3.org from April 2003)

From: Michel Suignard <michelsu@windows.microsoft.com>
Date: Wed, 16 Apr 2003 14:01:54 -0700
To: "Paul Hoffman / IMC" <phoffman@imc.org>, "Martin Duerst" <duerst@w3.org>
Cc: <public-iri@w3.org>
Message-ID: <84DD35E3DD87D5489AC42A59926DABE901032E37@WIN-MSG-10.wingroup.windeploy.ntdev.mi>

| From: Paul Hoffman / IMC [mailto:phoffman@imc.org] 
| At 5:22 PM -0400 4/15/03, Martin Duerst wrote:
| >Overall, the normalization strategy on IRIs varies according to the 
| >place in the URI:
| >
| >- For domain name part: use NFKC or more (i.e. nameprep), but
| >   gets normalized again (with nameprep) when doing dns lookup.
| >- For the path part: preferably NFKC, but NFC is okay when needed.
| >- For the query part: There may be cases where you on purpose
| >   want to use something totally unnormalized (e.g. when submitting
| >   unnormalized data to a CGI script that normalizes).
| >
| >Does that sound reasonable? Do you think it needs any changes in the 
| >draft, and if yes, what would be those changes?
| 
| It doesn't sound reasonable if you intend IRI comparison to be 
| interoperable. If you don't intend IRI comparison to be 
| interoperable, I still would pick one normalization for each of the 
| three parts, and I would pick NFKC, but you don't have to be 
| consistent if interoperability isn't important.
| 
| Am I the only person who worries about IRI comparison being 
| interoperable?

I really think it is a bad idea to try to enforce NFKC on all components
of an IRI string. What is tolerable for a host/domain name is not
tolerable for many other components. NFKC removes many subtleties from
the character repertoire that may have to be preserved for some schemes.
For the issues and concerns with NFKC, see Unicode TR15:
http://www.unicode.org/reports/tr15/
In general, NFC is much better. Furthermore, there are even some
components, like the query part, where you may want to transmit a
non-normalized text string. Interoperability is obviously important, but
it just means that some scheme awareness is required for comparison.
Martin has already answered most of that part, so I won't go there.
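
To make the NFC/NFKC distinction concrete, here is a minimal sketch in
Python using the standard `unicodedata` module. The function name
`normalize_iri_for_comparison` and the per-component policy are
assumptions made for illustration only: the host gets NFKC as a rough
stand-in for nameprep (which does more than NFKC), the path gets NFC,
and the query is passed through untouched. It is not taken from the IRI
draft.

```python
import unicodedata
from urllib.parse import urlsplit, urlunsplit

# NFKC folds compatibility characters that NFC preserves.
superscript = "x\u00b2"   # "x" followed by SUPERSCRIPT TWO
ligature = "\ufb01le"     # LATIN SMALL LIGATURE FI followed by "le"
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

nfc_superscript = unicodedata.normalize("NFC", superscript)    # unchanged
nfkc_superscript = unicodedata.normalize("NFKC", superscript)  # "x2"
nfkc_ligature = unicodedata.normalize("NFKC", ligature)        # "file"
nfc_composed = unicodedata.normalize("NFC", decomposed)        # "\u00e9"


def normalize_iri_for_comparison(iri: str) -> str:
    """Hypothetical per-component normalization (illustrative policy only)."""
    parts = urlsplit(iri)
    # Host: NFKC plus lowercasing, a simplification of nameprep.
    host = unicodedata.normalize("NFKC", parts.netloc).lower()
    # Path: NFC only, so compatibility distinctions survive.
    path = unicodedata.normalize("NFC", parts.path)
    # Query and fragment: left exactly as supplied.
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
```

Under this policy, "http://EXAMPLE.org/cafe\u0301?q=e\u0301" and
"http://example.org/caf\u00e9?q=e\u0301" compare equal after
normalization, while a path containing "x\u00b2" stays distinct from one
containing "x2". A real comparison would also have to consult
scheme-specific rules, which this sketch ignores.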

Michel

Received on Wednesday, 16 April 2003 17:04:40 UTC