RE: Some issues with the IRI document [nfcnfkc-04]

| From: Paul Hoffman / IMC [mailto:phoffman@imc.org] 
| At 5:22 PM -0400 4/15/03, Martin Duerst wrote:
| >Overall, the normalization strategy on IRIs varies according to the 
| >place in the URI:
| >
| >- For domain name part: use NFKC or more (i.e. nameprep), but
| >   gets normalized again (with nameprep) when doing dns lookup.
| >- For the path part: preferably NFKC, but NFC is okay when needed.
| >- For the query part: There may be cases where you on purpose
| >   want to use something totally unnormalized (e.g. when submitting
| >   unnormalized data to a CGI script that normalizes).
| >
| >Does that sound reasonable? Do you think it needs any changes in the 
| >draft, and if yes, what would be those changes?
| 
| It doesn't sound reasonable if you intend IRI comparison to be 
| interoperable. If you don't intend IRI comparison to be 
| interoperable, I still would pick one normalization for each of the 
| three parts, and I would pick NFKC, but you don't have to be 
| consistent if interoperability isn't important.
| 
| Am I the only person who worries about IRI comparison being 
| interoperable?

I really think it is a bad idea to try to enforce NFKC on all components
of an IRI string. What is tolerable for a host/domain name is not for
many other components. NFKC removes many subtleties from the character
repertoire that may have to be preserved for some schemes. For the
issues and concerns with NFKC, see Unicode TR15,
http://www.unicode.org/reports/tr15/. In general, NFC is much better.
Furthermore, there are even some components, like the query part, where
you may want to transmit an unnormalized text string. Interoperability
is obviously important, but it just means that some scheme awareness is
required for comparison. Martin has already answered most of that part
so I won't go there.
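
To make the NFC/NFKC difference concrete, here is a minimal sketch using
Python's standard unicodedata module. The sample strings, the POLICY
table and the normalize_component helper are illustrative assumptions,
not anything taken from the draft. Plain NFKC stands in for nameprep on
the host part, which in reality does more than NFKC.

```python
import unicodedata

# Sample strings chosen for illustration (assumptions, not from the draft).
SAMPLES = {
    "superscript two": "x\u00b2",       # x followed by SUPERSCRIPT TWO
    "fi ligature": "\ufb01le",          # LATIN SMALL LIGATURE FI + "le"
    "fullwidth A": "\uff21",            # FULLWIDTH LATIN CAPITAL LETTER A
    "decomposed e-acute": "e\u0301",    # e + COMBINING ACUTE ACCENT
}

# Hypothetical per-component policy: NFKC for the host (stand-in for
# nameprep), NFC for the path, and no normalization for the query.
POLICY = {"host": "NFKC", "path": "NFC", "query": None}


def normalize_component(kind, text):
    """Normalize text according to the (assumed) policy for its component."""
    form = POLICY[kind]
    return text if form is None else unicodedata.normalize(form, text)


if __name__ == "__main__":
    for name, text in SAMPLES.items():
        nfc = unicodedata.normalize("NFC", text)
        nfkc = unicodedata.normalize("NFKC", text)
        # ascii() makes the code points visible regardless of the terminal.
        print(f"{name}: NFC={ascii(nfc)} NFKC={ascii(nfkc)}")
```

NFC only composes canonically equivalent sequences, so the superscript,
the ligature and the fullwidth letter survive it; NFKC folds all three
into their plain counterparts. A scheme-aware comparison would apply
something like normalize_component to each part before comparing,
rather than one normalization form to the whole string.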

Michel

Received on Wednesday, 16 April 2003 17:04:40 UTC