Re: Some issues with the IRI document [nfcnfkc-04] from Martin Duerst on 2003-04-15 (public-iri@w3.org from April 2003)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 15 Apr 2003 17:22:13 -0400
To: Paul Hoffman / IMC <phoffman@imc.org>, public-iri@w3.org
Message-Id: <4.2.0.58.J.20030410171239.05ff7370@localhost>

Hello Paul,

Many thanks for your comments, and for resending them to this list.
I'll address them one by one (not necessarily in your order) to
associate them with existing issues or assign new issues.

At 08:14 03/04/08 -0700, Paul Hoffman / IMC wrote:

>[[ I sent this to Martin a few weeks ago, but we agreed that it was best 
>to bring up on this new list. And, just to be clear, I don't think that 
>IRIs should be debated to death, but the document needs to be clear. ]]
>
>I think the document is fairly good, although I'm not much of a URI 
>person, so I could be way off. After studying it this morning, I see where 
>I got confused. I also disagree with some technical choices you make later 
>in the spec.

>Technical issues:
>
>You use NFC in Section 3.1. This goes against the theme of the guidelines 
>in section 6. NFKC will cause less surprise if an IRI contains 
>compatibility characters, so you should use NFKC instead, regardless of 
>the history of NFC in the W3C.

I have noted this issue as
http://www.w3.org/International/iri-edit/Overview.html#nfcnfkc-04

The guidelines in section 6 are non-normative. They just point out that
if you have a choice, using NFKC is obviously better. Section 3.1 is
normative, and defines interoperability requirements.

Whether NFKC or NFC will cause less surprise will depend on the case
at hand. NFKC does things such as removing circles around characters,
which may come at a real surprise. NFKC also removes distinctions that
most file systems are able to make, and can be expected to continue to
make. A typical example would be half-width katakana vs. full-width
katakana.

Overall, the normalization strategy on IRIs varies according to the
place in the URI:

- For domain name part: use NFKC or more (i.e. nameprep), but
   gets normalized again (with nameprep) when doing dns lookup.
- For the path part: preferably NFKC, but NFC is okay when needed.
- For the query part: There may be cases where you on purpose
   want to use something totally unnormalized (e.g. when submitting
   unnormalized data to a CGI script that normalizes).

Does that sound reasonable? Do you think it needs any changes in the
draft, and if yes, what would be those changes?

Regards,    Martin.

Received on Tuesday, 15 April 2003 17:36:29 UTC