Re: Some issues with the IRI document [nfcnfkc-04] from Martin Duerst on 2003-04-16 (public-iri@w3.org from April 2003)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 16 Apr 2003 14:43:46 -0400
To: Paul Hoffman / IMC <phoffman@imc.org>, public-iri@w3.org
Message-Id: <4.2.0.58.J.20030416141855.048548c8@localhost>

At 19:20 03/04/15 -0700, Paul Hoffman / IMC wrote:

>At 5:22 PM -0400 4/15/03, Martin Duerst wrote:
>>Overall, the normalization strategy on IRIs varies according to the
>>place in the URI:
>>
>>- For domain name part: use NFKC or more (i.e. nameprep), but
>>   gets normalized again (with nameprep) when doing dns lookup.
>>- For the path part: preferably NFKC, but NFC is okay when needed.
>>- For the query part: There may be cases where you on purpose
>>   want to use something totally unnormalized (e.g. when submitting
>>   unnormalized data to a CGI script that normalizes).
>>
>>Does that sound reasonable? Do you think it needs any changes in the
>>draft, and if yes, what would be those changes?
>
>It doesn't sound reasonable if you intend IRI comparison to be 
>interoperable. If you don't intend IRI comparison to be interoperable, I 
>still would pick one normalization for each of the three parts, and I 
>would pick NFKC, but you don't have to be consistent if interoperability 
>isn't important.
>
>Am I the only person who worries about IRI comparison being interoperable?

The W3C TAG has discussed URI equivalence for a long time.
See http://www.w3.org/2001/tag/ilist#URIEquivalence-15 for more pointers.
The result of that discussion is being incorporated into the update of
RFC 2396 (see
http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html#rfc.section.6),
and its consequences are considered for the IRI draft under
http://www.w3.org/International/iri-edit#iriequivalence-01.

The main conclusion of that discussion is that different
situations require or use different kinds of equivalence.
These range from simple codepoint-by-codepoint equivalence
(used e.g. in XML namespaces) to highly sophisticated heuristics
used by spiders and robots.
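
To make the difference between these kinds of equivalence concrete,
here is a minimal Python sketch. The example.org IRIs and the two
helper functions are made up for illustration; only
unicodedata.normalize is a real library call. It shows two strings
that differ codepoint-by-codepoint but match under NFC, and two that
match only under NFKC.

```python
import unicodedata


def codepoint_equal(a: str, b: str) -> bool:
    """Strict comparison: same code points in the same order."""
    return a == b


def normalized_equal(a: str, b: str, form: str) -> bool:
    """Compare after applying one Unicode normalization form to both sides."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Hypothetical IRIs, written with escapes so the code points are explicit.
precomposed = "http://example.org/r\u00e9sum\u00e9"   # e-acute as U+00E9
decomposed = "http://example.org/re\u0301sume\u0301"  # e + U+0301 combining acute
ligature = "http://example.org/\ufb01le"              # U+FB01 LATIN SMALL LIGATURE FI
plain = "http://example.org/file"

print(codepoint_equal(precomposed, decomposed))         # strict comparison
print(normalized_equal(precomposed, decomposed, "NFC"))  # canonical equivalence
print(normalized_equal(ligature, plain, "NFC"))          # ligature survives NFC
print(normalized_equal(ligature, plain, "NFKC"))         # compatibility equivalence
```

Which of these answers is "right" depends on the application, which
is the point: a namespace processor would use the first, while other
software may want one of the others.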

The consequences, for URIs, are described in section 6 of
http://www.textuality.com/tag/uri-comp-4. Similar consequences
apply for IRIs.

Because we do not know how a particular field is treated with
respect to text normalization, it would be inappropriate to
allow third parties to normalize an IRI arbitrarily. This does
not contradict the recommendation that if you create a resource,
and an IRI for that resource, you try to be as normalized as
possible (i.e. NFKC if possible).
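
As a rough sketch of that division of labor (the function names and
the example path are hypothetical, not taken from the draft): the
party that creates the IRI normalizes once, and anybody passing the
IRI along leaves it untouched.

```python
import unicodedata


def mint_path_segment(name: str) -> str:
    """Creator side: normalize once, when the IRI is first generated."""
    return unicodedata.normalize("NFKC", name)


def relay_iri(iri: str) -> str:
    """Third-party side: pass the IRI along exactly as received."""
    return iri


# Fullwidth letters spelling "Doc"; NFKC folds them to plain ASCII.
fullwidth = "\uff24\uff4f\uff43"

created = "http://example.org/" + mint_path_segment(fullwidth)
received = "http://example.org/" + fullwidth  # someone else's unnormalized IRI

print(created)
print(relay_iri(received) == received)  # relaying never rewrites the IRI
```

The relay function is deliberately trivial. A third party cannot know
whether the resource owner distinguishes the normalized and
unnormalized forms, so the only safe operation is no operation.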

This is similar to the upper-case/lower-case problems in URIs.
Some parts of a URI are case-insensitive (e.g. the domain name).
For others, it is unclear whether they are case-sensitive,
case-insensitive, or something in between (e.g. case-sensitive,
but if you use the wrong case, you get a result such as a
'Multiple Choices' response). The best recipe we have there is
that the generator uses a reasonable capitalization, and that
when transferring the URI, you don't change capitalization.
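
A small Python sketch of a comparison that follows this recipe, under
the assumption that only the parts known to be case-insensitive are
folded. The function name and URIs are made up; urlsplit is from the
standard library and already lower-cases the scheme and the hostname
attribute. Userinfo is ignored to keep the sketch short.

```python
from urllib.parse import urlsplit


def conservative_equal(a: str, b: str) -> bool:
    """Fold case only where it is known not to matter; compare the rest exactly."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (
        # urlsplit lower-cases .scheme and .hostname for us.
        (pa.scheme, pa.hostname, pa.port)
        == (pb.scheme, pb.hostname, pb.port)
        # Path, query and fragment may be case-sensitive: compare as given.
        and (pa.path, pa.query, pa.fragment)
        == (pb.path, pb.query, pb.fragment)
    )


print(conservative_equal("http://Example.ORG/Doc", "http://example.org/Doc"))
print(conservative_equal("http://example.org/Doc", "http://example.org/doc"))
```

The second pair may well name the same resource on a given server,
but a client has no way to know that, so the sketch reports them as
different.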

Regards,    Martin.

Received on Wednesday, 16 April 2003 15:09:13 UTC