- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 16 Apr 2003 14:43:46 -0400
- To: Paul Hoffman / IMC <phoffman@imc.org>, public-iri@w3.org
At 19:20 03/04/15 -0700, Paul Hoffman / IMC wrote: >At 5:22 PM -0400 4/15/03, Martin Duerst wrote: >>Overall, the normalization strategy on IRIs varies according to the >>place in the URI: >> >>- For domain name part: use NFKC or more (i.e. nameprep), but >> gets normalized again (with nameprep) when doing dns lookup. >>- For the path part: preferably NFKC, but NFC is okay when needed. >>- For the query part: There may be cases where you on purpose >> want to use something totally unnormalized (e.g. when submitting >> unnormalized data to a CGI script that normalizes). >> >>Does that sound reasonable? Do you think it needs any changes in the >>draft, and if yes, what would be those changes? > >It doesn't sound reasonable if you intend IRI comparison to be >interoperable. If you don't intend IRI comparison to be interoperable, I >still would pick one normalization for each of the three parts, and I >would pick NFKC, but you don't have to be consistent if interoperability >isn't important. > >Am I the only person who worries about IRI comparison being interoperable? The W3C TAG has discussed URI equivalence for a long time. See http://www.w3.org/2001/tag/ilist#URIEquivalence-15 for more pointers. The result of that discussion is being incorporated into the update of RFC 2396, (see http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html#rfc.section.6) and its consequences are considered for the IRI draft under http://www.w3.org/International/iri-edit#iriequivalence-01. The result of that discussion is that there are many different conditions which require or use different kinds of equivalence. These range from simple codepoint-by-codepoint equivalence (used e.g. in XML namespaces) to highly sophisticated heuristics used by spiders and robots. The consequences, for URIs, are described in section 6 of http://www.textuality.com/tag/uri-comp-4. Similar consequences apply for IRIs. Because we do not know how a particular field is treated with respect to text normalization, it would be inappropriate to allow third parties to normalize an IRI arbitrarily. This does not contradict the recommendation that if you create a resource, and an IRI for that resource, you try to be as normalized as possible (i.e. NFKC if possible). This is similar to the upper-case/lower-case problems in URIs. Some parts of an URI are case-insensitive (domain name). For others, it is unclear whether they are case-sensitive or case-insensitive, or something in between (e.g. case-sensitive, but if you use the wrong case, you get a result such as a 'Multiple choices' or so). The best recipe we have there is that the generator uses a reasonable capitalization, and when transfering the URI, you don't change capitalization. Regards, Martin.
Received on Wednesday, 16 April 2003 15:09:13 UTC