- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 02 Sep 1998 12:54:04 +0900
- To: Leslie Daigle <leslie@Bunyip.Com>
- Cc: Larry Masinter <masinter@parc.xerox.com>, URI distribution list <uri@Bunyip.Com>
Hello Leslie,

At 10:15 98/09/01 -0400, Leslie Daigle wrote:

> In particular, it isn't clear to me what "it is useful if unaccented
> characters are accepted, when possible, as aliases for accented
> characters". Consider,
>
> in French, "é" is "e with an acute accent"
> in Swedish, "ö" is a completely different letter than "o", to
> the extent that it appears in a completely different place
> in alphabetic ordering.

Alain has given some very good explanations here. The answer is: It depends. Actually, the answer is already "it depends" for the current URIs, with respect to case. I don't think it's realistic to expect us to improve on what hasn't been done better up to now in the very limited ASCII range.

That said, I am of course very concerned to get things working as well as possible. URIs are not the only place we get into such problems. In W3C, several working groups have made requests for guidance in this area to the W3C I18N WG, and this WG has already published a working draft of requirements for some of the things that you mention above (and some others). Please have a look at http://www.w3.org/TR/WD-charreq; comments are very welcome.

Getting back to URIs specifically, I see at least three levels that we have to address:

- A minimum that should be achieved by normalization at the origin; this is mainly to eliminate pure encoding duplicates such as those that appear with precomposed/decomposed characters. At W3C, we are coordinating this work with Unicode; they have already issued a draft on this issue (http://www.unicode.org/unicode/reports/tr15/), on which comments are also welcome.
- Some larger equivalences that may be offered as "quality of service" (e.g. case-insensitivity for the directory/file component on many HTTP servers) or may be part of the protocol, scheme, or scheme component (e.g. case folding for domain names).
- An even larger class of equivalences that would be used e.g. for tools that check for spoofing attempts. This may include things such as wrongly interpreted encodings (e.g. something that is actually Latin-1 instead of UTF-8) and almost everything that did not fit into the previous level for a particular case.

Regards, Martin.
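
For illustration only, the three levels described in this message could be sketched roughly as follows in Python. This is a minimal sketch under assumptions, not anything specified in the message or in the drafts it cites: the function names are hypothetical, NFC is assumed as the normalization form for the first level, and accent stripping is only one crude example of the broad third class.

```python
import unicodedata


def normalize_at_origin(text: str) -> str:
    """Level 1 (assumed to be NFC): remove precomposed/decomposed duplicates."""
    return unicodedata.normalize("NFC", text)


def service_equivalent(a: str, b: str) -> bool:
    """Level 2: optional looser match, here normalization plus case folding."""
    def key(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return key(a) == key(b)


def strip_accents(text: str) -> str:
    """Decompose, then drop combining marks (hypothetical helper for level 3)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def may_be_confused(a: str, b: str) -> bool:
    """Level 3: very loose comparison, suitable only for spoof checking."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


def misread_utf8_as_latin1(text: str) -> str:
    """Show what UTF-8 bytes look like when wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


precomposed = "\u00e9"   # e with acute accent as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(precomposed == decomposed)                        # differ before level 1
print(normalize_at_origin(precomposed)
      == normalize_at_origin(decomposed))               # equal after level 1
print(service_equivalent("R\u00e9sum\u00e9", "re\u0301sume\u0301"))
print(may_be_confused("\u00f6", "o"))                   # Swedish o-umlaut vs. o
print(repr(misread_utf8_as_latin1("\u00e9")))
```

Note that `may_be_confused` treats the Swedish letter and plain "o" as alike, which is wrong for identifying resources but may be acceptable for a tool that only flags possible spoofing; that is why such equivalences belong to the third level rather than the first.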
Received on Wednesday, 2 September 1998 04:43:59 UTC