- From: Leslie Daigle <leslie@Bunyip.Com>
- Date: Wed, 2 Sep 1998 09:27:20 -0400 (EDT)
- To: "Martin J. Duerst" <duerst@w3.org>
- cc: Larry Masinter <masinter@parc.xerox.com>, URI distribution list <uri@Bunyip.Com>
Howdy,

On Wed, 2 Sep 1998, Martin J. Duerst wrote:

> - A minimum that should be achieved by normalization at the origin;
>   this is mainly to eliminate pure encoding duplicates such as they
>   appear with precomposed/decomposed. At W3C, we are coordinating
>   this work with Unicode; they have already issued a draft on this
>   issue (http://www.unicode.org/unicode/reports/tr15/), on which
>   also comments are welcome.

Good.

> - An even larger class of equivalences that would be used e.g. for
>   tools that check for spoofing attempts. This may include things
>   such as wrongly interpreted encodings (e.g. something that is
>   actually Latin-1 instead of UTF-8,...) and almost everything that
>   didn't go into the last item for a particular case.

Okay.  But,

> - Some larger equivalences that may be offered as "quality of service"
>   (e.g. for the directory/file component and case-insensitivity for
>   many HTTP servers) or may be part of the protocol/scheme/scheme
>   component,... (e.g. case folding for domain names).

My only point is that, where "similarity matching" is provided as a
user aid, the service provider loses "discrimination capability", and
all parties MUST be working from the same rules.

As an obviously-constructed example, let's look at Alain's example of
"du" and "dû".  (This is only peripherally about case matching now, and
more generally about what should be considered a match in fuzzy
matching.)

A service might use

  http://someservice.com/clientaccounts/client_no/montant_du_mois

for a customer to see the amount of service usage for the month
(measured in hits, if it's a search service), and

  http://someservice.com/clientaccounts/client_no/montant_dû_mois

for a customer to see the account balance for the month.  (I.e.,
"amount due (month)", as opposed to "amount due (annual)").

As I said, the example is a bit forced, but that's due to my lack of
imagination, not because it's an unrealistic concern.
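To make the two paths above concrete, here is a minimal Python sketch.
It is an illustration only: the fold_accents() rule is a hypothetical
"similarity" policy made up for this example, not something any
particular HTTP server is known to implement.  It contrasts canonical
normalization (NFC, which only merges precomposed/decomposed
duplicates) with an accent-folding rule that would make the two paths
indistinguishable.

```python
import unicodedata


def nfc(path: str) -> str:
    """Canonical normalization only: merges precomposed/decomposed forms."""
    return unicodedata.normalize("NFC", path)


def fold_accents(path: str) -> str:
    """Hypothetical similarity rule: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", path)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Paths taken from the example above.
usage = "/clientaccounts/client_no/montant_du_mois"
# The same "dû" path written two ways: one code point, or u + combining mark.
balance_precomposed = "/clientaccounts/client_no/montant_d\u00fb_mois"
balance_decomposed = "/clientaccounts/client_no/montant_du\u0302_mois"

# NFC removes the pure encoding duplicate...
same_after_nfc = nfc(balance_precomposed) == nfc(balance_decomposed)
# ...but still keeps "du" and "dû" apart.
distinct_after_nfc = nfc(usage) != nfc(balance_precomposed)
# The accent-folding rule makes the two resources collide.
collide_after_folding = fold_accents(usage) == fold_accents(balance_precomposed)

print(same_after_nfc, distinct_after_nfc, collide_after_folding)
```

Under these assumptions, a provider whose software applies only NFC can
offer both resources, while one whose software applies the folding rule
cannot.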
If we say that "u" and "û" (or "U" and "Û", if I repeat the above
example in all caps) should be considered as matching for URIs, then
the service provider cannot provide those two as distinct URIs.

Whether that is too limiting, entrenches the poor support for other
languages in Internet technologies, or is perfectly acceptable is a
separate discussion.

What would be dangerous, in my opinion, is to NOT make it CLEAR whether
or not these things should match (algorithmically, deterministically),
because the service might legitimately construct a web site using these
two URIs, serve it reliably using the HTTP server of BrandA for years,
switch to the HTTP server of BrandB because it provides better
performance, and suddenly wonder why customers are unable to access

  http://someservice.com/clientaccounts/client_no/montant_dû_mois

and eventually discover it's because BrandB considers "u" and "û"
similar enough to match.

Repeat the experiment for client software, and cache software and...

Leslie.

----------------------------------------------------------------------------
  If cats had bumper stickers:              Leslie Daigle
     "I wake for food."                     Bunyip Information Systems
     -- ThinkingCat                         (514) 875-8611
                                            leslie@bunyip.com
----------------------------------------------------------------------------
Received on Wednesday, 2 September 1998 09:49:00 UTC