for the record...

Leslie Daigle (leslie@Bunyip.Com)
Tue, 8 Sep 1998 10:29:57 -0400 (EDT)


Date: Tue, 8 Sep 1998 10:29:57 -0400 (EDT)
From: Leslie Daigle <leslie@Bunyip.Com>
To: uri@Bunyip.Com
Message-ID: <Pine.SUN.3.95.980908102619.11768A-100000@mocha.bunyip.com>
Subject: for the record...


It's been pointed out to me that my mail software was (argh!) incorrectly
set up to transmit 8-bit characters...  so, recent posts of mine were
fairly hard to decipher, as I understand it.

Attached, for the sake of clarity, is a resend of my most recent/detailed
message; hopefully the non-ASCII characters will be correctly transmitted
this time.

Leslie.

----------------------------------------------------------------------------

    If cats had bumper stickers:                  Leslie Daigle
                                             
      "I wake for food."                          Bunyip Information Systems
                -- ThinkingCat                    (514) 875-8611
                                                  leslie@bunyip.com
----------------------------------------------------------------------------

---------- Forwarded message ----------
Date: Wed, 2 Sep 1998 09:27:20 -0400 (EDT)
From: Leslie Daigle <leslie@bunyip.com>
To: "Martin J. Duerst" <duerst@w3.org>
Cc: Larry Masinter <masinter@parc.xerox.com>,
    URI distribution list <uri@bunyip.com>
Subject: RE: iDNR, an alternative name resolution protocol

Howdy,

On Wed, 2 Sep 1998, Martin J. Duerst wrote:
> 
> - A minimum that should be achieved by normalization at the origin;
>   this is mainly to eliminate pure encoding duplicates such as they
>   appear with precomposed/decomposed. At W3C, we are coordinating
>   this work with Unicode; they have already issued a draft on this
>   issue (http://www.unicode.org/unicode/reports/tr15/), on which
>   also comments are welcome.

Good.


> - An even larger class of equivalences that would be used e.g. for
>   tools that check for spoofing attempts. This may include things
>   such as wrongly interpreted encodings (e.g. something that is
>   actually Latin-1 instead of UTF-8,...) and almost everything that
>   didn't go into the last item for a particular case.

Okay.

But,


> - Some larger equivalences that may be offered as "quality of service"
>   (e.g. for the directory/file component and case-insensitivity for
>   many HTTP servers) or may be part of the protocol/scheme/scheme
>   component,... (e.g. case folding for domain names).

My only point is that, where there is "similarity matching", provided
as a user aid, there is loss of "discrimination capability" for
the service provider, and all parties MUST be working from the same rules.

As an obviously-constructed example, let's look at Alain's example
of "du" and "dû".  (This is only peripherally about case matching, now,
and more generally about what should be considered matches in fuzzy
matching).  

A service might use 

	http://someservice.com/clientaccounts/client_no/montant_du_mois
	    for a customer to see the amount of service usage for the
	    month (measured in hits, if it's a search service).
	    
	http://someservice.com/clientaccounts/client_no/montant_dû_mois
	    for a customer to see the account balance for the month.
	    (I.e., "amount due (month)", as opposed to "amount due (annual)").

As I said, the example is a bit forced, but that's due to my lack of
imagination, not that it's an unrealistic concern.

If we say that "u" and "û" (or "U" and "Û", if I repeat the above
example in all caps) should be considered as matching for URIs, then 
the service provider cannot provide those two as distinct URIs.
Whether that is too limiting, entrenches the poor support for other
languages in Internet technologies, or is perfectly acceptable is a
separate discussion.  

What would be dangerous, in my opinion, is to NOT make it CLEAR
whether or not these things should match (algorithmically, deterministically),
because the service might legitimately construct a web site using 
these two URIs, serve it reliably using HTTP server of BrandA for years,
switch to HTTP server of BrandB because it provides better performance,
and suddenly wonder why customers are unable to access

	http://someservice.com/clientaccounts/client_no/montant_dû_mois

and eventually discover it's because BrandB considers "u" and "û"
similar enough to match.  Repeat the experiment for client software,
and cache software and...
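
To make the BrandA/BrandB scenario concrete, here is a rough sketch in
Python.  It is only an illustration, not how any real HTTP server works:
fold_accents() and build_routes() are names I made up, and the folding rule
(decompose, then drop combining marks) is just one assumed way a server
might decide that "u" and "û" are "similar enough".

```python
import unicodedata


def fold_accents(path: str) -> str:
    """Hypothetical fuzzy rule: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", path)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_routes(paths, key=lambda p: p):
    """Group published paths by the key a server uses to compare them."""
    routes = {}
    for p in paths:
        routes.setdefault(key(p), []).append(p)
    return routes


# The two paths from the example; \u00fb is u with circumflex.
PATHS = [
    "/clientaccounts/client_no/montant_du_mois",
    "/clientaccounts/client_no/montant_d\u00fb_mois",
]

# "BrandA" (assumed): exact comparison, so both resources stay distinct.
brand_a = build_routes(PATHS)

# "BrandB" (assumed): accent-insensitive comparison, so they collide.
brand_b = build_routes(PATHS, key=fold_accents)

# Any key that maps to more than one published path is an ambiguity.
collisions = {k: v for k, v in brand_b.items() if len(v) > 1}
```

Under the exact rule the two paths are separate entries; under the folding
rule they share one key, so one of the two resources can no longer be
addressed on its own.  Neither rule is "right" here; the trouble comes from
the two servers not agreeing on which one applies.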

Leslie.



----------------------------------------------------------------------------

    If cats had bumper stickers:                  Leslie Daigle
                                             
      "I wake for food."                          Bunyip Information Systems
                -- ThinkingCat                    (514) 875-8611
                                                  leslie@bunyip.com
----------------------------------------------------------------------------