Date: Tue, 6 May 1997 20:50:15 +0200 (MET DST)
From: "Martin J. Duerst" <email@example.com>
To: Larry Masinter <firstname.lastname@example.org>
cc: "Alain LaBont/e'/" <email@example.com>, URI mailing list <firstname.lastname@example.org>
Subject: Re: "Difficult Characters" draft
In-Reply-To: <336F5302.64F7@parc.xerox.com>
Message-ID: <Pine.SUN.3.96.970506203138.245T-100000@enoshima>

Larry,

Many thanks for your interesting thoughts.

On Tue, 6 May 1997, you wrote:

> Perhaps you could mention in your draft about the use of
> identifiers with characters outside of ASCII that such
> use is actually problematic,

Well, it is of course a little less easy for some people than
using ASCII.

> and that some applications
> which use canonical identifiers and exact match as a way
> of doing symbol lookup when restricted to ASCII-only symbols
> might find that users of languages other than English
> will be ill-served by such a design; in some applications
> using a careful language-sensitive equivalence lookup
> (instead of exact-match) would make the software actually
> accommodate the needs and practices of such users.

Well, there have been some interesting examples. But frankly
speaking, I don't think that the average French speaker should
have more difficulty transcribing the correct accents than the
average US user has getting case correct. Quite to the contrary,
case in URLs is often rather random, whereas accents in a known
word can easily be reconstructed. Also, having to test all
possible accented and non-accented versions of a word is much
easier than having to test all the different capitalizations of
a word.

What remains is most probably the basic surprise when a French
user is first confronted with an accented URL. She might not
believe it is real and just type it in without accents, and get
an error :-).

> The mail in the recent week has been full of good examples
> of places where canonicalization is either ill-specified
> or context-sensitive, and "equivalence matching"
> would be far more practical.

Equivalence matching could also save US users from a lot of
typos. But nobody ever cared to do equivalence matching for
them. It's assumed that the user types things in correctly.

> Fortunately, it's possible that equivalence-based matching
> could be deployed for URLs;

That's interesting. But it would be a lot more work than the
conversions from and to UTF-8 that I have suggested for
backwards compatibility and that have raised great concerns
from Roy.

> other kinds of exact-match
> names will require a separate analysis. Both DNS and HTTP-servers
> (if not FTP servers) could be coaxed into doing equivalence-matching
> instead of exact matching for reference lookup; if they also respond
> with the server's view of the "canonical" name, then we
> won't be asking clients to do what it seems like is nearly
> impossible.

What do you mean by "nearly impossible"?

Is it the implementation of the normalization algorithms that
I propose? (One could send a URL to some "normalization server"
if that's really considered a problem.)

Is it a consistent collection of normalization rules and
warnings about codepoints not to use?

Is it that you think we are asking too much of the user?
We don't want to ask more of the French user than of the US
user, relative to his/her language abilities. And up to now,
we don't.

Regards, Martin.
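
The server-side equivalence matching discussed above, where a lookup
tolerates missing accents or wrong case and the server answers with its
own "canonical" name, might look roughly like the following. This is
only a minimal sketch under stated assumptions: the names fold() and
EquivalenceIndex are hypothetical, and the folding rule (Unicode NFD
decomposition, dropping combining marks, ignoring case) is a crude
stand-in for a real language-sensitive equivalence, which neither the
draft nor this thread specifies.

```python
import unicodedata


def fold(name: str) -> str:
    """Hypothetical equivalence key: strip combining marks, ignore case."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", name)
    # combining() returns 0 for characters that are not combining marks.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.casefold()


class EquivalenceIndex:
    """Maps folded keys to the canonical names the server actually stores."""

    def __init__(self, canonical_names):
        self._by_key = {}
        for name in canonical_names:
            self._by_key.setdefault(fold(name), []).append(name)

    def lookup(self, requested):
        """Return the server's canonical name, or None if absent or ambiguous."""
        candidates = self._by_key.get(fold(requested), [])
        if requested in candidates:
            # An exact match always wins, as with plain exact-match lookup.
            return requested
        if len(candidates) == 1:
            # Unique equivalent: answer with the canonical spelling.
            return candidates[0]
        # No match, or several names collapse to one key: refuse to guess.
        return None


index = EquivalenceIndex(["/résumé.html", "/Index.html", "/côte", "/côté"])
print(index.lookup("/resume.html"))  # unaccented request, unique equivalent
print(index.lookup("/cote"))         # two stored names share this key
```

Returning None when several stored names share a key is a deliberate
choice in this sketch: the client is told the name was not resolved
rather than being given a guess, which is one place where such a simple
folding rule falls short of what a user of a given language would
expect.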