Date: Wed, 7 May 1997 11:23:12 +0200 (MET DST)
From: "Martin J. Duerst" <email@example.com>
To: "Alain LaBont/e'/" <firstname.lastname@example.org>
cc: URI mailing list <email@example.com>
Subject: Re: "Difficult Characters" draft
In-Reply-To: <firstname.lastname@example.org>
Message-ID: <Pine.SUN.3.96.970507104936.245Y-100000@enoshima>

On Mon, 21 Apr 1997, Alain LaBont/e'/ wrote:

> Don't forget that French French don't have all uppercase letters on their
> PC keyboards... even if there are Canadian standards (CAN/CSA Z243.200) and
> ISO (ISO/IEC 9995-3) standards for doing so. So capitalization remains a
> problem in practice for the French people on upper case letters. Some
> French keyboards have this, not all.

I think we pretty much agree that we should discourage URLs
with accented uppercase letters.

> Not so, I demonstrated this in my earlier note about my insurance agent web
> page. People care (of course), servers care, or browsers care and whoever
> or whichever does the correction, the net result is that equivalences are
> done today and end-users got used to this... at least some... and likely a
> big lot.

End users who have the perception that URLs ignore case will meet
bad surprises and will have to correct their opinion some day.
And the main reason that we have case equivalence in DNS is the
time at which DNS was created, when case distinction was not
something you could assume a computer could do (human beings
have always been able to do it :-).

> >> Fortunately, it's possible that equivalence-based matching
> >> could be deployed for URLs;
> >
> >That's interesting. But it would be a lot more work than the
> >conversions from and to UTF-8 that I have suggested for backwards
> >compatibility and that have raised great concerns from Roy.
>
> There exists methods for this in actual practice and it is about to be
> standardized in ISO/IEC 14651 which defines an API for character string
> comparisons at different level of precision.

It works if you have the expectations of the user available when
doing the equivalence. But it doesn't work otherwise. This is easily
shown by example. Assume somebody in Turkey puts up a server, and
installs this server so that equivalences are done on the various
variants of I according to Turkish expectations (matching uppercase
and lowercase dotted i, and uppercase and lowercase dotless I).
Now assume that there is an URL http://www.xxx.com/izmir. If this
is accessed by a Western European user, and this user types
HTTP://WWW.XXX.COM/IZMIR, the URL won't match, because for the
Turkish server the "I" doesn't match the "i".

This will be a rare case, but it will be all the more surprising.
It will be impossible for an average user to learn the message
"Always care about case to be on the safe side", because there
are not enough examples to strengthen this message. But it will
nevertheless still be true; it will still be the only thing that
guarantees a response.

> >We don't want to ask the French user more than the US user,
> >when compared to his/her language abilities. And up to now,
> >we don't.
>
> You do. If equivalences are not processed adequately, given that
> equivalence processing exists today. You ask either exact match or match
> independent of case but dependent on accents... that's not good enough...

Well, I don't actually propose that. I just say that it wouldn't
be too strange to consider case equivalences but not accent
equivalences. In sorting, accents also have higher distinctive
power than case, don't they?

> See ISO/IEC CD 14651 or CAN/CSA Z243.4.1 (published in 1992, revised this
> year -- characters have been added but the logic is the same) and CAN/CSA
> Z243.230 (this one to be published this year)...

The above standard is a sorting standard that can be used for
matching on various levels and for searching, and can be tailored
to various user expectations.

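To make the Turkish example above concrete, here is a minimal Python
sketch of how the same typed URL path matches or fails depending on
the tailoring in use. The helper names (fold_default, fold_turkish,
matches) are hypothetical, and the Turkish folding is hard-coded as
an assumption for illustration; it is not taken from ISO/IEC 14651
or from any collation library.

```python
def fold_default(s: str) -> str:
    # Language-neutral folding: "I" and "i" count as the same letter.
    return s.casefold()


def fold_turkish(s: str) -> str:
    # Assumed Turkish tailoring, hard-coded for illustration:
    # "I" pairs with dotless "ı" (U+0131), dotted "İ" (U+0130) with "i".
    return s.replace("I", "\u0131").replace("\u0130", "i").casefold()


def matches(stored: str, typed: str, fold) -> bool:
    # Two paths are equivalent if they fold to the same string.
    return fold(stored) == fold(typed)


stored = "izmir"  # path as published by the server
typed = "IZMIR"   # path as typed by a Western European user

print(matches(stored, typed, fold_default))               # True
print(matches(stored, typed, fold_turkish))               # False
print(matches(stored, "\u0130ZM\u0130R", fold_turkish))   # True
```

The typed string is identical in the first two calls; only the
tailoring differs, and nothing in the URL tells the server which
tailoring the user had in mind.
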
But such tailorable methods don't work for URLs, because they
would need tailoring options to be transmitted with the URL from
the client to the server (and these tailoring options can
necessitate a rather large data volume in some cases). Also, they
are unsuitable and lead to surprises for the users if they are not
applied on all servers and services (which is impossible).

I agree that it would be great to have a lot of user-friendliness,
with servers correcting all kinds of mistakes, from case to accents
to spelling and whatnot. But I think it is wrong to create
expectations that can only partially be fulfilled and will confuse
the user.

The strategy "Copy it exactly, with case and everything." is about
5% less user friendly than a highly sophisticated and user-tailored
equivalence engine, in particular if irregular casings (or uppercase
in general where it is not part of the grammar as in German) and
unusual case-accent combinations (such as French uppercase accented
characters) are avoided. At least 99% of the users of bicameral
scripts can easily distinguish case and so on if they are told to
do so. [The remaining 1% or less are the people who might have
problems distinguishing similar-looking letters such as 'd', 'b',
'q', and 'p' and so on.]

So in the end, the strategy "Copy it exactly, with case and
everything." is much more user friendly, because it is the only
one that works consistently.

Regards,	Martin.