- From: John C Klensin <klensin@jck.com>
- Date: Wed, 21 Aug 2013 16:48:56 -0400
- To: Shawn Steele <Shawn.Steele@microsoft.com>, Gervase Markham <gerv@mozilla.org>, "Jungshik SHIN (신정식)" <jshin1987@gmail.com>
- cc: Simon Montagu <smontagu@smontagu.org>, public-iri@w3.org, uri@w3.org, idna-update@alvestrand.no, Peter Saint-Andre <stpeter@stpeter.im>, Anne van Kesteren <annevk@annevk.nl>, "www-tag.w3.org" <www-tag@w3.org>
--On Wednesday, August 21, 2013 17:05 +0000 Shawn Steele <Shawn.Steele@microsoft.com> wrote:

> IMO, the eszett & even more so, final sigma, are somewhat
> display issues. My personal opinion is we need a display
> standard (yes, that's not easy).

Indeed. But it might be worth some effort.

> A non-final sigma isn't (my understanding) a valid form of the
> word, so you shouldn't ever have both registered. It could
> certainly be argued that 2003 shouldn't have done this
> mapping. If these are truly mutually exclusive, then the
> biggest problem with 2003 isn't a confusing canonical form,
> but rather that it doesn't look right in the 2003 canonical
> form. However there's no guarantee in DNS that I can have a
> perfect canonical form for my label. Microsoft for example,
> is a proper noun, however any browser nowadays is going to
> display microsoft.com, not Microsoft.com. (Yes, that's
> probably not "as bad" as the final sigma example).

Right. But I think that you are at risk of confusing two issues. One is that, if the needs of the DNS were the only thing that drove Unicode decisions, we all had perfect hindsight and foresight, and it was easy to make retroactive or flag day corrections, probably all position-dependent (isolated, initial, medial, final in the general case) character variations would be assigned only a code point for the base character, with the positional stuff viewed strictly as a display issue (possibly with an overriding qualifier code point). That would have meant no separate code point for a final sigma in Greek; no separate code points for final Kaf, Mem, Nun, Pe, or Tsadi in Hebrew; and so on -- i.e., the way the basic Arabic block was handled before the presentation forms were added. If things had been done that way, some of these things would have been entirely display issues, with the only difficult question for IDNA being whether to allow the presentation qualifier so as to permit preserving word distinctions in concatenated strings -- in a one-case script, selective use of final or initial character forms would provide the equivalent of using "DigitalResearch" or "SonyStyle" as a distinctive domain name.

But it wasn't done that way. I can identify a number of reasons why it wasn't and indeed why, on balance, it might have been a bad idea. I assume Mark or some other Unicode expert would have a longer list of such reasons than I do.

So we cope. To a first order approximation, the IDNA2003 method of coping was to try to map all of the alternate presentation forms together... except when it didn't. And, to an equally good approximation, IDNA2008 deals with it by disallowing the alternate presentation forms... except when it doesn't. The working group was convinced that the second choice was less evil (or at least less of a problem) than the first one, but I don't think anyone would really argue that either choice is ideal, especially when it cannot be applied consistently without a lot of additional special-case, code point by code point, rules. Hard problem but, if we come back to the question from Anne that started this thread, I don't think there is any good basis to argue that the IDNA2003 approach is fundamentally better. It is just the approach that we took first, before we understood the problems with it.
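(For anyone who wants to see that difference concretely, here is a rough Python sketch -- my own illustration, nothing normative. It assumes the third-party "idna" package for the IDNA2008/UTS #46 side, with the interpreter's built-in 'idna' codec standing in for IDNA2003 behavior. I am not asserting the exact A-labels these libraries emit; the point is only which spellings collapse together and which stay distinct.)

```python
import idna  # pip install idna; an IDNA2008 + UTS #46 implementation

for label in ("fußball", "fussball"):
    # IDNA2003 (Python's built-in 'idna' codec): nameprep case-folds
    # ß to "ss", so both spellings collapse to the same ASCII label.
    idna2003 = label.encode("idna")

    # IDNA2008 (idna package): ß is PVALID, so the ß spelling gets its
    # own xn-- A-label and stays distinct from the "ss" spelling.
    idna2008 = idna.encode(label)

    # UTS #46 "transitional" processing (roughly what mapping-oriented
    # clients have done) layers the old ß -> ss mapping on top of IDNA2008.
    transitional = idna.encode(label, uts46=True, transitional=True)

    print(label, idna2003, idna2008, transitional)

# Final sigma is analogous: nameprep folds ς (U+03C2) to σ (U+03C3),
# while IDNA2008 lists ς as PVALID and keeps the two code points apart.
print("ς".encode("idna"), idna.encode("ς"))
```

Under the transitional flag the two German spellings come out identical; under plain IDNA2008 they do not. That is the whole disagreement in a dozen lines.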
> Eszett is less clear, because using eszett or ss influences
> the pronunciation (at least in Germany, in Switzerland that
> can be different). I imagine it's rather worse if you're
> Turkish and prefer different i's. For German, nobody is ever
> going to expect fußball.ch and fussball.ch to go different
> place.

I suspect that there are other possible examples that don't have that property. But that is something on which Marcos should comment. Clearly it is within the power of the registry to arrange for "same place" if that is what they want to do. And, if they do that for all such names, this whole discussion is moot in practice.

>...
> For words that happen to be similar, there's no expectation
> that a DNS name is available. AAA Plumbing and all the other
> AAA whatever's out there aren't going to be surprised that
> AAA.com is already taken.

Surprised? Probably not. Willing to fight over who is the "real" AAA? Yes, and we have seen that sort of thing repeatedly.

> So why's German more special that
> Turkish or English?

Because "ß" is really a different letter than the "ss" sequence. And dotless i is really a different letter than the dotted one, just as "o" and "0" or "l" and "1" are. If a registry decides that the potential for spoofing and other problems outweighs the advantages of keeping them separate and potentially allocating them separately, and either delegates them to the same entity or blocks one string from each pair, I think that is great. If they make some other decision, that is great too. Where I have a problem is when a browser (or other lookup application) makes that decision, essentially blocking one of the strings, and makes it on behalf of the user without any consideration of local issues or conventions.

I might even suggest that, because "O" and "0" and "l" and "1" are more confusable (and hence spoofing-prone) than "ß" and "ss", if you were being logically consistent, you would map all domain labels containing "0" into ones containing "o", and ones containing "1" into ones containing "l". That would completely prevent the "MICR0S0FT" spoof and a lot of others, at the price of making a lot of legitimate labels invalid or inaccessible -- just like the "ß" case. And, like "ß", treating 0 or 1 as display issues would not only not help very much, it would astonish users of European digits.

best,
   john
Received on Wednesday, 21 August 2013 20:49:50 UTC