- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Thu, 25 Oct 2007 16:02:43 +0900
- To: Daniel Dardailler <danield@w3.org>
- Cc: Najib Tounsi <ntounsi@emi.ac.ma>, "'WWW International'" <www-international@w3.org>, W3C Offices <w3c-office-pr@w3.org>, public-i18n-core@w3.org
Hello Daniel,

At 21:41 07/10/23, Daniel Dardailler wrote:

>Hello Martin, all
>
>ICANN heard/hears very well the complaints of those who say they are taking too much time to i18nize DNS, not to mention localize it, so I'm not sure adding our own critics would help.

Good to hear. Are they just hearing, or are they actually listening?

>I've always been told that the tests were necessary and long because of the paramount importance of the root integrity, which I have a hard time pushing against personnally since I pushed for more QA at W3C from day one (at the price of speed, clearly) to ensure better testing of our specs.

I don't want to question the importance of root integrity. But the last test I remember was about a year ago, and everybody who knew anything about the DNS was able to predict well beforehand that it wouldn't cause any problems. I'm all for tests, and I know they take some time, but I know the difference between testing for real and testing to let time pass when I see it.

>The real issue today is the policy debate for creating new TLDs by the dozen, and whether or not the system will scale right away in the IDN space.

Scaling issues are definitely difficult to predict, but most kinds of testing don't help with that. And adding IDN TLDs shouldn't be done in a way that suddenly increases the number of TLDs by an order of magnitude. Adding just the IDN ccTLDs in what I call the first stage below will probably not even double the overall number of ccTLDs. This will help to assess scalability issues and to take countermeasures (the DNS scaled extremely well and fast in the latter half of the '90s, and the basic assumption can only be that it will handle quite a bit more scaling if handled correctly).

>Note that this debate alone in gTLD space, without IDN complications, took that long to actually start moving again. Another issue is the translation/transposition of country code names for ccTLD, which wasn't looking good - as far as a standard is concerned - on the ISO side last year.

I have heard that a new work proposal from BSI was voted down. I think it's not realistic to ask ISO to do this work. It is also not realistic, and not necessary, to expect that transliterations (that's the technical term that probably comes closest to what's needed) are created for all country codes in all scripts. As an example, I don't think there is a high demand for a ccTLD in the Mongolian script for Switzerland.

So I think these ccTLDs should be staged, first creating those for scripts that are widely used in a particular country (criteria such as "does that script appear on the country's coins or banknotes" or "is that script used in official publications" can be used). The second stage may include minorities (e.g. Arabic for France, Punjabi for the UK, maybe Tamil for Switzerland, and so on). The third stage would address tourists, and the fourth stage, if ever, could try to reach full coverage.

Please note that above, we are always speaking about scripts, not languages. That's very important: current TLDs (both cc and g) are in the Latin script, and mostly pretty language-agnostic, or at least multi-language.

>People need to really understand that IDNs are not free-4-all Unicode strings, and that DNS in its current state is not designed with that in mind (search for John Klensin analysis on that point).

I have read John's various documents on this topic, have listened to talks from him, and have discussed things directly with him.
He raises a few good points, but many times he makes an elephant out of a mouse, or lets readers think about elephants when they should be thinking about mice. It may also be worth mentioning that in general, domain names are indeed pretty much "free-for-all" strings, i.e. people just come up with what they like and register it. DNS wasn't designed as a search engine, but it was definitely designed for use by humans. Also, on the second level, many IDN labels already exist, without any big problems.

What should definitely be kept for TLDs, at least until there is more experience with gTLDs in the US-ASCII area, is a restriction to short, somewhat cryptic (even though easily memorable) strings. That's the area where, traditionally, TLDs have differed from the other labels in a domain name.

>IMO, IDNs, like TLDs in general, are identifiers akin vehicle license plates, with the same cross-community border interop issues - and maybe the same solution (e.g. ascii subset being used in a lot of regions).

Well, licence plates are a good example actually. In Japan, they use Kanji and Hiragana. In Germany, they use umlauts. In many Arabic countries, they use Arabic letters and numerals. Najib can tell us what Morocco does. China apparently uses Hanzi and Latin letters, and Russia uses a subset of Cyrillic letters that can also be recognized as Latin letters. See e.g. http://en.wikipedia.org/wiki/Vehicle_registration_plate and http://www.worldlicenseplates.com/. Some Arabic countries use licence plates with text and numbers in both Arabic and Latin script. So there goes the "ascii subset" idea.

>Not to mention that URLs are supposed to be opaque..

The short answer is that repeating this doesn't make it true. The long answer is that one has to carefully distinguish between overly general statements like the above and statements such as the following: "If you want your URIs to last long, using _only_ words that you think you know the meaning of now, but whose meaning may obviously change in the long term, is a bad idea; adding e.g. something date-related is good." Or: "If you look at a URI and there are components that look like words, these words may (and often do) mean what you think they mean, but there's no guarantee of that." Such statements make a lot of sense, but they only slightly diminish the value of being able to use URIs with meaningful components.

To quote from a presentation I gave just a bit more than 10 years ago (see p. 2 of http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-URI.pdf), identifiers that contain recognizable components are easier to:

- Devise (what is my identifier for X?)
- Memorize (what was the identifier of X?)
- Guess (what could be the identifier of X?)
- Understand (what does X refer to?)
- Correct (spelling errors)
- Identify with (nice identifier, isn't it?)
- Manipulate (write, type, ...)

Obviously, this applies to identifiers using Latin characters for those who are familiar with the Latin script, as well as to identifiers in other scripts for those who are familiar with those other scripts.

>Speaking of reasons why URNs aren't opaque, I also wonder if people have thought carefully of the print-digital interface problem we're facing with IDNs becoming popular (I remember talking with Richard about that) and no Unicode familiarity in the masses (e.g. I know how to enter Γρ.αφ in my computer even though I don't have a greek keyboard, but my daughter doesn't - OK my wife does too, but it's a special case :)

[I assume you mean URIs, or actually IRIs, not URNs.]
Greek again is an excellent example. Well-educated people in the Western hemisphere are just about as familiar with the Greek script as many people in regions that don't primarily use the Latin script are familiar with the Latin script. If you understand that for you and the people around you, having to use Greek in URIs would be a big bother (input being only a part of it), then hopefully you can understand that it's a big bother for some people to be limited to Latin in URIs. And with "One Laptop per Child" (http://laptop.org/) moving towards mass production (I saw one last week at the Unicode Conference), the number of these people will increase even more.

It is clear that this answer doesn't help you input an IRI including e.g. Kanji. But if it makes manipulating a Web address easier for 99.9% of the actual users (an IRI including Kanji will point to a Japanese Web page, mostly being read by Japanese users or people who otherwise read/use Japanese), while making it more difficult for the remaining 0.1% or so, in my view that's a big gain. After all, you can always fall back to punycode or %HH-escapes (a small sketch of that fallback follows below). These don't hurt if you can't read the actual characters anyway (if you can read the actual characters, then seeing punycode hurts a lot). There is also nothing prohibiting somebody from serving a domain both with one (or several) IDNs as well as with a US-ASCII-only domain name.

>Do you by any chance know which consortia or ISO group is currently or has worked on licence plate normalisation ? How did they solve their I18N issues, and on which ground ? Police Interop comes first ?

See above. The usability benefits for the local population, including the local police, come first. If 99.9% of the potential users can remember or note down a car licence number faster because it uses the native script, but 0.1% of potential users don't manage to remember it or note it down, then that's a net average gain.
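To make the fallback mentioned above concrete, here is a minimal sketch. It happens to use Python 3's built-in "idna" codec (IDNA 2003) and urllib.parse, which is just one convenient way to show the mapping; the host and path are made-up examples, and real registries may apply additional or newer IDNA rules:

    # Sketch only: the punycode / %HH-escape fallback for an IRI.
    # Assumes Python 3's built-in "idna" codec (IDNA 2003); treat the
    # names and values here as illustrative, not normative.
    from urllib.parse import quote

    host = "日本語.example"   # hypothetical IDN host
    path = "/引越し"          # non-ASCII path component

    punycode_host = host.encode("idna").decode("ascii")
    escaped_path = quote(path, safe="/")

    print(punycode_host)   # xn--wgv71a119e.example
    print(escaped_path)    # /%E5%BC%95%E8%B6%8A%E3%81%97

Someone who can't read the Kanji can still copy or type the last two forms, which is the whole point of the fallback.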
Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Thursday, 25 October 2007 07:04:31 UTC