Re: Internationalized Domain Names (IDNs) in progress from Martin Duerst on 2007-10-25 (www-international@w3.org from October to December 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 25 Oct 2007 16:02:43 +0900
To: Daniel Dardailler <danield@w3.org>
Cc: Najib Tounsi <ntounsi@emi.ac.ma>, "'WWW International'" <www-international@w3.org>, W3C Offices <w3c-office-pr@w3.org>, public-i18n-core@w3.org
Message-Id: <6.0.0.20.2.20071024171149.0818db10@localhost>

Hello Daniel,

At 21:41 07/10/23, Daniel Dardailler wrote:
>Hello Martin, all
>
>ICANN heard/hears very well the complaints of those who say they are taking too much time to i18nize DNS, not to mention localize it, so I'm not sure adding our own critics would help.

Good to hear. Are they just hearing, or are they actually listening?

>I've always been told that the tests were necessary and long because of the paramount importance of the root integrity, which I have a hard time pushing against personally since I pushed for more QA at W3C from day one (at the price of speed, clearly) to ensure better testing of our specs.

I don't want to question the importance of the root integrity. But the last test
I remember was about a year ago, and everybody who knew anything about the DNS
was able to predict that it wouldn't cause any problems way beforehand. I'm all
for tests, and I know they take some time, but I know the difference
between testing for real and testing to let time pass when I see it.

>The real issue today is the policy debate for creating new TLDs by the dozen, and whether or not the system will scale right away in the IDN space.

Scaling issues are definitely difficult to predict, but for that,
most kinds of testing don't help. And adding IDN TLDs shouldn't
be done in a way that suddenly increases the number of TLDs by an
order of magnitude. Adding just the IDN ccTLDs in what I call the
first stage below will probably less than double the overall number
of ccTLDs. This will help to assess scalability issues and to take
countermeasures (the DNS has scaled extremely well and fast in
the latter half of the '90s, and the basic assumption can only
be that it will handle quite a bit more scaling if handled correctly).

>Note that this debate alone in gTLD space, without IDN complications, took that long to actually start moving again. Another issue is the translation/transposition of country code names for ccTLD, which wasn't looking good - as far as a standard is concerned - on the ISO side last year.

I have heard that a new work proposal from BSI was voted down. I think it's
not realistic to ask ISO to do this work. It is also not realistic, and not
necessary, to expect that transliterations (that's the technical term that
probably comes closest to what's needed) are created for all country codes
in all scripts. As an example, I don't think that there is a high demand
for a ccTLD in the Mongolian script for Switzerland.

So I think these ccTLDs should be staged, first creating those for scripts
that are widely used in a particular country (as criteria, things such as
"does that script appear on the country's coins or banknotes", "is that script
used in official publications" and so on can be used). The second stage may
include minorities (e.g. Arabic for France, Punjabi for the UK, maybe Tamil
for Switzerland and so on). The third stage would address tourists, and the
fourth stage, if ever, could try to reach full coverage.

Please note that above, we are always speaking about script, not language.
That's very important: current TLDs (both cc and g) are in the Latin script,
and mostly pretty language-agnostic or at least multi-language.

>People need to really understand that IDNs are not free-4-all Unicode strings, and that DNS in its current state is not designed with that in mind (search for John Klensin analysis on that point).

I have read John's various documents on this topic, have listened to talks from him,
and have discussed things directly with him. He raises a few good points, but many times
makes an elephant out of a mouse, or lets readers think about elephants when they
should think about mice.

It may also be worth mentioning that in general, domain names are indeed
pretty much "free-for-all" strings, i.e. people just come up with what
they like, and register it. DNS wasn't designed as a search engine, but
it was definitely designed for use by humans. Also, at the second level,
many IDN labels already exist, without any big problems.

What should definitely be kept for TLDs, at least until there is more
experience for gTLDs in the US-ASCII area, is a restriction to short,
somewhat cryptic (even though easily memorable) strings. That's the
area where traditionally, TLDs have differed from the other labels
in a domain name.

>IMO, IDNs, like TLDs in general, are identifiers akin vehicle license plates, with the same cross-community border interop issues - and maybe the same solution (e.g. ascii subset being used in a lot of regions).

Well, licence plates are a good example actually. In Japan, they use Kanji and Hiragana.
In Germany, they use Umlauts. In many Arabic countries, they use Arabic letters and numerals.
Najib can tell us what Morocco does. China apparently uses Hanzi and Latin letters,
and Russia uses a subset of Cyrillic letters that can also be recognized as Latin
letters. See e.g. http://en.wikipedia.org/wiki/Vehicle_registration_plate and
http://www.worldlicenseplates.com/. Some Arabic countries use licence plates
with text and numbers in both Arabic and Latin script.

So there goes the "ascii subset" idea.

>Not to mention that URLs are supposed to be opaque..

The short answer is that repeating this doesn't make it true.

The long answer is that one has to carefully distinguish between
overly general statements like the above, and statements such as
the following:
"If you want your URIs to last long, using _only_ words whose meaning you
think you know now, but whose meaning may well change in the long term,
is a bad idea; adding e.g. something date-related is good."
or:
"If you look at a URI, and there are components that look like words,
these words may (and often do) mean what you think they mean, but
there's no guarantee for that."
Such statements make a lot of sense, but they only slightly diminish the
value of being able to use URIs with meaningful components.

To quote from a presentation I gave just a bit more than 10 years ago:
(see p. 2 of http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-URI.pdf)

Identifiers that contain recognizable components are easier to:
- Devise (what is my identifier for X?)
- Memorize (what was the identifier of X?)
- Guess (what could be the identifier of X?)
- Understand (what does X refer to?)
- Correct (spelling errors)
- Identify with (nice identifier, isn’t it?)
- Manipulate (write, type,...)

Obviously, this applies to identifiers using Latin characters for those who
are familiar with the Latin script, as well as to identifiers in other scripts
for those who are familiar with these other scripts.

>Speaking of reasons why URNs aren't opaque, I also wonder if people have thought carefully of the print-digital interface problem we're facing with IDNs becoming popular (I remember talking with Richard about that) and no Unicode familiarity in the masses (e.g. I know how to enter Γρ.αφ in my computer even though I don't have a greek keyboard, but my daughter doesn't - OK my wife does too, but it's a special case :)

[I assume you mean URIs, or actually IRIs, not URNs.]
Greek again is an excellent example. Well-educated people in the Western hemisphere
are just about as familiar with the Greek script as many people in regions that don't
primarily use the Latin script are familiar with the Latin script. If you understand
that for you and the people around you, having to use Greek in URIs would be a big
bother (input only being a part of it), then hopefully you can understand that it's
a big bother for some people to be limited to Latin in URIs. And with the
"One laptop per child" (http://laptop.org/) moving towards mass production
(I saw one last week at the Unicode Conference), the number of these people
will increase even more.

It is clear that this answer doesn't help you input an IRI including e.g. Kanji.
But if it makes manipulating a Web address easier for 99.9% of the actual users
(an IRI including Kanji will point to a Japanese Web page, mostly being read
by Japanese users or people who otherwise read/use Japanese), while making it
more difficult for 0.1% or so, in my view that's a big gain. After all, you can
always fall back to punycode or %HH-escapes. These don't hurt if you
can't read the actual characters anyway (if you can read the actual characters,
then seeing punycode hurts a lot). There is also nothing prohibiting somebody
from serving a domain both with one (or several) IDNs as well as with a
US-ASCII-only domain name.
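
To make the fallback concrete, here is a minimal sketch in Python of how the
two ASCII forms can be derived. It assumes Python 3's built-in "idna" codec
(which applies IDNA 2003 processing) for the host part and urllib's quote()
for %HH-escapes of the UTF-8 bytes in the path. The function names and the
"bücher.example" host are only illustrations, not anything from the specs.

```python
from urllib.parse import quote, unquote


def host_to_ascii(host: str) -> str:
    """Return the ASCII ("xn--" / punycode) form of an internationalized host."""
    # The built-in "idna" codec converts each label separately and leaves
    # labels that are already plain ASCII untouched.
    return host.encode("idna").decode("ascii")


def host_to_unicode(ascii_host: str) -> str:
    """Return the Unicode form of a host given in its ASCII ("xn--") form."""
    return ascii_host.encode("ascii").decode("idna")


def path_to_ascii(path: str) -> str:
    """Return the path with non-ASCII characters as %HH-escaped UTF-8 bytes."""
    # quote() keeps "/" unescaped by default, so the path structure survives.
    return quote(path)


if __name__ == "__main__":
    print(host_to_ascii("bücher.example"))          # xn--bcher-kva.example
    print(host_to_unicode("xn--bcher-kva.example"))  # bücher.example
    print(path_to_ascii("/日本語"))                   # /%E6%97%A5%E6%9C%AC%E8%AA%9E
    print(unquote("/%E6%97%A5%E6%9C%AC%E8%AA%9E"))   # /日本語
```

Both conversions are reversible, so somebody who cannot type the original
characters can still use the ASCII form, while everybody else keeps the
readable one.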

>Do you by any chance know which consortia or ISO group is currently or has worked on licence plate normalisation ? How did they solve their I18N issues, and on which ground ? Police Interop comes first ?

See above. The usability benefits for the local population, including the local
police, come first. If 99.9% of the potential users can remember or note down
a car licence number faster because it uses the native script, but 0.1%
of potential users don't manage to remember it or note it down, then that's
a net average gain.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

Received on Thursday, 25 October 2007 07:04:31 UTC