Re: Internationalized Domain Names (IDNs) in progress from Martin Duerst on 2007-10-26 (public-i18n-core@w3.org from October to December 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Fri, 26 Oct 2007 14:54:21 +0900
To: John Cowan <cowan@ccil.org>, Najib Tounsi <ntounsi@emi.ac.ma>
Cc: John Cowan <cowan@ccil.org>, Daniel Dardailler <danield@w3.org>, "'WWW International'" <www-international@w3.org>, W3C Offices <w3c-office-pr@w3.org>, public-i18n-core@w3.org
Message-Id: <6.0.0.20.2.20071026142935.07027ad0@localhost>
One very important property of TLDs that I haven't mentioned
in my previous posts is that they are one-off creations.

This has various implications. One implication is with respect
to spoofing. While for second-level domains (or third-level
domains, in cases such as foo.co.uk) which are open to everybody,
general rules against spoofing have to be designed (e.g. don't mix
scripts,...), on the TLD level, spoofing can be eliminated by
considering each case in turn. The example I'm always using for
this is that Russia in Cyrillic would most naturally get a
two-letter code looking like .py, but this would spoof Paraguay,
so something different has to be found for Russia, e.g.
transliterated r-ja (in actual writing what looks like a p
followed by a mirrored R). Case-by-case checking in many ways
is much easier than defining general rules, and this means that
the spoofing scare that some people have is not at all justified
for TLDs, where spoofing can easily be avoided.
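
As a rough illustration of the difference between a general rule
("don't mix scripts") and a case-by-case check of one TLD candidate,
here is a minimal Python sketch. The lookalike table is a hand-picked
assumption for this example, not the official Unicode confusables data,
and the names skeleton, spoofed_tlds and scripts are hypothetical.
The script detection simply takes the first word of each character's
Unicode name, which is only a rough approximation.

```python
import unicodedata

# Assumed, hand-picked Cyrillic letters that resemble Latin letters.
# Illustrative only; a real check would use a reviewed confusables list.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}

def skeleton(label):
    """Replace each character by its Latin lookalike, if the table has one."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in label.lower())

def spoofed_tlds(candidate, existing):
    """Case-by-case check: which existing TLDs does the candidate resemble?"""
    target = skeleton(candidate)
    return {tld for tld in existing if tld != candidate and skeleton(tld) == target}

def scripts(label):
    """General-rule helper: rough set of scripts used in a label."""
    return {unicodedata.name(ch).split()[0] for ch in label}

existing = {"py", "ru", "com"}
print(spoofed_tlds("\u0440\u0443", existing))  # Cyrillic er + u, looks like "py"
print(spoofed_tlds("\u0440\u044f", existing))  # Cyrillic er + ya ("r-ja")
print(scripts("p\u0443pal"))                   # Latin label with one Cyrillic letter
```

With this table, the Cyrillic candidate that looks like .py is flagged
against the existing .py, while the r-ja alternative is not flagged.
The scripts helper shows what a blanket mixed-script rule for lower-level
domains would have to look at.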

A second implication is that we only need to find one case
that works reasonably, rather than make all cases work.
As an example, consider the task of finding something equivalent
to .com for the East Asian region using Han ideographs.
[Let's for the moment assume that we want such a thing; I at
least think that such gTLDs should be second priority to ccTLDs.]

In each language/region, there are various Han characters that
are used for companies/commerce/... Some may have simplified/
traditional variants. Some may only be used in some regions,
in other regions being associated with something completely
different.

As an example, company is 公司 in Chinese, but 会社 in
Japanese, so neither of these words works everywhere. But
looking around, a character such as 商 may indeed express the
concept closely enough in all regions. [I have checked this
for the above character, and I think it's true. I have also
been told that this character is also associated with a
dynasty in China, which may be an issue to consider, although
I don't think reserving characters for dynasty TLDs makes any
sense. If somebody knows more, I'm glad to learn.]

Using the same script usually means that there is also some
sharing of culture or vocabulary. It is well known that e.g.
for the Arabic script, there are quite a few words of Arabic
(language) origin that are used in languages that are
linguistically not related to Arabic. Same for most other
scripts and languages. It is clear that it may be difficult
to find examples that reach into each and every language
written with the script, but it's also clear that with a
bit of work and deliberation, it should be possible to
cover a high percentage of users, not just an accidental
majority.

In addition, abbreviations often help. They definitely
helped in the cases of .com or .net,... Such abbreviations
can be seen as mere letter combinations, potentially
gaining a meaning of their own. This is more difficult
with longer words.


In conclusion: Try to find cases that work. If we find
something that works, we are done. We don't need to use
the things that don't work.

I think that's the main point where I'm very unhappy with ICANN,
and also with some of what Daniel has said. Rather than spending
the main energy on finding things that work, it at least looks like
most energy is spent on trying to find counterexamples and problems
that, if avoided, are not relevant at all.


What's also frustrating is that the counterarguments mostly
seem to be coming from people who have little day-in-day-out
experience with non-Latin scripts, even if they may know a lot
about many scripts and languages. What most of these people
don't realize is that learning another script, e.g. the 26
letters of the Latin alphabet, is not equivalent to being
truly fluent in that script, which independently of the
script takes years. Indeed, in Europe, every second-grader
(and these days indeed most first-graders) knows the letters
of the alphabet, but this in no way means they are fluent in
the Latin script. As another example, I have no problems reading
Japanese newspapers or technical publications, and I teach
in Japanese at a Japanese university using my own Japanese
materials. Still, after a total of more than 10 years in
Japan, my speed of reading Japanese is quite a bit lower
than that of reading English or German, and the speed of
finding a word on a page or a topic in a book is even more
different.


Regards,    Martin.


At 11:43 07/10/26, John Cowan wrote:
>Najib Tounsi scripsit:
>
>> Or the 23 countries and territories with a combined population of some 
>> 325 million of users?
>
>Sure, Arabic is the largest language written in Arabic script, but
>Arabic script is used in many countries where Arabic is not spoken,
>and Urdu plus the various kinds of Persian probably account for
>half as many speakers.
>
>If you don't like that example, consider Cyrillic.  Should all
>the ccTLDs in Cyrillic script be Russian-based?  At least
>some of the Latin ccTLDs aren't English-based, even though
>English is far and away the largest Latin-script language.
>
>-- 
>John Cowan  cowan@ccil.org   http://ccil.org/~cowan
>"The exception proves the rule."  Dimbulbs think: "Your counterexample proves
>my theory."  Latin students think "'Probat' means 'tests': the exception puts
>the rule to the proof."  But legal historians know it means "Evidence for an
>exception is evidence of the existence of a rule in cases not excepted from."


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Friday, 26 October 2007 08:40:56 UTC