RE: Urdu IDNs: Characters in domain names

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 08 Aug 2007 11:15:32 +0900
To: "Jonathan Rosenne" <rosennej@qsm.co.il>, "'Richard Ishida'" <ishida@w3.org>, <www-international@w3.org>, <public-iri@w3.org>
Cc: "'Sarmad Hussain'" <sarmad.hussain@nu.edu.pk>

At 05:23 07/07/31, Jonathan Rosenne wrote:
>Sarmad's mail is specific about TLDs. TLDs are specific and limited in 
>number, and it should be possible to determine national language TLDs for 
>most countries while avoiding the problems mentioned below.

This confuses various issues. First, TLDs, at least up to now, are mostly
not language-based, but script-based. It makes a lot of sense to keep it
that way, because it reduces the number of TLDs enormously.

Second, TLDs are indeed currently limited in number, but their number
is increasing. It used to be that it was easy to remember all generic
(non-country) TLDs, but that's no longer the case (see

Third, it is unclear whether this is talking about ccTLDs (country-code
TLDs) or generic TLDs. "TLDs for most countries" suggests the former,
but some part of the above text seems to include the latter.

>The use of URLs in the national language within a TLD of the same language 
>should be determined at the country or language level and not necessarily 
>by an international organization.

Given that TLDs should be mostly script-based, this makes it a bit
more difficult to determine things at a country or language level.
Most scripts are used in more than one country, by more than one
language. However, I agree that a lot of the consensus discussion
should be done at a level where those directly affected are involved.

>I am sorry to have to say this, but the international organizations 
>controlling the internet have not shown in the past sufficient 
>understanding for people who do not use the Latin script and their 
>problems. By show I mean action, not talk.

I fully agree that ICANN is way too slow when it comes to non-Latin TLDs.

>At least for Hebrew, I do not think that vowels should be removed. For 
>native users of the language vowels that may or may not be present are 
>natural. A site may register itself twice, with and without vowels, or 
>duplicate their pages, or handle it in their server, if they want to.

For Hebrew as a language, vowels may be moderately important. For other
languages written in Hebrew, vowels may be more or less important.
I therefore agree that removing them at the client is a bad idea.

As long as registration is only needed twice, things might be fine.
But there may be cases where there are more variants with vowels,
which can lead to an explosion of the number of needed registrations.

So vowels have to be addressed very carefully by registry policies.
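
To illustrate how quickly the number of variants can grow, here is a
minimal Python sketch. It assumes that every vowel point in a label is
independently optional, which is a simplification and not a statement
about any actual registry policy. The function names are made up for
this example, and the sample label is the Hebrew word "shalom" written
with three points.

```python
import unicodedata
from itertools import combinations

def strip_marks(label):
    """Drop all nonspacing marks (category Mn), e.g. Hebrew points."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

def partial_pointings(label):
    """Every spelling obtained by keeping any subset of the marks.

    Assumption for illustration: each mark is independently optional.
    """
    chars = list(unicodedata.normalize("NFD", label))
    marks = [i for i, ch in enumerate(chars)
             if unicodedata.category(ch) == "Mn"]
    variants = set()
    for r in range(len(marks) + 1):
        for kept in combinations(marks, r):
            dropped = set(marks) - set(kept)
            variants.add("".join(ch for i, ch in enumerate(chars)
                                 if i not in dropped))
    return variants

# shin + qamats + shin dot, lamed, vav + holam, final mem
word = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
variants = partial_pointings(word)
print(len(variants))  # 8 spellings for 3 optional marks

# Punycode is a one-to-one encoding, so each spelling is a distinct label.
ace_labels = {v.encode("punycode") for v in variants}
print(len(ace_labels))
```

With n optional marks there are 2**n spellings, each of which would
be a separate registration unless the registry bundles or blocks them.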

As others have already said, artificially providing 'translated'
TLDs by a plugin or something similar at the client level leads
to dangerous confusion and fragmentation in the URI/IRI space.
In my view, what will ultimately happen is that there will be
e.g. Arabic-script TLDs, and Arabic-script domain names will just
be registered in these TLDs because that's easiest for the users.
Therefore, something like ARABIC.com is a temporary artefact
that we should be able to get away from quickly, rather than
stick to by providing a plugin to translate from some Arabic-script
equivalent to .com.

Regards,    Martin.

>> -----Original Message-----
>> From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On
>> Behalf Of Richard Ishida
>> Sent: Monday, July 30, 2007 11:11 PM
>> To: www-international@w3.org; public-iri@w3.org
>> Cc: 'Sarmad Hussain'
>> Subject: Urdu IDNs: Characters in domain names
>>
>> Sarmad Hussain, at the Center for Research in Urdu Language Processing,
>> FAST National University, Pakistan, is looking at enabling Urdu IDNs
>> based on ICANN recommendations, but this may lead to similar approaches
>> in a number of other countries.
>> There are some aspects to Sarmad's proposal, arising from the nature of
>> the Arabic script used for Urdu, that raise some interesting questions
>> about the way IDN works for this kind of language. These have to do
>> with the choice of characters allowed in a domain name.
>> For example, there is a suggestion that users should be able to use
>> certain characters when writing a URI in Urdu which are then either
>> removed (eg. vowel diacritics) or converted to other characters (eg.
>> Arabic characters) during the conversion to punycode by a user agent
>> plug-in.
>> This is not something that is normally relevant for English-only URIs,
>> because of the relative simplicity of our alphabet. There is much more
>> potential ambiguity in Urdu for use of characters. Note, however, that
>> the proposals Sarmad is making are language-specific, not script-
>> specific, ie. Arabic or Persian (also written with the Arabic script)
>> would need some slightly different rules.
>> I find myself wondering whether you could use a plug-in to strip out or
>> convert the characters while converting to punycode. People typing IDNs
>> in Urdu would need to be aware of the need for a plug-in, and would
>> still need to know how to type in IDNs if they found themselves using a
>> browser that didn't have the plug-in (eg. the businessman who is
>> visiting a corporation in the US that prevents ad hoc downloads of
>> software). On the one hand, I wonder whether we can expect a user who
>> sees a URI on a hard copy brochure containing vowel diacritics to know
>> what to do if their browser or mail client doesn't support the plug-in.
>> On the other hand, a person writing a clickable URI in HTML or an email
>> would not be able to guarantee that users would have access to the
>> plug-in. In that case, they would be unwise to use things like short
>> vowel diacritics, since the user cannot easily change the link if they
>> don't have a plug-in. Imagine a vowelled IDN coming through in a plain
>> text email, for example: the reader may need to edit the email text to
>> get to the resource rather than just click on it. Not likely to be
>> popular.
>> Another alternative is to do such removal and conversion of characters
>> as part of the standard punycode conversion process. This, I suspect,
>> would necessitate every browser to have access to standardised tables
>> of characters that should be ignored or converted for any language. But
>> there is an additional problem in that the language would need to be
>> determined correctly before such rules were applied - that is, the
>> language of the original URI. That too seems a bit difficult.
>> So I can see the need, but I'm not sure what the solution would be. I'm
>> inclined to think that creating a plug-in might create more trouble
>> than benefit, by replacing the problems of errors and ambiguities with
>> the problems of uninteroperable IDNs.
>> There is an Excel file attached that lists which characters in the
>> Arabic block would be appropriate for Urdu IDNs.  I will also list the
>> characters below in a slightly different order.
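
The interoperability concern raised in the quoted message can be
sketched in a few lines of Python. Everything here is hypothetical:
the folding table is a two-entry illustration and not Sarmad's
character list, `plugin_normalize` stands in for whatever a plug-in
might do, and `to_ace` applies raw Punycode with the "xn--" prefix but
no nameprep step.

```python
import unicodedata

# Hypothetical folding table (illustrative only): Arabic code points
# mapped to the forms commonly used when writing Urdu.
FOLD = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}

def plugin_normalize(label):
    """What a hypothetical plug-in might do before Punycode conversion."""
    out = []
    for ch in label:
        if unicodedata.category(ch) == "Mn":  # drop vowel diacritics
            continue
        out.append(FOLD.get(ch, ch))          # convert mapped letters
    return "".join(out)

def to_ace(label):
    """Encode one label as Punycode with the ACE prefix (no nameprep)."""
    return "xn--" + label.encode("punycode").decode("ascii")

# "kitab" typed with Arabic kaf and a kasra: kaf, kasra, teh, alef, beh
typed = "\u0643\u0650\u062A\u0627\u0628"

with_plugin = to_ace(plugin_normalize(typed))
without_plugin = to_ace(typed)

print(with_plugin)
print(without_plugin)
print(with_plugin == without_plugin)  # False: two different DNS names
```

The same typed string yields two different ASCII labels depending on
whether the client has the plug-in, so a link that works for one user
may point nowhere, or somewhere else, for another.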

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     
Received on Wednesday, 8 August 2007 03:58:24 UTC