RE: Urdu IDNs: Characters in domain names

From: Jonathan Rosenne <rosennej@qsm.co.il>
Date: Mon, 30 Jul 2007 23:23:42 +0300
To: "'Richard Ishida'" <ishida@w3.org>, <www-international@w3.org>, <public-iri@w3.org>
Cc: "'Sarmad Hussain'" <sarmad.hussain@nu.edu.pk>
Message-ID: <000c01c7d2e7$891065a0$9b3130e0$@co.il>

Sarmad's mail is specific about TLDs. TLDs are specific and limited in number, and it should be possible to determine national language TLDs for most countries while avoiding the problems mentioned below.

The use of URLs in the national language within a TLD of the same language should be determined at the country or language level and not necessarily by an international organization.

I am sorry to have to say this, but the international organizations controlling the internet have not, in the past, shown sufficient understanding of people who do not use the Latin script, or of their problems. By "shown" I mean action, not talk.

At least for Hebrew, I do not think that vowels should be removed. For native users of the language, vowels that may or may not be present are natural. A site may register itself twice, with and without vowels, or duplicate its pages, or handle it in its server, if it wants to.

Jony

> -----Original Message-----
> From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On
> Behalf Of Richard Ishida
> Sent: Monday, July 30, 2007 11:11 PM
> To: www-international@w3.org; public-iri@w3.org
> Cc: 'Sarmad Hussain'
> Subject: Urdu IDNs: Characters in domain names
> 
> Sarmad Hussain, at the Center for Research in Urdu Language Processing,
> FAST National University, Pakistan, is looking at enabling Urdu IDNs
> based on ICANN recommendations, but this may lead to similar approaches
> in a number of other countries.
> 
> There are some aspects to Sarmad's proposal, arising from the nature of
> the Arabic script used for Urdu, that raise some interesting questions
> about the way IDN works for this kind of language. These have to do
> with the choice of characters allowed in a domain name.
> 
> For example, there is a suggestion that users should be able to use
> certain characters when writing a URI in Urdu which are then either
> removed (eg. vowel diacritics) or converted to other characters (eg.
> Arabic characters) during the conversion to punycode by a user agent
> plug-in.
> 
> This is not something that is normally relevant for English-only URIs,
> because of the relative simplicity of our alphabet. There is much more
> potential ambiguity in Urdu for use of characters. Note, however, that
> the proposals Sarmad is making are language-specific, not script-
> specific, ie. Arabic or Persian (also written with the Arabic script)
> would need some slightly different rules.
> 
> I find myself wondering whether you could use a plug-in to strip out or
> convert the characters while converting to punycode. People typing IDNs
> in Urdu would need to be aware of the need for a plug-in, and would
> still need to know how to type in IDNs if they found themselves using a
> browser that didn't have the plug-in (eg. the businessman who is
> visiting a corporation in the US that prevents ad hoc downloads of
> software). On the one hand, I wonder whether we can expect a user who
> sees a URI on a hard copy brochure containing vowel diacritics to know
> what to do if their browser or mail client doesn't support the plug-in.
> On the other hand, a person writing a clickable URI in HTML or an email
> would not be able to guarantee that users would have access to the
> plug-in. In that case, they would be unwise to use things like short
> vowel diacritics, since the user cannot easily change the link if they
> don't have a plug-in. Imagine a vowelled IDN coming through in a plain
> text email, for example: the reader may need to edit the email text to
> get to the resource rather than just click on it. Not likely to be
> popular.
> 
> Another alternative is to do such removal and conversion of characters
> as part of the standard punycode conversion process. This, I suspect,
> would require every browser to have access to standardised tables
> of characters that should be ignored or converted for any language. But
> there is an additional problem in that the language would need to be
> determined correctly before such rules were applied - that is, the
> language of the original URI. That too seems a bit difficult.
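
To make the table-driven alternative concrete, here is a rough Python sketch. It assumes per-language tables keyed by a language tag; the names RULES and to_ace are hypothetical, the table entries are only an illustrative subset of the lists further down, and the sketch skips nameprep entirely. Python's built-in "punycode" codec does the final encoding.

```python
# Hypothetical per-language tables; the entries are an illustrative subset.
RULES = {
    "ur": {
        # A few optional marks plus ZERO WIDTH NON-JOINER.
        "remove": {0x064E, 0x064F, 0x0650, 0x200C},
        # Arabic-Indic digits mapped to the extended (Urdu) digits.
        "map": {0x0660 + i: 0x06F0 + i for i in range(10)},
    },
}

def to_ace(label: str, lang: str) -> str:
    """Apply the rules for `lang`, then Punycode-encode the label."""
    rules = RULES[lang]  # raises KeyError if no table exists for the language
    kept = "".join(ch for ch in label if ord(ch) not in rules["remove"])
    mapped = kept.translate(rules["map"])
    return "xn--" + mapped.encode("punycode").decode("ascii")

# BEH + FATHA + ALEF + BEH and BEH + ALEF + BEH give the same encoded label.
print(to_ace("\u0628\u064E\u0627\u0628", "ur") == to_ace("\u0628\u0627\u0628", "ur"))
```

Note that to_ace cannot do anything until the caller supplies the language, which is exactly the difficulty raised in the paragraph above.
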
> 
> So I can see the need, but I'm not sure what the solution would be. I'm
> inclined to think that creating a plug-in might create more trouble
> than benefit, by replacing the problems of errors and ambiguities with
> the problems of uninteroperable IDNs.
> 
> There is an Excel file attached that lists which characters in the
> Arabic block would be appropriate for Urdu IDNs.  I will also list the
> characters below in a slightly different order.
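
Each entry in the lists that follow is identified by its code point, so the character and its Unicode name can be regenerated from that alone. A minimal Python sketch using the standard unicodedata module; the helper name describe is only illustrative:

```python
import unicodedata

def describe(code_point: int) -> str:
    """Format one entry as 'HHHH:   <character>  <UNICODE NAME>'."""
    ch = chr(code_point)
    # Fall back to a placeholder for code points unknown to this Unicode database.
    name = unicodedata.name(ch, "<unnamed>")
    return f"{code_point:04X}:   {ch}  {name}"

for cp in (0x064B, 0x0679, 0x06F1):
    print(describe(cp))
```
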
> 
> ALLOWED
> 
> The following characters will be allowed in the IRI but removed before
> conversion to punycode.
> 
> These characters are optional in Arabic script, though they can
> sometimes be useful for disambiguating pronunciation and meaning -
> particularly useful for Urdu, which has more vowel sounds than Arabic.
> 
> 064B:   ?  ARABIC FATHATAN
> 064C:   ?  ARABIC DAMMATAN
> 064D:   ?  ARABIC KASRATAN
> 064E:   ?  ARABIC FATHA
> 064F:   ?  ARABIC DAMMA
> 0650:   ?  ARABIC KASRA
> 0651:   ?  ARABIC SHADDA
> 0652:   ?  ARABIC SUKUN
> 0655:   ?  ARABIC HAMZA BELOW
> 0656:   ?  ARABIC SUBSCRIPT ALEF
> 0658:   ?  ARABIC MARK NOON GHUNNA
> 0670:   ?  ARABIC LETTER SUPERSCRIPT ALEF
> 0612:   ?  ARABIC SIGN RAHMATULLAH ALAYHE
> 0614:   ?  ARABIC SIGN TAKHALLUS
> 
> Space and zero-width non-joiner characters will also be allowed, but
> removed during the conversion to punycode.
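
The removal step described above could be sketched in Python roughly as follows. The code points are copied from the list; REMOVED and strip_optional are hypothetical names, and U+0020 and U+200C stand in for "space and zero-width non-joiner".

```python
# Code points taken from the list above, plus SPACE and ZERO WIDTH NON-JOINER.
REMOVED = frozenset([
    0x064B, 0x064C, 0x064D, 0x064E, 0x064F, 0x0650, 0x0651, 0x0652,
    0x0655, 0x0656, 0x0658, 0x0670, 0x0612, 0x0614,
    0x0020, 0x200C,
])

def strip_optional(label: str) -> str:
    """Drop every character whose code point is in REMOVED."""
    return "".join(ch for ch in label if ord(ch) not in REMOVED)

# BEH + FATHA + ALEF + BEH loses only the FATHA.
vowelled = "\u0628\u064E\u0627\u0628"
print([f"{ord(c):04X}" for c in strip_optional(vowelled)])
```
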
> 
> Some other characters used in Arabic but not Urdu will be allowed but
> will be converted to a character used in Urdu during conversion to
> punycode. They are included in the set of allowed characters, however,
> to avoid confusion when they are used incorrectly.
> 
> 0629:   ?  ARABIC LETTER TEH MARBUTA
> 0643:   ?  ARABIC LETTER KAF
> 0649:   ?  ARABIC LETTER ALEF MAKSURA
> 064A:   ?  ARABIC LETTER YEH
> 0660:   ?  ARABIC-INDIC DIGIT ZERO
> 0661:   ?  ARABIC-INDIC DIGIT ONE
> 0662:   ?  ARABIC-INDIC DIGIT TWO
> 0663:   ?  ARABIC-INDIC DIGIT THREE
> 0664:   ?  ARABIC-INDIC DIGIT FOUR
> 0665:   ?  ARABIC-INDIC DIGIT FIVE
> 0666:   ?  ARABIC-INDIC DIGIT SIX
> 0667:   ?  ARABIC-INDIC DIGIT SEVEN
> 0668:   ?  ARABIC-INDIC DIGIT EIGHT
> 0669:   ?  ARABIC-INDIC DIGIT NINE
> 06C0:   ?  ARABIC LETTER HEH WITH YEH ABOVE
> 0625:   ?  ARABIC LETTER ALEF WITH HAMZA BELOW
> 
> European digits are also mapped to the Urdu digits.
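
For the digits, the mapping can be written down directly, assuming that "Urdu digits" means the EXTENDED ARABIC-INDIC DIGIT range (U+06F0 to U+06F9) and that each digit maps to the one with the same value. The target characters for the letters are not given in the message, so they are left out of this sketch. DIGIT_MAP and to_urdu_digits are hypothetical names.

```python
# Target digits: EXTENDED ARABIC-INDIC DIGIT ZERO..NINE (U+06F0..U+06F9).
DIGIT_MAP = {}
for i in range(10):
    DIGIT_MAP[ord("0") + i] = 0x06F0 + i   # European digits
    DIGIT_MAP[0x0660 + i] = 0x06F0 + i     # Arabic-Indic digits

def to_urdu_digits(label: str) -> str:
    """Map European and Arabic-Indic digits to the Urdu digit forms."""
    return label.translate(DIGIT_MAP)

print([f"{ord(c):04X}" for c in to_urdu_digits("2007")])
```
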
> 
> The following will be permitted in the IRI, but decomposed before
> conversion to punycode:
> 
> FDF2:   ?  ARABIC LIGATURE ALLAH ISOLATED FORM
> FDF3:   ?  ARABIC LIGATURE AKBAR ISOLATED FORM
> FDF4:   ?  ARABIC LIGATURE MOHAMMAD ISOLATED FORM
> FDF5:   ?  ARABIC LIGATURE SALAM ISOLATED FORM
> FDF6:   ?  ARABIC LIGATURE RASOUL ISOLATED FORM
> FDF7:   ?  ARABIC LIGATURE ALAYHE ISOLATED FORM
> FDF8:   ?  ARABIC LIGATURE WASALLAM ISOLATED FORM
> FDF9:   ?  ARABIC LIGATURE SALLA ISOLATED FORM
> FDFA:   ?  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
> FDFB:   ?  ARABIC LIGATURE JALLAJALALOUHOU
> 
> Some combinations of diacritic and base character will be allowed:
> 
> 0622:   ?  ARABIC LETTER ALEF WITH MADDA ABOVE
> 0623:   ?  ARABIC LETTER ALEF WITH HAMZA ABOVE
> 0624:   ?  ARABIC LETTER WAW WITH HAMZA ABOVE
> 06C2:   ?  ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
> 06D3:   ?  ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
> 
> The following mandatory honorific marks will be allowed:
> 
> 0610:   ?  ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
> 0611:   ?  ARABIC SIGN ALAYHE ASSALLAM
> 0613:   ?  ARABIC SIGN RADI ALLAHOU ANHU
> 
> The following are basic alphabetic characters for Urdu, and will
> therefore be allowed:
> 
> 0621:   ?  ARABIC LETTER HAMZA
> 0627:   ?  ARABIC LETTER ALEF
> 0628:   ?  ARABIC LETTER BEH
> 062A:   ?  ARABIC LETTER TEH
> 062B:   ?  ARABIC LETTER THEH
> 062C:   ?  ARABIC LETTER JEEM
> 062D:   ?  ARABIC LETTER HAH
> 062E:   ?  ARABIC LETTER KHAH
> 062F:   ?  ARABIC LETTER DAL
> 0630:   ?  ARABIC LETTER THAL
> 0631:   ?  ARABIC LETTER REH
> 0632:   ?  ARABIC LETTER ZAIN
> 0633:   ?  ARABIC LETTER SEEN
> 0634:   ?  ARABIC LETTER SHEEN
> 0635:   ?  ARABIC LETTER SAD
> 0636:   ?  ARABIC LETTER DAD
> 0637:   ?  ARABIC LETTER TAH
> 0638:   ?  ARABIC LETTER ZAH
> 0639:   ?  ARABIC LETTER AIN
> 063A:   ?  ARABIC LETTER GHAIN
> 0641:   ?  ARABIC LETTER FEH
> 0642:   ?  ARABIC LETTER QAF
> 0644:   ?  ARABIC LETTER LAM
> 0645:   ?  ARABIC LETTER MEEM
> 0646:   ?  ARABIC LETTER NOON
> 0647:   ?  ARABIC LETTER HEH
> 0648:   ?  ARABIC LETTER WAW
> 0679:   ?  ARABIC LETTER TTEH
> 067A:   ?  ARABIC LETTER TTEHEH
> 067B:   ?  ARABIC LETTER BEEH
> 067C:   ?  ARABIC LETTER TEH WITH RING
> 067D:   ?  ARABIC LETTER TEH WITH THREE DOTS ABOVE DOWNWARDS
> 067E:   ?  ARABIC LETTER PEH
> 067F:   ?  ARABIC LETTER TEHEH
> 0680:   ?  ARABIC LETTER BEHEH
> 0681:   ?  ARABIC LETTER HAH WITH HAMZA ABOVE
> 0682:   ?  ARABIC LETTER HAH WITH TWO DOTS VERTICAL ABOVE
> 0683:   ?  ARABIC LETTER NYEH
> 0684:   ?  ARABIC LETTER DYEH
> 0685:   ?  ARABIC LETTER HAH WITH THREE DOTS ABOVE
> 0686:   ?  ARABIC LETTER TCHEH
> 0687:   ?  ARABIC LETTER TCHEHEH
> 0688:   ?  ARABIC LETTER DDAL
> 0689:   ?  ARABIC LETTER DAL WITH RING
> 068A:   ?  ARABIC LETTER DAL WITH DOT BELOW
> 068B:   ?  ARABIC LETTER DAL WITH DOT BELOW AND SMALL TAH
> 068C:   ?  ARABIC LETTER DAHAL
> 068D:   ?  ARABIC LETTER DDAHAL
> 068E:   ?  ARABIC LETTER DUL
> 068F:   ?  ARABIC LETTER DAL WITH THREE DOTS ABOVE DOWNWARDS
> 0690:   ?  ARABIC LETTER DAL WITH FOUR DOTS ABOVE
> 0691:   ?  ARABIC LETTER RREH
> 0692:   ?  ARABIC LETTER REH WITH SMALL V
> 0693:   ?  ARABIC LETTER REH WITH RING
> 0694:   ?  ARABIC LETTER REH WITH DOT BELOW
> 0695:   ?  ARABIC LETTER REH WITH SMALL V BELOW
> 0696:   ?  ARABIC LETTER REH WITH DOT BELOW AND DOT ABOVE
> 0697:   ?  ARABIC LETTER REH WITH TWO DOTS ABOVE
> 0698:   ?  ARABIC LETTER JEH
> 0699:   ?  ARABIC LETTER REH WITH FOUR DOTS ABOVE
> 069A:   ?  ARABIC LETTER SEEN WITH DOT BELOW AND DOT ABOVE
> 069B:   ?  ARABIC LETTER SEEN WITH THREE DOTS BELOW
> 069C:   ?  ARABIC LETTER SEEN WITH THREE DOTS BELOW AND THREE DOTS ABOVE
> 069D:   ?  ARABIC LETTER SAD WITH TWO DOTS BELOW
> 069E:   ?  ARABIC LETTER SAD WITH THREE DOTS ABOVE
> 069F:   ?  ARABIC LETTER TAH WITH THREE DOTS ABOVE
> 06A0:   ?  ARABIC LETTER AIN WITH THREE DOTS ABOVE
> 06A1:   ?  ARABIC LETTER DOTLESS FEH
> 06A2:   ?  ARABIC LETTER FEH WITH DOT MOVED BELOW
> 06A3:   ?  ARABIC LETTER FEH WITH DOT BELOW
> 06A4:   ?  ARABIC LETTER VEH
> 06A5:   ?  ARABIC LETTER FEH WITH THREE DOTS BELOW
> 06A6:   ?  ARABIC LETTER PEHEH
> 06A7:   ?  ARABIC LETTER QAF WITH DOT ABOVE
> 06A8:   ?  ARABIC LETTER QAF WITH THREE DOTS ABOVE
> 06A9:   ?  ARABIC LETTER KEHEH
> 06AA:   ?  ARABIC LETTER SWASH KAF
> 06AB:   ?  ARABIC LETTER KAF WITH RING
> 06AC:   ?  ARABIC LETTER KAF WITH DOT ABOVE
> 06AD:   ?  ARABIC LETTER NG
> 06AE:   ?  ARABIC LETTER KAF WITH THREE DOTS BELOW
> 06AF:   ?  ARABIC LETTER GAF
> 06B0:   ?  ARABIC LETTER GAF WITH RING
> 06B1:   ?  ARABIC LETTER NGOEH
> 06B2:   ?  ARABIC LETTER GAF WITH TWO DOTS BELOW
> 06B3:   ?  ARABIC LETTER GUEH
> 06B4:   ?  ARABIC LETTER GAF WITH THREE DOTS ABOVE
> 06B5:   ?  ARABIC LETTER LAM WITH SMALL V
> 06B6:   ?  ARABIC LETTER LAM WITH DOT ABOVE
> 06B7:   ?  ARABIC LETTER LAM WITH THREE DOTS ABOVE
> 06B8:   ?  ARABIC LETTER LAM WITH THREE DOTS BELOW
> 06B9:   ?  ARABIC LETTER NOON WITH DOT BELOW
> 06BA:   ?  ARABIC LETTER NOON GHUNNA
> 06BB:   ?  ARABIC LETTER RNOON
> 06BC:   ?  ARABIC LETTER NOON WITH RING
> 06BD:   ?  ARABIC LETTER NOON WITH THREE DOTS ABOVE
> 06BE:   ?  ARABIC LETTER HEH DOACHASHMEE
> 06C1:   ?  ARABIC LETTER HEH GOAL
> 06C3:   ?  ARABIC LETTER TEH MARBUTA GOAL
> 06CC:   ?  ARABIC LETTER FARSI YEH
> 06D2:   ?  ARABIC LETTER YEH BARREE
> 
> The following combinations of base character and diacritic as a single
> character will also be allowed:
> 
> 0622:   ?  ARABIC LETTER ALEF WITH MADDA ABOVE
> 0623:   ?  ARABIC LETTER ALEF WITH HAMZA ABOVE
> 0624:   ?  ARABIC LETTER WAW WITH HAMZA ABOVE
> 06C2:   ?  ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
> 06D3:   ?  ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
> 
> The following Urdu digits are allowed:
> 
> 06F0:   ?  EXTENDED ARABIC-INDIC DIGIT ZERO
> 06F1:   ?  EXTENDED ARABIC-INDIC DIGIT ONE
> 06F2:   ?  EXTENDED ARABIC-INDIC DIGIT TWO
> 06F3:   ?  EXTENDED ARABIC-INDIC DIGIT THREE
> 06F4:   ?  EXTENDED ARABIC-INDIC DIGIT FOUR
> 06F5:   ?  EXTENDED ARABIC-INDIC DIGIT FIVE
> 06F6:   ?  EXTENDED ARABIC-INDIC DIGIT SIX
> 06F7:   ?  EXTENDED ARABIC-INDIC DIGIT SEVEN
> 06F8:   ?  EXTENDED ARABIC-INDIC DIGIT EIGHT
> 06F9:   ?  EXTENDED ARABIC-INDIC DIGIT NINE
> 
> 
> 
> 
> NOT ALLOWED
> 
> The following are disallowed because they don't appear in plain text:
> 
> 0600:   ?  ARABIC NUMBER SIGN
> 0601:   ?  ARABIC SIGN SANAH
> 060E:   ?  ARABIC POETIC VERSE SIGN
> 060F:   ?  ARABIC SIGN MISRA
> 0602:   ?  ARABIC FOOTNOTE MARKER
> 0603:   ?  ARABIC SIGN SAFHA
> 066D:   ?  ARABIC FIVE POINTED STAR
> 
> 
> The following are disallowed because they are mathematical signs or, in
> the case of the tatweel, just stylistic:
> 
> 066A:   ?  ARABIC PERCENT SIGN
> 066B:   ?  ARABIC DECIMAL SEPARATOR
> 066C:   ?  ARABIC THOUSANDS SEPARATOR
> 0640:   ?  ARABIC TATWEEL
> 
> The following combining characters are allowed, since their absence or
> omission can change meaning:
> 
> 0653:   ?  ARABIC MADDAH ABOVE
> 0654:   ?  ARABIC HAMZA ABOVE
> 
> The following are disallowed because they are punctuation:
> 
> 061B:   ?  ARABIC SEMICOLON
> 061F:   ?  ARABIC QUESTION MARK
> 06D4:   ?  ARABIC FULL STOP
> 060C:   ?  ARABIC COMMA
> 060D:   ?  ARABIC DATE SEPARATOR
> 
> 
> The remaining characters are not used for Urdu:
> 
> 061E:   ?  ARABIC TRIPLE DOT PUNCTUATION MARK
> 060B:   ?  AFGHANI SIGN
> 0615:   ?  ARABIC SMALL HIGH TAH
> 0657:   ?  ARABIC INVERTED DAMMA
> 0659:   ?  ARABIC ZWARAKAY
> 065A:   ?  ARABIC VOWEL SIGN SMALL V ABOVE
> 065B:   ?  ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
> 065C:   ?  ARABIC VOWEL SIGN DOT BELOW
> 065D:   ?  ARABIC REVERSED DAMMA
> 065E:   ?  ARABIC FATHA WITH TWO DOTS
> 066E:   ?  ARABIC LETTER DOTLESS BEH
> 066F:   ?  ARABIC LETTER DOTLESS QAF
> 0671:   ?  ARABIC LETTER ALEF WASLA
> 0672:   ?  ARABIC LETTER ALEF WITH WAVY HAMZA ABOVE
> 0673:   ?  ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
> 0674:   ?  ARABIC LETTER HIGH HAMZA
> 0675:   ?  ARABIC LETTER HIGH HAMZA ALEF
> 0676:   ?  ARABIC LETTER HIGH HAMZA WAW
> 0677:   ?  ARABIC LETTER U WITH HAMZA ABOVE
> 0678:   ?  ARABIC LETTER HIGH HAMZA YEH
> 06BF:   ?  ARABIC LETTER TCHEH WITH DOT ABOVE
> 06C4:   ?  ARABIC LETTER WAW WITH RING
> 06C5:   ?  ARABIC LETTER KIRGHIZ OE
> 06C6:   ?  ARABIC LETTER OE
> 06C7:   ?  ARABIC LETTER U
> 06C8:   ?  ARABIC LETTER YU
> 06C9:   ?  ARABIC LETTER KIRGHIZ YU
> 06CA:   ?  ARABIC LETTER WAW WITH TWO DOTS ABOVE
> 06CB:   ?  ARABIC LETTER VE
> 06CD:   ?  ARABIC LETTER YEH WITH TAIL
> 06CE:   ?  ARABIC LETTER YEH WITH SMALL V
> 06CF:   ?  ARABIC LETTER WAW WITH DOT ABOVE
> 06D0:   ?  ARABIC LETTER E
> 06D1:   ?  ARABIC LETTER YEH WITH THREE DOTS BELOW
> 06D5:   ?  ARABIC LETTER AE
> 06D6:   ?  ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
> 06D7:   ?  ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
> 06D8:   ?  ARABIC SMALL HIGH MEEM INITIAL FORM
> 06D9:   ?  ARABIC SMALL HIGH LAM ALEF
> 06DA:   ?  ARABIC SMALL HIGH JEEM
> 06DB:   ?  ARABIC SMALL HIGH THREE DOTS
> 06DC:   ?  ARABIC SMALL HIGH SEEN
> 06DD:   ?  ARABIC END OF AYAH
> 06DE:   ?  ARABIC START OF RUB EL HIZB
> 06DF:   ?  ARABIC SMALL HIGH ROUNDED ZERO
> 06E0:   ?  ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
> 06E1:   ?  ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
> 06E2:   ?  ARABIC SMALL HIGH MEEM ISOLATED FORM
> 06E3:   ?  ARABIC SMALL LOW SEEN
> 06E4:   ?  ARABIC SMALL HIGH MADDA
> 06E5:   ?  ARABIC SMALL WAW
> 06E6:   ?  ARABIC SMALL YEH
> 06E7:   ?  ARABIC SMALL HIGH YEH
> 06E8:   ?  ARABIC SMALL HIGH NOON
> 06E9:   ?  ARABIC PLACE OF SAJDAH
> 06EA:   ?  ARABIC EMPTY CENTRE LOW STOP
> 06EB:   ?  ARABIC EMPTY CENTRE HIGH STOP
> 06EC:   ?  ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
> 06ED:   ?  ARABIC SMALL LOW MEEM
> 06EE:   ?  ARABIC LETTER DAL WITH INVERTED V
> 06EF:   ?  ARABIC LETTER REH WITH INVERTED V
> 06FA:   ?  ARABIC LETTER SHEEN WITH DOT BELOW
> 06FB:   ?  ARABIC LETTER DAD WITH DOT BELOW
> 06FC:   ?  ARABIC LETTER GHAIN WITH DOT BELOW
> 06FD:   ?  ARABIC SIGN SINDHI AMPERSAND
> 06FE:   ?  ARABIC SIGN SINDHI POSTPOSITION MEN
> 
> 
> 
> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
> 
> http://www.w3.org/People/Ishida/
> http://www.w3.org/International/
> http://people.w3.org/rishida/blog/
> http://www.flickr.com/photos/ishida/
> 
Received on Monday, 30 July 2007 20:24:10 GMT
