Urdu IDNs: Characters in domain names

Sarmad Hussain, at the Center for Research in Urdu Language Processing, FAST National University, Pakistan, is looking at enabling Urdu IDNs based on ICANN recommendations, and this may lead to similar approaches in a number of other countries.

There are some aspects to Sarmad’s proposal, arising from the nature of the Arabic script used for Urdu, that raise some interesting questions about the way IDN works for this kind of language. These have to do with the choice of characters allowed in a domain name. 

For example, there is a suggestion that users should be able to use certain characters when writing a URI in Urdu which are then either removed (eg. vowel diacritics) or converted to other characters (eg. Arabic characters) during the conversion to punycode by a user agent plug-in.

This is not something that is normally relevant for English-only URIs, because of the relative simplicity of our alphabet. There is much more potential ambiguity in Urdu for use of characters. Note, however, that the proposals Sarmad is making are language-specific, not script-specific, ie. Arabic or Persian (also written with the Arabic script) would need some slightly different rules.

I find myself wondering whether you could use a plug-in to strip out or convert the characters while converting to punycode. People typing IDNs in Urdu would need to be aware of the need for a plug-in, and would still need to know how to type in IDNs if they found themselves using a browser that didn’t have the plug-in (eg. the businessman who is visiting a corporation in the US that prevents ad hoc downloads of software). On the one hand, I wonder whether we can expect a user who sees a URI on a hard copy brochure containing vowel diacritics to know what to do if their browser or mail client doesn’t support the plug-in. On the other hand, a person writing a clickable URI in HTML or an email would not be able to guarantee that users would have access to the plug-in. In that case, they would be unwise to use things like short vowel diacritics, since the user cannot easily change the link if they don’t have a plug-in. Imagine a vowelled IDN coming through in a plain text email, for example: the reader may need to edit the email text to get to the resource rather than just click on it. Not likely to be popular.
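To make the concern concrete, here is a minimal Python sketch of what happens when no plug-in is present. It uses Python's built-in `punycode` codec with no other processing; the two-letter label and the helper name `to_ace` are made up for illustration and are not part of the proposal.

```python
# Vowelled and unvowelled spellings are different code point sequences,
# so with no extra processing they encode to different ACE labels.
FATHA = "\u064E"  # ARABIC FATHA, one of the optional vowel marks


def to_ace(label: str) -> str:
    """Encode one label with raw punycode and add the xn-- prefix.

    Deliberately applies no nameprep, stripping, or mapping.
    """
    return "xn--" + label.encode("punycode").decode("ascii")


plain = "\u0628\u0633"                  # BEH + SEEN, a hypothetical label
vowelled = "\u0628" + FATHA + "\u0633"  # the same label with a FATHA added

print(to_ace(plain))
print(to_ace(vowelled))
print(to_ace(plain) == to_ace(vowelled))  # False: two distinct names
```

A user agent without the stripping step would therefore look up a different name from the one that was registered.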

Another alternative is to do such removal and conversion of characters as part of the standard punycode conversion process. This, I suspect, would require every browser to have access to standardised tables of characters that should be ignored or converted for any language. But there is an additional problem in that the language would need to be determined correctly before such rules were applied - that is, the language of the original URI. That too seems a bit difficult.
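The language dependency can be sketched roughly as follows. Everything in the tables is a placeholder: the Urdu entries are guesses at plausible conversions, and the empty Arabic table simply stands for "different rules". The names `RULES` and `convert` are hypothetical.

```python
# Hypothetical per-language conversion tables (placeholders, not from the proposal).
KAF, KEHEH = "\u0643", "\u06A9"
YEH, FARSI_YEH = "\u064A", "\u06CC"

RULES = {
    "ur": {KAF: KEHEH, YEH: FARSI_YEH},  # assumed Urdu conversions
    "ar": {},                            # assumed: Arabic keeps both letters
}


def convert(label: str, lang: str) -> str:
    """Apply the conversion table for `lang` to every character of `label`."""
    table = RULES[lang]  # raises KeyError when the language is not known
    return "".join(table.get(ch, ch) for ch in label)


label = KAF + YEH
print(convert(label, "ur") == convert(label, "ar"))  # False
```

The same typed label comes out differently depending on the language passed in, which is why the language of the original URI would have to be known before any such table could be applied.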

So I can see the need, but I’m not sure what the solution would be. I’m inclined to think that creating a plug-in might create more trouble than benefit, by replacing the problems of errors and ambiguities with the problems of non-interoperable IDNs.

There is an Excel file attached that lists which characters in the Arabic block would be appropriate for Urdu IDNs. I will also list the characters below in a slightly different order.

ALLOWED

The following characters will be allowed in the IRI but removed before conversion to punycode.

These characters are optional in Arabic script, though they can sometimes be useful for disambiguating pronunciation and meaning - particularly useful for Urdu, which has more vowel sounds than Arabic.‎

064B:‎‎   ً  ARABIC FATHATAN
064C:‎‎   ٌ  ARABIC DAMMATAN
064D:‎‎   ٍ  ARABIC KASRATAN
064E:‎‎   َ  ARABIC FATHA
064F:‎‎   ُ  ARABIC DAMMA
0650:‎‎   ِ  ARABIC KASRA
0651:‎‎   ّ  ARABIC SHADDA
0652:‎‎   ْ  ARABIC SUKUN
0655:‎‎   ٕ  ARABIC HAMZA BELOW
0656:‎‎   ٖ  ARABIC SUBSCRIPT ALEF
0658:‎‎   ٘  ARABIC MARK NOON GHUNNA
0670:‎‎   ٰ  ARABIC LETTER SUPERSCRIPT ALEF
0612:‎   ؒ  ARABIC SIGN RAHMATULLAH ALAYHE
0614:‎   ؔ  ARABIC SIGN TAKHALLUS

Space and zero-width non-joiner characters will also be allowed, but removed during the conversion to punycode.
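The removal step described above could look something like this in Python. The set is built from the code points listed above plus space and zero-width non-joiner; the names `REMOVED` and `strip_optional` are hypothetical, and where this step would run (plug-in or elsewhere) is left open.

```python
# Characters allowed in the IRI but dropped before conversion to punycode.
REMOVED = {
    *map(chr, range(0x064B, 0x0653)),  # FATHATAN through SUKUN (064B-0652)
    "\u0655",  # ARABIC HAMZA BELOW
    "\u0656",  # ARABIC SUBSCRIPT ALEF
    "\u0658",  # ARABIC MARK NOON GHUNNA
    "\u0670",  # ARABIC LETTER SUPERSCRIPT ALEF
    "\u0612",  # ARABIC SIGN RAHMATULLAH ALAYHE
    "\u0614",  # ARABIC SIGN TAKHALLUS
    " ",       # space
    "\u200C",  # ZERO WIDTH NON-JOINER
}


def strip_optional(label: str) -> str:
    """Return `label` without the optional marks, spaces, and ZWNJ."""
    return "".join(ch for ch in label if ch not in REMOVED)


# BEH + FATHA + SEEN + space + ZWNJ reduces to BEH + SEEN.
print(strip_optional("\u0628\u064E\u0633 \u200C") == "\u0628\u0633")  # True
```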

Some other characters used in Arabic but not Urdu will be allowed but will be converted to a character used in Urdu during conversion to punycode. They are included in the set of allowed characters, however, to avoid confusion when they are used incorrectly.

0629:‎   ة  ARABIC LETTER TEH MARBUTA
0643:‎   ك  ARABIC LETTER KAF
0649:‎   ى  ARABIC LETTER ALEF MAKSURA
064A:‎   ي  ARABIC LETTER YEH
0660:‎   ٠  ARABIC-INDIC DIGIT ZERO
0661:‎   ١  ARABIC-INDIC DIGIT ONE
0662:‎   ٢  ARABIC-INDIC DIGIT TWO
0663:‎   ٣  ARABIC-INDIC DIGIT THREE
0664:‎   ٤  ARABIC-INDIC DIGIT FOUR
0665:‎   ٥  ARABIC-INDIC DIGIT FIVE
0666:‎   ٦  ARABIC-INDIC DIGIT SIX
0667:‎   ٧  ARABIC-INDIC DIGIT SEVEN
0668:‎   ٨  ARABIC-INDIC DIGIT EIGHT
0669:‎   ٩  ARABIC-INDIC DIGIT NINE
06C0:‎   ۀ  ARABIC LETTER HEH WITH YEH ABOVE
0625:‎   إ  ARABIC LETTER ALEF WITH HAMZA BELOW

European digits are also mapped to the Urdu digits.
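A rough Python sketch of the conversion step follows. The digit mapping assumes that "Urdu digits" means the EXTENDED ARABIC-INDIC DIGIT range (06F0-06F9). The target letters are not given in the list above, so every entry in `LETTER_GUESSES` is only a guess at a plausible target; the names `DIGITS`, `LETTER_GUESSES`, and `map_to_urdu` are hypothetical.

```python
# Digits: both Arabic-Indic (0660-0669) and European (0-9) map to 06F0-06F9.
DIGITS = {}
for i in range(10):
    urdu_digit = chr(0x06F0 + i)
    DIGITS[chr(0x0660 + i)] = urdu_digit  # Arabic-Indic digit
    DIGITS[str(i)] = urdu_digit           # European digit

# Letters: targets are NOT stated in the text above; these are guesses.
LETTER_GUESSES = {
    "\u0643": "\u06A9",  # KAF -> KEHEH (guess)
    "\u064A": "\u06CC",  # YEH -> FARSI YEH (guess)
    "\u0649": "\u06CC",  # ALEF MAKSURA -> FARSI YEH (guess)
    "\u0629": "\u06C3",  # TEH MARBUTA -> TEH MARBUTA GOAL (guess)
    "\u06C0": "\u06C2",  # HEH WITH YEH ABOVE -> HEH GOAL WITH HAMZA ABOVE (guess)
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> ALEF (guess)
}

MAPPING = str.maketrans({**DIGITS, **LETTER_GUESSES})


def map_to_urdu(label: str) -> str:
    """Replace mapped characters; everything else passes through unchanged."""
    return label.translate(MAPPING)


print(map_to_urdu("2007") == "\u06F2\u06F0\u06F0\u06F7")  # True
```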

The following will be permitted in the IRI, but decomposed before conversion to punycode:‎

FDF2:‎   ﷲ  ARABIC LIGATURE ALLAH ISOLATED FORM
FDF3:‎   ﷳ  ARABIC LIGATURE AKBAR ISOLATED FORM
FDF4:‎   ﷴ  ARABIC LIGATURE MOHAMMAD ISOLATED FORM
FDF5:‎   ﷵ  ARABIC LIGATURE SALAM ISOLATED FORM
FDF6:‎   ﷶ  ARABIC LIGATURE RASOUL ISOLATED FORM
FDF7:‎   ﷷ  ARABIC LIGATURE ALAYHE ISOLATED FORM
FDF8:‎   ﷸ  ARABIC LIGATURE WASALLAM ISOLATED FORM
FDF9:‎   ﷹ  ARABIC LIGATURE SALLA ISOLATED FORM
FDFA:‎   ﷺ  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB:‎   ﷻ  ARABIC LIGATURE JALLAJALALOUHOU
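One way to do the decomposition in Python is to apply Unicode compatibility decomposition (NFKD) to just these ten code points. Whether the proposal intends NFKD or its own table is not stated, so treat that choice as an assumption; `LIGATURES` and `decompose_ligatures` are hypothetical names.

```python
import unicodedata

# The ten ligature code points listed above, FDF2 through FDFB.
LIGATURES = {chr(cp) for cp in range(0xFDF2, 0xFDFC)}


def decompose_ligatures(label: str) -> str:
    """Expand the listed ligatures via NFKD; leave other characters alone."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if ch in LIGATURES else ch
        for ch in label
    )


# ARABIC LIGATURE ALLAH ISOLATED FORM expands to ALEF LAM LAM HEH.
print(decompose_ligatures("\uFDF2") == "\u0627\u0644\u0644\u0647")  # True
```

Restricting NFKD to the listed characters avoids changing anything else in the label as a side effect.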

Some combinations of diacritic and base character will be allowed:‎

0622:‎   آ  ARABIC LETTER ALEF WITH MADDA ABOVE
0623:‎   أ  ARABIC LETTER ALEF WITH HAMZA ABOVE
0624:‎   ؤ  ARABIC LETTER WAW WITH HAMZA ABOVE
06C2:‎   ۂ  ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
06D3:‎   ۓ  ARABIC LETTER YEH BARREE WITH HAMZA ABOVE

The following mandatory honorific marks will be allowed:

0610:‎   ؐ  ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
0611:‎   ؑ  ARABIC SIGN ALAYHE ASSALLAM
0613:‎   ؓ  ARABIC SIGN RADI ALLAHOU ANHU

The following are basic alphabetic characters for Urdu, and will therefore be allowed:‎

0621:‎   ء  ARABIC LETTER HAMZA
0627:‎   ا  ARABIC LETTER ALEF
0628:‎   ب  ARABIC LETTER BEH
062A:‎   ت  ARABIC LETTER TEH
062B:‎   ث  ARABIC LETTER THEH
062C:‎   ج  ARABIC LETTER JEEM
062D:‎   ح  ARABIC LETTER HAH
062E:‎   خ  ARABIC LETTER KHAH
062F:‎   د  ARABIC LETTER DAL
0630:‎   ذ  ARABIC LETTER THAL
0631:‎   ر  ARABIC LETTER REH
0632:‎   ز  ARABIC LETTER ZAIN
0633:‎   س  ARABIC LETTER SEEN
0634:‎   ش  ARABIC LETTER SHEEN
0635:‎   ص  ARABIC LETTER SAD
0636:‎   ض  ARABIC LETTER DAD
0637:‎   ط  ARABIC LETTER TAH
0638:‎   ظ  ARABIC LETTER ZAH
0639:‎   ع  ARABIC LETTER AIN
063A:‎   غ  ARABIC LETTER GHAIN
0641:‎   ف  ARABIC LETTER FEH
0642:‎   ق  ARABIC LETTER QAF
0644:‎   ل  ARABIC LETTER LAM
0645:‎   م  ARABIC LETTER MEEM
0646:‎   ن  ARABIC LETTER NOON
0647:‎   ه  ARABIC LETTER HEH
0648:‎   و  ARABIC LETTER WAW
0679:‎   ٹ  ARABIC LETTER TTEH
067A:‎   ٺ  ARABIC LETTER TTEHEH
067B:‎   ٻ  ARABIC LETTER BEEH
067C:‎   ټ  ARABIC LETTER TEH WITH RING
067D:‎   ٽ  ARABIC LETTER TEH WITH THREE DOTS ABOVE DOWNWARDS
067E:‎   پ  ARABIC LETTER PEH
067F:‎   ٿ  ARABIC LETTER TEHEH
0680:‎   ڀ  ARABIC LETTER BEHEH
0681:‎   ځ  ARABIC LETTER HAH WITH HAMZA ABOVE
0682:‎   ڂ  ARABIC LETTER HAH WITH TWO DOTS VERTICAL ABOVE
0683:‎   ڃ  ARABIC LETTER NYEH
0684:‎   ڄ  ARABIC LETTER DYEH
0685:‎   څ  ARABIC LETTER HAH WITH THREE DOTS ABOVE
0686:‎   چ  ARABIC LETTER TCHEH
0687:‎   ڇ  ARABIC LETTER TCHEHEH
0688:‎   ڈ  ARABIC LETTER DDAL
0689:‎   ډ  ARABIC LETTER DAL WITH RING
068A:‎   ڊ  ARABIC LETTER DAL WITH DOT BELOW
068B:‎   ڋ  ARABIC LETTER DAL WITH DOT BELOW AND SMALL TAH
068C:‎   ڌ  ARABIC LETTER DAHAL
068D:‎   ڍ  ARABIC LETTER DDAHAL
068E:‎   ڎ  ARABIC LETTER DUL
068F:‎   ڏ  ARABIC LETTER DAL WITH THREE DOTS ABOVE DOWNWARDS
0690:‎   ڐ  ARABIC LETTER DAL WITH FOUR DOTS ABOVE
0691:‎   ڑ  ARABIC LETTER RREH
0692:‎   ڒ  ARABIC LETTER REH WITH SMALL V
0693:‎   ړ  ARABIC LETTER REH WITH RING
0694:‎   ڔ  ARABIC LETTER REH WITH DOT BELOW
0695:‎   ڕ  ARABIC LETTER REH WITH SMALL V BELOW
0696:‎   ږ  ARABIC LETTER REH WITH DOT BELOW AND DOT ABOVE
0697:‎   ڗ  ARABIC LETTER REH WITH TWO DOTS ABOVE
0698:‎   ژ  ARABIC LETTER JEH
0699:‎   ڙ  ARABIC LETTER REH WITH FOUR DOTS ABOVE
069A:‎   ښ  ARABIC LETTER SEEN WITH DOT BELOW AND DOT ABOVE
069B:‎   ڛ  ARABIC LETTER SEEN WITH THREE DOTS BELOW
069C:‎   ڜ  ARABIC LETTER SEEN WITH THREE DOTS BELOW AND THREE DOTS ABOVE
069D:‎   ڝ  ARABIC LETTER SAD WITH TWO DOTS BELOW
069E:‎   ڞ  ARABIC LETTER SAD WITH THREE DOTS ABOVE
069F:‎   ڟ  ARABIC LETTER TAH WITH THREE DOTS ABOVE
06A0:‎   ڠ  ARABIC LETTER AIN WITH THREE DOTS ABOVE
06A1:‎   ڡ  ARABIC LETTER DOTLESS FEH
06A2:‎   ڢ  ARABIC LETTER FEH WITH DOT MOVED BELOW
06A3:‎   ڣ  ARABIC LETTER FEH WITH DOT BELOW
06A4:‎   ڤ  ARABIC LETTER VEH
06A5:‎   ڥ  ARABIC LETTER FEH WITH THREE DOTS BELOW
06A6:‎   ڦ  ARABIC LETTER PEHEH
06A7:‎   ڧ  ARABIC LETTER QAF WITH DOT ABOVE
06A8:‎   ڨ  ARABIC LETTER QAF WITH THREE DOTS ABOVE
06A9:‎   ک  ARABIC LETTER KEHEH
06AA:‎   ڪ  ARABIC LETTER SWASH KAF
06AB:‎   ګ  ARABIC LETTER KAF WITH RING
06AC:‎   ڬ  ARABIC LETTER KAF WITH DOT ABOVE
06AD:‎   ڭ  ARABIC LETTER NG
06AE:‎   ڮ  ARABIC LETTER KAF WITH THREE DOTS BELOW
06AF:‎   گ  ARABIC LETTER GAF
06B0:‎   ڰ  ARABIC LETTER GAF WITH RING
06B1:‎   ڱ  ARABIC LETTER NGOEH
06B2:‎   ڲ  ARABIC LETTER GAF WITH TWO DOTS BELOW
06B3:‎   ڳ  ARABIC LETTER GUEH
06B4:‎   ڴ  ARABIC LETTER GAF WITH THREE DOTS ABOVE
06B5:‎   ڵ  ARABIC LETTER LAM WITH SMALL V
06B6:‎   ڶ  ARABIC LETTER LAM WITH DOT ABOVE
06B7:‎   ڷ  ARABIC LETTER LAM WITH THREE DOTS ABOVE
06B8:‎   ڸ  ARABIC LETTER LAM WITH THREE DOTS BELOW
06B9:‎   ڹ  ARABIC LETTER NOON WITH DOT BELOW
06BA:‎   ں  ARABIC LETTER NOON GHUNNA
06BB:‎   ڻ  ARABIC LETTER RNOON
06BC:‎   ڼ  ARABIC LETTER NOON WITH RING
06BD:‎   ڽ  ARABIC LETTER NOON WITH THREE DOTS ABOVE
06BE:‎   ھ  ARABIC LETTER HEH DOACHASHMEE
06C1:‎   ہ  ARABIC LETTER HEH GOAL
06C3:‎   ۃ  ARABIC LETTER TEH MARBUTA GOAL
06CC:‎   ی  ARABIC LETTER FARSI YEH
06D2:‎   ے  ARABIC LETTER YEH BARREE

The following Urdu digits are allowed:‎

06F0:‎   ۰  EXTENDED ARABIC-INDIC DIGIT ZERO
06F1:‎   ۱  EXTENDED ARABIC-INDIC DIGIT ONE
06F2:‎   ۲  EXTENDED ARABIC-INDIC DIGIT TWO
06F3:‎   ۳  EXTENDED ARABIC-INDIC DIGIT THREE
06F4:‎   ۴  EXTENDED ARABIC-INDIC DIGIT FOUR
06F5:‎   ۵  EXTENDED ARABIC-INDIC DIGIT FIVE
06F6:‎   ۶  EXTENDED ARABIC-INDIC DIGIT SIX
06F7:‎   ۷  EXTENDED ARABIC-INDIC DIGIT SEVEN
06F8:‎   ۸  EXTENDED ARABIC-INDIC DIGIT EIGHT
06F9:‎   ۹  EXTENDED ARABIC-INDIC DIGIT NINE




NOT ALLOWED

The following are disallowed because they don't appear in plain text:‎ 

0600:‎   ؀  ARABIC NUMBER SIGN
0601:‎   ؁  ARABIC SIGN SANAH
060E:‎   ؎  ARABIC POETIC VERSE SIGN
060F:‎   ؏  ARABIC SIGN MISRA
0602:‎   ؂  ARABIC FOOTNOTE MARKER
0603:‎   ؃  ARABIC SIGN SAFHA
066D:‎   ٭  ARABIC FIVE POINTED STAR


The following are disallowed because they are mathematical signs or, in the case of the tatweel, just stylistic:‎

066A:‎   ٪  ARABIC PERCENT SIGN
066B:‎   ٫  ARABIC DECIMAL SEPARATOR
066C:‎   ٬  ARABIC THOUSANDS SEPARATOR
0640:‎   ـ  ARABIC TATWEEL

The following combining characters, by contrast, are allowed, since omitting them can change meaning:

0653:‎   ٓ  ARABIC MADDAH ABOVE
0654:‎   ٔ  ARABIC HAMZA ABOVE

The following are disallowed because they are punctuation:‎

061B:‎   ؛  ARABIC SEMICOLON
061F:‎   ؟  ARABIC QUESTION MARK
06D4:‎   ۔  ARABIC FULL STOP
060C:‎   ،  ARABIC COMMA
060D:‎   ؍  ARABIC DATE SEPARATOR


The remaining characters are not used for Urdu:‎

061E:‎   ؞  ARABIC TRIPLE DOT PUNCTUATION MARK
060B:‎   ؋  AFGHANI SIGN
0615:‎   ؕ  ARABIC SMALL HIGH TAH
0657:‎   ٗ  ARABIC INVERTED DAMMA
0659:‎   ٙ  ARABIC ZWARAKAY
065A:‎   ٚ  ARABIC VOWEL SIGN SMALL V ABOVE
065B:‎   ٛ  ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
065C:‎   ٜ  ARABIC VOWEL SIGN DOT BELOW
065D:‎   ٝ  ARABIC REVERSED DAMMA
065E:‎   ٞ  ARABIC FATHA WITH TWO DOTS
066E:‎   ٮ  ARABIC LETTER DOTLESS BEH
066F:‎   ٯ  ARABIC LETTER DOTLESS QAF
0671:‎   ٱ  ARABIC LETTER ALEF WASLA
0672:‎   ٲ  ARABIC LETTER ALEF WITH WAVY HAMZA ABOVE
0673:‎   ٳ  ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
0674:‎   ٴ  ARABIC LETTER HIGH HAMZA
0675:‎   ٵ  ARABIC LETTER HIGH HAMZA ALEF
0676:‎   ٶ  ARABIC LETTER HIGH HAMZA WAW
0677:‎   ٷ  ARABIC LETTER U WITH HAMZA ABOVE
0678:‎   ٸ  ARABIC LETTER HIGH HAMZA YEH
06BF:‎   ڿ  ARABIC LETTER TCHEH WITH DOT ABOVE
06C4:‎   ۄ  ARABIC LETTER WAW WITH RING
06C5:‎   ۅ  ARABIC LETTER KIRGHIZ OE
06C6:‎   ۆ  ARABIC LETTER OE
06C7:‎   ۇ  ARABIC LETTER U
06C8:‎   ۈ  ARABIC LETTER YU
06C9:‎   ۉ  ARABIC LETTER KIRGHIZ YU
06CA:‎   ۊ  ARABIC LETTER WAW WITH TWO DOTS ABOVE
06CB:‎   ۋ  ARABIC LETTER VE
06CD:‎   ۍ  ARABIC LETTER YEH WITH TAIL
06CE:‎   ێ  ARABIC LETTER YEH WITH SMALL V
06CF:‎   ۏ  ARABIC LETTER WAW WITH DOT ABOVE
06D0:‎   ې  ARABIC LETTER E
06D1:‎   ۑ  ARABIC LETTER YEH WITH THREE DOTS BELOW
06D5:‎   ە  ARABIC LETTER AE
06D6:‎   ۖ  ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
06D7:‎   ۗ  ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
06D8:‎   ۘ  ARABIC SMALL HIGH MEEM INITIAL FORM
06D9:‎   ۙ  ARABIC SMALL HIGH LAM ALEF
06DA:‎   ۚ  ARABIC SMALL HIGH JEEM
06DB:‎   ۛ  ARABIC SMALL HIGH THREE DOTS
06DC:‎   ۜ  ARABIC SMALL HIGH SEEN
06DD:‎   ۝  ARABIC END OF AYAH
06DE:‎   ۞  ARABIC START OF RUB EL HIZB
06DF:‎   ۟  ARABIC SMALL HIGH ROUNDED ZERO
06E0:‎   ۠  ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
06E1:‎   ۡ  ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
06E2:‎   ۢ  ARABIC SMALL HIGH MEEM ISOLATED FORM
06E3:‎   ۣ  ARABIC SMALL LOW SEEN
06E4:‎   ۤ  ARABIC SMALL HIGH MADDA
06E5:‎   ۥ  ARABIC SMALL WAW
06E6:‎   ۦ  ARABIC SMALL YEH
06E7:‎   ۧ  ARABIC SMALL HIGH YEH
06E8:‎   ۨ  ARABIC SMALL HIGH NOON
06E9:‎   ۩  ARABIC PLACE OF SAJDAH
06EA:‎   ۪  ARABIC EMPTY CENTRE LOW STOP
06EB:‎   ۫  ARABIC EMPTY CENTRE HIGH STOP
06EC:‎   ۬  ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
06ED:‎   ۭ  ARABIC SMALL LOW MEEM
06EE:‎   ۮ  ARABIC LETTER DAL WITH INVERTED V
06EF:‎   ۯ  ARABIC LETTER REH WITH INVERTED V
06FA:‎   ۺ  ARABIC LETTER SHEEN WITH DOT BELOW
06FB:‎   ۻ  ARABIC LETTER DAD WITH DOT BELOW
06FC:‎   ۼ  ARABIC LETTER GHAIN WITH DOT BELOW
06FD:‎   ۽  ARABIC SIGN SINDHI AMPERSAND
06FE:‎   ۾  ARABIC SIGN SINDHI POSTPOSITION MEN
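Pulling the lists together, a check on the final label (that is, after any removal, conversion, and decomposition) might be sketched in Python as below. The ranges are collapsed from the ALLOWED lists above, plus the two combining characters 0653 and 0654; treating this as a post-processing check is an assumption, and the names `ALLOWED_FINAL` and `disallowed` are hypothetical.

```python
def _span(first: int, last: int) -> set:
    """All characters from code point `first` to `last`, inclusive."""
    return {chr(cp) for cp in range(first, last + 1)}


ALLOWED_FINAL = (
    # basic alphabetic characters for Urdu
    {"\u0621", "\u0627", "\u0628", "\u06C1", "\u06C3", "\u06CC", "\u06D2"}
    | _span(0x062A, 0x063A)
    | _span(0x0641, 0x0642)
    | _span(0x0644, 0x0648)
    | _span(0x0679, 0x06BE)
    # precomposed base + diacritic combinations
    | {"\u0622", "\u0623", "\u0624", "\u06C2", "\u06D3"}
    # Urdu digits
    | _span(0x06F0, 0x06F9)
    # mandatory honorific marks
    | {"\u0610", "\u0611", "\u0613"}
    # combining MADDAH ABOVE and HAMZA ABOVE
    | {"\u0653", "\u0654"}
)


def disallowed(label: str) -> list:
    """List the code points in `label` that are outside the allowed set."""
    return [f"U+{ord(ch):04X}" for ch in label if ch not in ALLOWED_FINAL]


print(disallowed("\u0628\u0633"))  # []
print(disallowed("\u0628\u0640"))  # ['U+0640'] (ARABIC TATWEEL)
```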



============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)
 
http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/
 

Received on Monday, 30 July 2007 20:09:30 UTC