- From: Richard Ishida <ishida@w3.org>
- Date: Mon, 30 Jul 2007 21:10:54 +0100
- To: <www-international@w3.org>, <public-iri@w3.org>
- Cc: "'Sarmad Hussain'" <sarmad.hussain@nu.edu.pk>
- Message-ID: <02dc01c7d2e5$bc0cd5d0$6501a8c0@rishida>
Sarmad Hussain, at the Center for Research in Urdu Language Processing, FAST National University, Pakistan, is looking at enabling Urdu IDNs based on ICANN recommendations, and this may lead to similar approaches in a number of other countries.

There are some aspects of Sarmad’s proposal, arising from the nature of the Arabic script used for Urdu, that raise some interesting questions about the way IDN works for this kind of language. These have to do with the choice of characters allowed in a domain name. For example, there is a suggestion that users should be able to use certain characters when writing a URI in Urdu which are then either removed (e.g. vowel diacritics) or converted to other characters (e.g. Arabic characters) during the conversion to punycode by a user agent plug-in.

This is not something that is normally relevant for English-only URIs, because of the relative simplicity of our alphabet. There is much more potential for ambiguity in the use of characters in Urdu. Note, however, that the proposals Sarmad is making are language-specific, not script-specific, i.e. Arabic or Persian (also written with the Arabic script) would need some slightly different rules.

I find myself wondering whether you could use a plug-in to strip out or convert the characters while converting to punycode. People typing IDNs in Urdu would need to be aware of the need for a plug-in, and would still need to know how to type in IDNs if they found themselves using a browser that didn’t have the plug-in (e.g. the businessman who is visiting a corporation in the US that prevents ad hoc downloads of software).

On the one hand, I wonder whether we can expect a user who sees a URI containing vowel diacritics on a hard copy brochure to know what to do if their browser or mail client doesn’t support the plug-in. On the other hand, a person writing a clickable URI in HTML or an email would not be able to guarantee that users would have access to the plug-in.
In that case, they would be unwise to use things like short vowel diacritics, since the user cannot easily change the link if they don’t have a plug-in. Imagine a vowelled IDN coming through in a plain text email, for example: the reader may need to edit the email text to get to the resource rather than just click on it. Not likely to be popular.

Another alternative is to do such removal and conversion of characters as part of the standard punycode conversion process. This, I suspect, would require every browser to have access to standardised tables of characters that should be ignored or converted for any language. But there is an additional problem in that the language would need to be determined correctly before such rules were applied - that is, the language of the original URI. That too seems a bit difficult.

So I can see the need, but I’m not sure what the solution would be. I’m inclined to think that creating a plug-in might create more trouble than benefit, by replacing the problems of errors and ambiguities with the problems of non-interoperable IDNs.

There is an Excel file attached that lists which characters in the Arabic block would be appropriate for Urdu IDNs. I will also list the characters below in a slightly different order.

ALLOWED

The following characters will be allowed in the IRI but removed before conversion to punycode. These characters are optional in Arabic script, though they can sometimes be useful for disambiguating pronunciation and meaning - particularly useful for Urdu, which has more vowel sounds than Arabic.
064B: ً ARABIC FATHATAN
064C: ٌ ARABIC DAMMATAN
064D: ٍ ARABIC KASRATAN
064E: َ ARABIC FATHA
064F: ُ ARABIC DAMMA
0650: ِ ARABIC KASRA
0651: ّ ARABIC SHADDA
0652: ْ ARABIC SUKUN
0655: ٕ ARABIC HAMZA BELOW
0656: ٖ ARABIC SUBSCRIPT ALEF
0658: ٘ ARABIC MARK NOON GHUNNA
0670: ٰ ARABIC LETTER SUPERSCRIPT ALEF
0612: ؒ ARABIC SIGN RAHMATULLAH ALAYHE
0614: ؔ ARABIC SIGN TAKHALLUS

Space and zero-width non-joiner characters will also be allowed, but removed during the conversion to punycode.

Some other characters used in Arabic but not Urdu will be allowed but will be converted to a character used in Urdu during conversion to punycode. They are included in the set of allowed characters, however, to avoid confusion when they are used incorrectly.

0629: ة ARABIC LETTER TEH MARBUTA
0643: ك ARABIC LETTER KAF
0649: ى ARABIC LETTER ALEF MAKSURA
064A: ي ARABIC LETTER YEH
0660: ٠ ARABIC-INDIC DIGIT ZERO
0661: ١ ARABIC-INDIC DIGIT ONE
0662: ٢ ARABIC-INDIC DIGIT TWO
0663: ٣ ARABIC-INDIC DIGIT THREE
0664: ٤ ARABIC-INDIC DIGIT FOUR
0665: ٥ ARABIC-INDIC DIGIT FIVE
0666: ٦ ARABIC-INDIC DIGIT SIX
0667: ٧ ARABIC-INDIC DIGIT SEVEN
0668: ٨ ARABIC-INDIC DIGIT EIGHT
0669: ٩ ARABIC-INDIC DIGIT NINE
06C0: ۀ ARABIC LETTER HEH WITH YEH ABOVE
0625: إ ARABIC LETTER ALEF WITH HAMZA BELOW

European digits are also mapped to the Urdu digits.
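To make the removal and mapping steps above concrete, here is a minimal Python sketch of what such a plug-in might do before the punycode conversion. It is only an illustration under stated assumptions, not Sarmad's implementation: the function names are hypothetical, the digit mappings assume that "Urdu digits" means the extended Arabic-Indic digits 06F0-06F9, and the three letter mappings are guesses, since the proposal text above does not say which Urdu character each Arabic character is converted to. The normal IDNA preparation steps are skipped entirely.

```python
# Hypothetical sketch of the "remove, then map, then punycode" idea.
# Not a conforming IDNA implementation: nameprep and validity checks are omitted.

# Code points listed above as allowed in the IRI but removed before conversion.
REMOVED = {
    0x064B, 0x064C, 0x064D, 0x064E, 0x064F, 0x0650, 0x0651, 0x0652,
    0x0655, 0x0656, 0x0658, 0x0670, 0x0612, 0x0614,
    0x0020,  # space
    0x200C,  # zero-width non-joiner
}

# Arabic-Indic and European digits mapped to the extended Arabic-Indic digits
# (assumption: these are the "Urdu digits" the mapping targets).
MAPPED = {0x0660 + i: 0x06F0 + i for i in range(10)}
MAPPED.update({0x0030 + i: 0x06F0 + i for i in range(10)})

# ASSUMED letter targets, for illustration only; the source gives no targets.
MAPPED.update({
    0x0643: 0x06A9,  # KAF -> KEHEH (assumption)
    0x064A: 0x06CC,  # YEH -> FARSI YEH (assumption)
    0x0649: 0x06CC,  # ALEF MAKSURA -> FARSI YEH (assumption)
})

# str.translate deletes code points mapped to None and replaces the others.
TABLE = {cp: None for cp in REMOVED}
TABLE.update(MAPPED)


def fold_urdu_label(label: str) -> str:
    """Strip optional marks and map Arabic variants to Urdu characters."""
    return label.translate(TABLE)


def to_ascii_label(label: str) -> str:
    """Fold a single label, then encode it with Python's punycode codec."""
    folded = fold_urdu_label(label)
    return "xn--" + folded.encode("punycode").decode("ascii")


if __name__ == "__main__":
    vowelled = "\u06A9\u0650\u062A\u0627\u0628"  # with KASRA
    plain = "\u06A9\u062A\u0627\u0628"           # without diacritics
    print(to_ascii_label(vowelled) == to_ascii_label(plain))
```

The point of the sketch is that the vowelled and unvowelled spellings only resolve to the same ASCII label where this table is applied; a user agent without it would encode the vowelled string as typed and arrive at a different label.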
The following will be permitted in the IRI, but decomposed before conversion to punycode:

FDF2: ﷲ ARABIC LIGATURE ALLAH ISOLATED FORM
FDF3: ﷳ ARABIC LIGATURE AKBAR ISOLATED FORM
FDF4: ﷴ ARABIC LIGATURE MOHAMMAD ISOLATED FORM
FDF5: ﷵ ARABIC LIGATURE SALAM ISOLATED FORM
FDF6: ﷶ ARABIC LIGATURE RASOUL ISOLATED FORM
FDF7: ﷷ ARABIC LIGATURE ALAYHE ISOLATED FORM
FDF8: ﷸ ARABIC LIGATURE WASALLAM ISOLATED FORM
FDF9: ﷹ ARABIC LIGATURE SALLA ISOLATED FORM
FDFA: ﷺ ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB: ﷻ ARABIC LIGATURE JALLAJALALOUHOU

Some combinations of diacritic and base character will be allowed:

0622: آ ARABIC LETTER ALEF WITH MADDA ABOVE
0623: أ ARABIC LETTER ALEF WITH HAMZA ABOVE
0624: ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
06C2: ۂ ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
06D3: ۓ ARABIC LETTER YEH BARREE WITH HAMZA ABOVE

The following mandatory honorific marks will be allowed:

0610: ؐ ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
0611: ؑ ARABIC SIGN ALAYHE ASSALLAM
0613: ؓ ARABIC SIGN RADI ALLAHOU ANHU

The following are basic alphabetic characters for Urdu, and will therefore be allowed:

0621: ء ARABIC LETTER HAMZA
0627: ا ARABIC LETTER ALEF
0628: ب ARABIC LETTER BEH
062A: ت ARABIC LETTER TEH
062B: ث ARABIC LETTER THEH
062C: ج ARABIC LETTER JEEM
062D: ح ARABIC LETTER HAH
062E: خ ARABIC LETTER KHAH
062F: د ARABIC LETTER DAL
0630: ذ ARABIC LETTER THAL
0631: ر ARABIC LETTER REH
0632: ز ARABIC LETTER ZAIN
0633: س ARABIC LETTER SEEN
0634: ش ARABIC LETTER SHEEN
0635: ص ARABIC LETTER SAD
0636: ض ARABIC LETTER DAD
0637: ط ARABIC LETTER TAH
0638: ظ ARABIC LETTER ZAH
0639: ع ARABIC LETTER AIN
063A: غ ARABIC LETTER GHAIN
0641: ف ARABIC LETTER FEH
0642: ق ARABIC LETTER QAF
0644: ل ARABIC LETTER LAM
0645: م ARABIC LETTER MEEM
0646: ن ARABIC LETTER NOON
0647: ه ARABIC LETTER HEH
0648: و ARABIC LETTER WAW
0679: ٹ ARABIC LETTER TTEH
067A: ٺ ARABIC LETTER TTEHEH
067B: ٻ ARABIC LETTER BEEH
067C: ټ ARABIC LETTER TEH WITH RING
067D: ٽ ARABIC LETTER TEH WITH THREE DOTS ABOVE DOWNWARDS
067E: پ ARABIC LETTER PEH
067F: ٿ ARABIC LETTER TEHEH
0680: ڀ ARABIC LETTER BEHEH
0681: ځ ARABIC LETTER HAH WITH HAMZA ABOVE
0682: ڂ ARABIC LETTER HAH WITH TWO DOTS VERTICAL ABOVE
0683: ڃ ARABIC LETTER NYEH
0684: ڄ ARABIC LETTER DYEH
0685: څ ARABIC LETTER HAH WITH THREE DOTS ABOVE
0686: چ ARABIC LETTER TCHEH
0687: ڇ ARABIC LETTER TCHEHEH
0688: ڈ ARABIC LETTER DDAL
0689: ډ ARABIC LETTER DAL WITH RING
068A: ڊ ARABIC LETTER DAL WITH DOT BELOW
068B: ڋ ARABIC LETTER DAL WITH DOT BELOW AND SMALL TAH
068C: ڌ ARABIC LETTER DAHAL
068D: ڍ ARABIC LETTER DDAHAL
068E: ڎ ARABIC LETTER DUL
068F: ڏ ARABIC LETTER DAL WITH THREE DOTS ABOVE DOWNWARDS
0690: ڐ ARABIC LETTER DAL WITH FOUR DOTS ABOVE
0691: ڑ ARABIC LETTER RREH
0692: ڒ ARABIC LETTER REH WITH SMALL V
0693: ړ ARABIC LETTER REH WITH RING
0694: ڔ ARABIC LETTER REH WITH DOT BELOW
0695: ڕ ARABIC LETTER REH WITH SMALL V BELOW
0696: ږ ARABIC LETTER REH WITH DOT BELOW AND DOT ABOVE
0697: ڗ ARABIC LETTER REH WITH TWO DOTS ABOVE
0698: ژ ARABIC LETTER JEH
0699: ڙ ARABIC LETTER REH WITH FOUR DOTS ABOVE
069A: ښ ARABIC LETTER SEEN WITH DOT BELOW AND DOT ABOVE
069B: ڛ ARABIC LETTER SEEN WITH THREE DOTS BELOW
069C: ڜ ARABIC LETTER SEEN WITH THREE DOTS BELOW AND THREE DOTS ABOVE
069D: ڝ ARABIC LETTER SAD WITH TWO DOTS BELOW
069E: ڞ ARABIC LETTER SAD WITH THREE DOTS ABOVE
069F: ڟ ARABIC LETTER TAH WITH THREE DOTS ABOVE
06A0: ڠ ARABIC LETTER AIN WITH THREE DOTS ABOVE
06A1: ڡ ARABIC LETTER DOTLESS FEH
06A2: ڢ ARABIC LETTER FEH WITH DOT MOVED BELOW
06A3: ڣ ARABIC LETTER FEH WITH DOT BELOW
06A4: ڤ ARABIC LETTER VEH
06A5: ڥ ARABIC LETTER FEH WITH THREE DOTS BELOW
06A6: ڦ ARABIC LETTER PEHEH
06A7: ڧ ARABIC LETTER QAF WITH DOT ABOVE
06A8: ڨ ARABIC LETTER QAF WITH THREE DOTS ABOVE
06A9: ک ARABIC LETTER KEHEH
06AA: ڪ ARABIC LETTER SWASH KAF
06AB: ګ ARABIC LETTER KAF WITH RING
06AC: ڬ ARABIC LETTER KAF WITH DOT ABOVE
06AD: ڭ ARABIC LETTER NG
06AE: ڮ ARABIC LETTER KAF WITH THREE DOTS BELOW
06AF: گ ARABIC LETTER GAF
06B0: ڰ ARABIC LETTER GAF WITH RING
06B1: ڱ ARABIC LETTER NGOEH
06B2: ڲ ARABIC LETTER GAF WITH TWO DOTS BELOW
06B3: ڳ ARABIC LETTER GUEH
06B4: ڴ ARABIC LETTER GAF WITH THREE DOTS ABOVE
06B5: ڵ ARABIC LETTER LAM WITH SMALL V
06B6: ڶ ARABIC LETTER LAM WITH DOT ABOVE
06B7: ڷ ARABIC LETTER LAM WITH THREE DOTS ABOVE
06B8: ڸ ARABIC LETTER LAM WITH THREE DOTS BELOW
06B9: ڹ ARABIC LETTER NOON WITH DOT BELOW
06BA: ں ARABIC LETTER NOON GHUNNA
06BB: ڻ ARABIC LETTER RNOON
06BC: ڼ ARABIC LETTER NOON WITH RING
06BD: ڽ ARABIC LETTER NOON WITH THREE DOTS ABOVE
06BE: ھ ARABIC LETTER HEH DOACHASHMEE
06C1: ہ ARABIC LETTER HEH GOAL
06C3: ۃ ARABIC LETTER TEH MARBUTA GOAL
06CC: ی ARABIC LETTER FARSI YEH
06D2: ے ARABIC LETTER YEH BARREE

The following combinations of base character and diacritic as a single character will also be allowed:

0622: آ ARABIC LETTER ALEF WITH MADDA ABOVE
0623: أ ARABIC LETTER ALEF WITH HAMZA ABOVE
0624: ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
06C2: ۂ ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
06D3: ۓ ARABIC LETTER YEH BARREE WITH HAMZA ABOVE

The following Urdu digits are allowed:

06F0: ۰ EXTENDED ARABIC-INDIC DIGIT ZERO
06F1: ۱ EXTENDED ARABIC-INDIC DIGIT ONE
06F2: ۲ EXTENDED ARABIC-INDIC DIGIT TWO
06F3: ۳ EXTENDED ARABIC-INDIC DIGIT THREE
06F4: ۴ EXTENDED ARABIC-INDIC DIGIT FOUR
06F5: ۵ EXTENDED ARABIC-INDIC DIGIT FIVE
06F6: ۶ EXTENDED ARABIC-INDIC DIGIT SIX
06F7: ۷ EXTENDED ARABIC-INDIC DIGIT SEVEN
06F8: ۸ EXTENDED ARABIC-INDIC DIGIT EIGHT
06F9: ۹ EXTENDED ARABIC-INDIC DIGIT NINE

NOT ALLOWED

The following are disallowed because they don't appear in plain text:

0600: ARABIC NUMBER SIGN
0601: ARABIC SIGN SANAH
060E: ؎ ARABIC POETIC VERSE SIGN
060F: ؏ ARABIC SIGN MISRA
0602: ARABIC FOOTNOTE MARKER
0603: ARABIC SIGN SAFHA
066D: ٭ ARABIC FIVE POINTED STAR

The following are disallowed because they are mathematical signs or, in the case of the tatweel, just stylistic:

066A: ٪ ARABIC PERCENT SIGN
066B: ٫ ARABIC DECIMAL SEPARATOR
066C: ٬ ARABIC THOUSANDS SEPARATOR
0640: ـ ARABIC TATWEEL

The following combining characters are allowed, since their absence or omission can change meaning:

0653: ٓ ARABIC MADDAH ABOVE
0654: ٔ ARABIC HAMZA ABOVE

The following are disallowed because they are punctuation:

061B: ؛ ARABIC SEMICOLON
061F: ؟ ARABIC QUESTION MARK
06D4: ۔ ARABIC FULL STOP
060C: ، ARABIC COMMA
060D: ؍ ARABIC DATE SEPARATOR

The remaining characters are not used for Urdu:

061E: ؞ ARABIC TRIPLE DOT PUNCTUATION MARK
060B: ؋ AFGHANI SIGN
0615: ؕ ARABIC SMALL HIGH TAH
0657: ٗ ARABIC INVERTED DAMMA
0659: ٙ ARABIC ZWARAKAY
065A: ٚ ARABIC VOWEL SIGN SMALL V ABOVE
065B: ٛ ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
065C: ٜ ARABIC VOWEL SIGN DOT BELOW
065D: ٝ ARABIC REVERSED DAMMA
065E: ٞ ARABIC FATHA WITH TWO DOTS
066E: ٮ ARABIC LETTER DOTLESS BEH
066F: ٯ ARABIC LETTER DOTLESS QAF
0671: ٱ ARABIC LETTER ALEF WASLA
0672: ٲ ARABIC LETTER ALEF WITH WAVY HAMZA ABOVE
0673: ٳ ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
0674: ٴ ARABIC LETTER HIGH HAMZA
0675: ٵ ARABIC LETTER HIGH HAMZA ALEF
0676: ٶ ARABIC LETTER HIGH HAMZA WAW
0677: ٷ ARABIC LETTER U WITH HAMZA ABOVE
0678: ٸ ARABIC LETTER HIGH HAMZA YEH
06BF: ڿ ARABIC LETTER TCHEH WITH DOT ABOVE
06C4: ۄ ARABIC LETTER WAW WITH RING
06C5: ۅ ARABIC LETTER KIRGHIZ OE
06C6: ۆ ARABIC LETTER OE
06C7: ۇ ARABIC LETTER U
06C8: ۈ ARABIC LETTER YU
06C9: ۉ ARABIC LETTER KIRGHIZ YU
06CA: ۊ ARABIC LETTER WAW WITH TWO DOTS ABOVE
06CB: ۋ ARABIC LETTER VE
06CD: ۍ ARABIC LETTER YEH WITH TAIL
06CE: ێ ARABIC LETTER YEH WITH SMALL V
06CF: ۏ ARABIC LETTER WAW WITH DOT ABOVE
06D0: ې ARABIC LETTER E
06D1: ۑ ARABIC LETTER YEH WITH THREE DOTS BELOW
06D5: ە ARABIC LETTER AE
06D6: ۖ ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
06D7: ۗ ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
06D8: ۘ ARABIC SMALL HIGH MEEM INITIAL FORM
06D9: ۙ ARABIC SMALL HIGH LAM ALEF
06DA: ۚ ARABIC SMALL HIGH JEEM
06DB: ۛ ARABIC SMALL HIGH THREE DOTS
06DC: ۜ ARABIC SMALL HIGH SEEN
06DD: ARABIC END OF AYAH
06DE: ۞ ARABIC START OF RUB EL HIZB
06DF: ۟ ARABIC SMALL HIGH ROUNDED ZERO
06E0: ۠ ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
06E1: ۡ ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
06E2: ۢ ARABIC SMALL HIGH MEEM ISOLATED FORM
06E3: ۣ ARABIC SMALL LOW SEEN
06E4: ۤ ARABIC SMALL HIGH MADDA
06E5: ۥ ARABIC SMALL WAW
06E6: ۦ ARABIC SMALL YEH
06E7: ۧ ARABIC SMALL HIGH YEH
06E8: ۨ ARABIC SMALL HIGH NOON
06E9: ۩ ARABIC PLACE OF SAJDAH
06EA: ۪ ARABIC EMPTY CENTRE LOW STOP
06EB: ۫ ARABIC EMPTY CENTRE HIGH STOP
06EC: ۬ ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
06ED: ۭ ARABIC SMALL LOW MEEM
06EE: ۮ ARABIC LETTER DAL WITH INVERTED V
06EF: ۯ ARABIC LETTER REH WITH INVERTED V
06FA: ۺ ARABIC LETTER SHEEN WITH DOT BELOW
06FB: ۻ ARABIC LETTER DAD WITH DOT BELOW
06FC: ۼ ARABIC LETTER GHAIN WITH DOT BELOW
06FD: ۽ ARABIC SIGN SINDHI AMPERSAND
06FE: ۾ ARABIC SIGN SINDHI POSTPOSITION MEN

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/
Attachments
- application/vnd.ms-excel attachment: Sample_Urdu_IDNInfo.xls
Received on Monday, 30 July 2007 20:09:30 UTC