- From: Debbie Garside <md@ictenterprise.co.uk>
- Date: Thu, 02 Aug 2007 09:02:39 -0400
- To: "'Richard Ishida'" <ishida@w3.org>, <www-international@w3.org>, <public-iri@w3.org>
- Cc: "'Sarmad Hussain'" <sarmad.hussain@nu.edu.pk>
Richard wrote:

> I’m inclined to think that creating a plug-in might
> create more trouble than benefit, by replacing the problems
> of errors and ambiguities with the problems of uninteroperable IDNs.

I agree. There is a current problem with plug-ins and IE6/IE7 for the Hangul IDN user community.

http://public.icann.org/forums/public-forum#comment-321

Debbie Garside

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org] On Behalf Of Richard Ishida
> Sent: 30 July 2007 21:11
> To: www-international@w3.org; public-iri@w3.org
> Cc: 'Sarmad Hussain'
> Subject: Urdu IDNs: Characters in domain names
>
> Sarmad Hussain, at the Center for Research in Urdu Language
> Processing FAST National University, Pakistan, is looking at
> enabling Urdu IDNs based on ICANN recommendations, but this
> may lead to similar approaches in a number of other countries.
>
> There are some aspects to Sarmad’s proposal, arising from the
> nature of the Arabic script used for Urdu, that raise some
> interesting questions about the way IDN works for this kind
> of language. These have to do with the choice of characters
> allowed in a domain name.
>
> For example, there is a suggestion that users should be able
> to use certain characters when writing a URI in Urdu which
> are then either removed (eg. vowel diacritics) or converted
> to other characters (eg. Arabic characters) during the
> conversion to punycode by a user agent plug-in.
>
> This is not something that is normally relevant for
> English-only URIs, because of the relative simplicity of our
> alphabet. There is much more potential ambiguity in Urdu for
> use of characters. Note, however, that the proposals Sarmad
> is making are language-specific, not script-specific, ie.
> Arabic or Persian (also written with the Arabic script) would
> need some slightly different rules.
>
> I find myself wondering whether you could use a plug-in to
> strip out or convert the characters while converting to
> punycode. People typing IDNs in Urdu would need to be aware
> of the need for a plug-in, and would still need to know how
> to type in IDNs if they found themselves using a browser that
> didn’t have the plug-in (eg. the businessman who is visiting
> a corporation in the US that prevents ad hoc downloads of
> software). On the one hand, I wonder whether we can expect a
> user who sees a URI on a hard copy brochure containing vowel
> diacritics to know what to do if their browser or mail client
> doesn’t support the plug-in. On the other hand, a person
> writing a clickable URI in HTML or an email would not be able
> to guarantee that users would have access to the plug-in. In
> that case, they would be unwise to use things like short
> vowel diacritics, since the user cannot easily change the
> link if they don’t have a plug-in. Imagine a vowelled IDN
> coming through in a plain text email, for example: the reader
> may need to edit the email text to get to the resource rather
> than just click on it. Not likely to be popular.
>
> Another alternative is to do such removal and conversion of
> characters as part of the standard punycode conversion
> process. This, I suspect, would necessitate every browser to
> have access to standardised tables of characters that should
> be ignored or converted for any language. But there is an
> additional problem in that the language would need to be
> determined correctly before such rules were applied - that
> is, the language of the original URI. That too seems a bit difficult.
>
> So I can see the need, but I’m not sure what the solution
> would be. I’m inclined to think that creating a plug-in might
> create more trouble than benefit, by replacing the problems
> of errors and ambiguities with the problems of uninteroperable IDNs.
>
> There is an Excel file attached that lists which characters
> in the Arabic block would be appropriate for Urdu IDNs. I
> will also list the characters below in a slightly different order.
>
> ALLOWED
>
> The following characters will be allowed in the IRI but removed
> before conversion to punycode.
>
> These characters are optional in Arabic script, though they
> can sometimes be useful for disambiguating pronunciation and
> meaning - particularly useful for Urdu, which has more vowel
> sounds than Arabic.
>
> 064B: ً ARABIC FATHATAN
> 064C: ٌ ARABIC DAMMATAN
> 064D: ٍ ARABIC KASRATAN
> 064E: َ ARABIC FATHA
> 064F: ُ ARABIC DAMMA
> 0650: ِ ARABIC KASRA
> 0651: ّ ARABIC SHADDA
> 0652: ْ ARABIC SUKUN
> 0655: ٕ ARABIC HAMZA BELOW
> 0656: ٖ ARABIC SUBSCRIPT ALEF
> 0658: ٘ ARABIC MARK NOON GHUNNA
> 0670: ٰ ARABIC LETTER SUPERSCRIPT ALEF
> 0612: ؒ ARABIC SIGN RAHMATULLAH ALAYHE
> 0614: ؔ ARABIC SIGN TAKHALLUS
>
> Space and zero-width non-joiner characters will also be
> allowed, but removed during the conversion to punycode.
>
> Some other characters used in Arabic but not Urdu will be
> allowed but will be converted to a character used in Urdu
> during conversion to punycode. They are included in the set
> of allowed characters, however, to avoid confusion when they
> are used incorrectly.
>
> 0629: ة ARABIC LETTER TEH MARBUTA
> 0643: ك ARABIC LETTER KAF
> 0649: ى ARABIC LETTER ALEF MAKSURA
> 064A: ي ARABIC LETTER YEH
> 0660: ٠ ARABIC-INDIC DIGIT ZERO
> 0661: ١ ARABIC-INDIC DIGIT ONE
> 0662: ٢ ARABIC-INDIC DIGIT TWO
> 0663: ٣ ARABIC-INDIC DIGIT THREE
> 0664: ٤ ARABIC-INDIC DIGIT FOUR
> 0665: ٥ ARABIC-INDIC DIGIT FIVE
> 0666: ٦ ARABIC-INDIC DIGIT SIX
> 0667: ٧ ARABIC-INDIC DIGIT SEVEN
> 0668: ٨ ARABIC-INDIC DIGIT EIGHT
> 0669: ٩ ARABIC-INDIC DIGIT NINE
> 06C0: ۀ ARABIC LETTER HEH WITH YEH ABOVE
> 0625: إ ARABIC LETTER ALEF WITH HAMZA BELOW
>
> European digits are also mapped to the Urdu digits.
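To make the strip-and-map step concrete, here is a rough Python sketch of what a user agent might do before punycode encoding. The removal set is taken from the list quoted above. The letter mapping targets (for example KAF to KEHEH) are my own assumptions, since the message does not say which Urdu character each Arabic character is converted to, and the function names are hypothetical. It also skips the Nameprep step that real IDNA processing applies, so treat it as an illustration rather than a working IDN implementation.

```python
# Sketch only. Removal set follows the quoted list; letter mapping
# targets are ASSUMED, not taken from the proposal.

# Optional marks allowed in the IRI but removed before encoding.
REMOVED = {
    0x064B, 0x064C, 0x064D, 0x064E, 0x064F, 0x0650, 0x0651, 0x0652,
    0x0655, 0x0656, 0x0658, 0x0670, 0x0612, 0x0614,
    0x0020,  # space
    0x200C,  # zero-width non-joiner
}

# Hypothetical conversion targets for Arabic letters not used in Urdu.
MAPPED = {
    0x0643: 0x06A9,  # KAF -> KEHEH (assumed target)
    0x064A: 0x06CC,  # YEH -> FARSI YEH (assumed target)
    0x0649: 0x06CC,  # ALEF MAKSURA -> FARSI YEH (assumed target)
}
# Arabic-Indic digits and European digits both map to the Urdu digits.
for i in range(10):
    MAPPED[0x0660 + i] = 0x06F0 + i
    MAPPED[ord("0") + i] = 0x06F0 + i

# str.translate deletes code points mapped to None.
TABLE = {cp: None for cp in REMOVED}
TABLE.update(MAPPED)


def normalize_label(label: str) -> str:
    """Drop the optional marks and convert the mapped characters."""
    return label.translate(TABLE)


def to_ace(label: str) -> str:
    """Normalize one label, then punycode-encode it with the xn-- prefix."""
    normalized = normalize_label(label)
    if normalized.isascii():
        return normalized  # nothing left to encode
    return "xn--" + normalized.encode("punycode").decode("ascii")


if __name__ == "__main__":
    vowelled = "\u0643\u0650\u062A\u0627\u0628"  # Arabic KAF + KASRA + TEH ALEF BEH
    plain = "\u06A9\u062A\u0627\u0628"           # same word typed with KEHEH, no vowel
    print(to_ace(vowelled) == to_ace(plain))
```

The point of the sketch is the interoperability problem Richard describes: the two spellings only resolve to the same encoded label if every client applies the same table.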
>
> The following will be permitted in the IRI, but decomposed
> before conversion to punycode:
>
> FDF2: ﷲ ARABIC LIGATURE ALLAH ISOLATED FORM
> FDF3: ﷳ ARABIC LIGATURE AKBAR ISOLATED FORM
> FDF4: ﷴ ARABIC LIGATURE MOHAMMAD ISOLATED FORM
> FDF5: ﷵ ARABIC LIGATURE SALAM ISOLATED FORM
> FDF6: ﷶ ARABIC LIGATURE RASOUL ISOLATED FORM
> FDF7: ﷷ ARABIC LIGATURE ALAYHE ISOLATED FORM
> FDF8: ﷸ ARABIC LIGATURE WASALLAM ISOLATED FORM
> FDF9: ﷹ ARABIC LIGATURE SALLA ISOLATED FORM
> FDFA: ﷺ ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
> FDFB: ﷻ ARABIC LIGATURE JALLAJALALOUHOU
>
> Some combinations of diacritic and base character will be allowed:
>
> 0622: آ ARABIC LETTER ALEF WITH MADDA ABOVE
> 0623: أ ARABIC LETTER ALEF WITH HAMZA ABOVE
> 0624: ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
> 06C2: ۂ ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
> 06D3: ۓ ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
>
> The following mandatory honorific marks will be allowed:
>
> 0610: ؐ ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
> 0611: ؑ ARABIC SIGN ALAYHE ASSALLAM
> 0613: ؓ ARABIC SIGN RADI ALLAHOU ANHU
>
> The following are basic alphabetic characters for Urdu, and
> will therefore be allowed:
>
> 0621: ء ARABIC LETTER HAMZA
> 0627: ا ARABIC LETTER ALEF
> 0628: ب ARABIC LETTER BEH
> 062A: ت ARABIC LETTER TEH
> 062B: ث ARABIC LETTER THEH
> 062C: ج ARABIC LETTER JEEM
> 062D: ح ARABIC LETTER HAH
> 062E: خ ARABIC LETTER KHAH
> 062F: د ARABIC LETTER DAL
> 0630: ذ ARABIC LETTER THAL
> 0631: ر ARABIC LETTER REH
> 0632: ز ARABIC LETTER ZAIN
> 0633: س ARABIC LETTER SEEN
> 0634: ش ARABIC LETTER SHEEN
> 0635: ص ARABIC LETTER SAD
> 0636: ض ARABIC LETTER DAD
> 0637: ط ARABIC LETTER TAH
> 0638: ظ ARABIC LETTER ZAH
> 0639: ع ARABIC LETTER AIN
> 063A: غ ARABIC LETTER GHAIN
> 0641: ف ARABIC LETTER FEH
> 0642: ق ARABIC LETTER QAF
> 0644: ل ARABIC LETTER LAM
> 0645: م ARABIC LETTER MEEM
> 0646: ن ARABIC LETTER NOON
> 0647: ه ARABIC LETTER HEH
> 0648: و ARABIC LETTER WAW
> 0679: ٹ ARABIC LETTER TTEH
> 067A: ٺ ARABIC LETTER TTEHEH
> 067B: ٻ ARABIC LETTER BEEH
> 067C: ټ ARABIC LETTER TEH WITH RING
> 067D: ٽ ARABIC LETTER TEH WITH THREE DOTS ABOVE DOWNWARDS
> 067E: پ ARABIC LETTER PEH
> 067F: ٿ ARABIC LETTER TEHEH
> 0680: ڀ ARABIC LETTER BEHEH
> 0681: ځ ARABIC LETTER HAH WITH HAMZA ABOVE
> 0682: ڂ ARABIC LETTER HAH WITH TWO DOTS VERTICAL ABOVE
> 0683: ڃ ARABIC LETTER NYEH
> 0684: ڄ ARABIC LETTER DYEH
> 0685: څ ARABIC LETTER HAH WITH THREE DOTS ABOVE
> 0686: چ ARABIC LETTER TCHEH
> 0687: ڇ ARABIC LETTER TCHEHEH
> 0688: ڈ ARABIC LETTER DDAL
> 0689: ډ ARABIC LETTER DAL WITH RING
> 068A: ڊ ARABIC LETTER DAL WITH DOT BELOW
> 068B: ڋ ARABIC LETTER DAL WITH DOT BELOW AND SMALL TAH
> 068C: ڌ ARABIC LETTER DAHAL
> 068D: ڍ ARABIC LETTER DDAHAL
> 068E: ڎ ARABIC LETTER DUL
> 068F: ڏ ARABIC LETTER DAL WITH THREE DOTS ABOVE DOWNWARDS
> 0690: ڐ ARABIC LETTER DAL WITH FOUR DOTS ABOVE
> 0691: ڑ ARABIC LETTER RREH
> 0692: ڒ ARABIC LETTER REH WITH SMALL V
> 0693: ړ ARABIC LETTER REH WITH RING
> 0694: ڔ ARABIC LETTER REH WITH DOT BELOW
> 0695: ڕ ARABIC LETTER REH WITH SMALL V BELOW
> 0696: ږ ARABIC LETTER REH WITH DOT BELOW AND DOT ABOVE
> 0697: ڗ ARABIC LETTER REH WITH TWO DOTS ABOVE
> 0698: ژ ARABIC LETTER JEH
> 0699: ڙ ARABIC LETTER REH WITH FOUR DOTS ABOVE
> 069A: ښ ARABIC LETTER SEEN WITH DOT BELOW AND DOT ABOVE
> 069B: ڛ ARABIC LETTER SEEN WITH THREE DOTS BELOW
> 069C: ڜ ARABIC LETTER SEEN WITH THREE DOTS BELOW AND THREE DOTS ABOVE
> 069D: ڝ ARABIC LETTER SAD WITH TWO DOTS BELOW
> 069E: ڞ ARABIC LETTER SAD WITH THREE DOTS ABOVE
> 069F: ڟ ARABIC LETTER TAH WITH THREE DOTS ABOVE
> 06A0: ڠ ARABIC LETTER AIN WITH THREE DOTS ABOVE
> 06A1: ڡ ARABIC LETTER DOTLESS FEH
> 06A2: ڢ ARABIC LETTER FEH WITH DOT MOVED BELOW
> 06A3: ڣ ARABIC LETTER FEH WITH DOT BELOW
> 06A4: ڤ ARABIC LETTER VEH
> 06A5: ڥ ARABIC LETTER FEH WITH THREE DOTS BELOW
> 06A6: ڦ ARABIC LETTER PEHEH
> 06A7: ڧ ARABIC LETTER QAF WITH DOT ABOVE
> 06A8: ڨ ARABIC LETTER QAF WITH THREE DOTS ABOVE
> 06A9: ک ARABIC LETTER KEHEH
> 06AA: ڪ ARABIC LETTER SWASH KAF
> 06AB: ګ ARABIC LETTER KAF WITH RING
> 06AC: ڬ ARABIC LETTER KAF WITH DOT ABOVE
> 06AD: ڭ ARABIC LETTER NG
> 06AE: ڮ ARABIC LETTER KAF WITH THREE DOTS BELOW
> 06AF: گ ARABIC LETTER GAF
> 06B0: ڰ ARABIC LETTER GAF WITH RING
> 06B1: ڱ ARABIC LETTER NGOEH
> 06B2: ڲ ARABIC LETTER GAF WITH TWO DOTS BELOW
> 06B3: ڳ ARABIC LETTER GUEH
> 06B4: ڴ ARABIC LETTER GAF WITH THREE DOTS ABOVE
> 06B5: ڵ ARABIC LETTER LAM WITH SMALL V
> 06B6: ڶ ARABIC LETTER LAM WITH DOT ABOVE
> 06B7: ڷ ARABIC LETTER LAM WITH THREE DOTS ABOVE
> 06B8: ڸ ARABIC LETTER LAM WITH THREE DOTS BELOW
> 06B9: ڹ ARABIC LETTER NOON WITH DOT BELOW
> 06BA: ں ARABIC LETTER NOON GHUNNA
> 06BB: ڻ ARABIC LETTER RNOON
> 06BC: ڼ ARABIC LETTER NOON WITH RING
> 06BD: ڽ ARABIC LETTER NOON WITH THREE DOTS ABOVE
> 06BE: ھ ARABIC LETTER HEH DOACHASHMEE
> 06C1: ہ ARABIC LETTER HEH GOAL
> 06C3: ۃ ARABIC LETTER TEH MARBUTA GOAL
> 06CC: ی ARABIC LETTER FARSI YEH
> 06D2: ے ARABIC LETTER YEH BARREE
>
> The following combinations of base character and diacritic as
> a single character will also be allowed:
>
> 0622: آ ARABIC LETTER ALEF WITH MADDA ABOVE
> 0623: أ ARABIC LETTER ALEF WITH HAMZA ABOVE
> 0624: ؤ ARABIC LETTER WAW WITH HAMZA ABOVE
> 06C2: ۂ ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
> 06D3: ۓ ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
>
> The following Urdu digits are allowed:
>
> 06F0: ۰ EXTENDED ARABIC-INDIC DIGIT ZERO
> 06F1: ۱ EXTENDED ARABIC-INDIC DIGIT ONE
> 06F2: ۲ EXTENDED ARABIC-INDIC DIGIT TWO
> 06F3: ۳ EXTENDED ARABIC-INDIC DIGIT THREE
> 06F4: ۴ EXTENDED ARABIC-INDIC DIGIT FOUR
> 06F5: ۵ EXTENDED ARABIC-INDIC DIGIT FIVE
> 06F6: ۶ EXTENDED ARABIC-INDIC DIGIT SIX
> 06F7: ۷ EXTENDED ARABIC-INDIC DIGIT SEVEN
> 06F8: ۸ EXTENDED ARABIC-INDIC DIGIT EIGHT
> 06F9: ۹ EXTENDED ARABIC-INDIC DIGIT NINE
>
>
> NOT ALLOWED
>
> The following are disallowed because they don't appear in
> plain text:
>
> 0600: ARABIC NUMBER SIGN
> 0601: ARABIC SIGN SANAH
> 060E: ؎ ARABIC POETIC VERSE SIGN
> 060F: ؏ ARABIC SIGN MISRA
> 0602: ARABIC FOOTNOTE MARKER
> 0603: ARABIC SIGN SAFHA
> 066D: ٭ ARABIC FIVE POINTED STAR
>
> The following are disallowed because they are mathematical
> signs or, in the case of the tatweel, just stylistic:
>
> 066A: ٪ ARABIC PERCENT SIGN
> 066B: ٫ ARABIC DECIMAL SEPARATOR
> 066C: ٬ ARABIC THOUSANDS SEPARATOR
> 0640: ـ ARABIC TATWEEL
>
> The following combining characters are allowed, since their
> absence or omission can change meaning:
>
> 0653: ٓ ARABIC MADDAH ABOVE
> 0654: ٔ ARABIC HAMZA ABOVE
>
> The following are disallowed because they are punctuation:
>
> 061B: ؛ ARABIC SEMICOLON
> 061F: ؟ ARABIC QUESTION MARK
> 06D4: ۔ ARABIC FULL STOP
> 060C: ، ARABIC COMMA
> 060D: ؍ ARABIC DATE SEPARATOR
>
> The remaining characters are not used for Urdu:
>
> 061E: ؞ ARABIC TRIPLE DOT PUNCTUATION MARK
> 060B: ؋ AFGHANI SIGN
> 0615: ؕ ARABIC SMALL HIGH TAH
> 0657: ٗ ARABIC INVERTED DAMMA
> 0659: ٙ ARABIC ZWARAKAY
> 065A: ٚ ARABIC VOWEL SIGN SMALL V ABOVE
> 065B: ٛ ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
> 065C: ٜ ARABIC VOWEL SIGN DOT BELOW
> 065D: ٝ ARABIC REVERSED DAMMA
> 065E: ٞ ARABIC FATHA WITH TWO DOTS
> 066E: ٮ ARABIC LETTER DOTLESS BEH
> 066F: ٯ ARABIC LETTER DOTLESS QAF
> 0671: ٱ ARABIC LETTER ALEF WASLA
> 0672: ٲ ARABIC LETTER ALEF WITH WAVY HAMZA ABOVE
> 0673: ٳ ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
> 0674: ٴ ARABIC LETTER HIGH HAMZA
> 0675: ٵ ARABIC LETTER HIGH HAMZA ALEF
> 0676: ٶ ARABIC LETTER HIGH HAMZA WAW
> 0677: ٷ ARABIC LETTER U WITH HAMZA ABOVE
> 0678: ٸ ARABIC LETTER HIGH HAMZA YEH
> 06BF: ڿ ARABIC LETTER TCHEH WITH DOT ABOVE
> 06C4: ۄ ARABIC LETTER WAW WITH RING
> 06C5: ۅ ARABIC LETTER KIRGHIZ OE
> 06C6: ۆ ARABIC LETTER OE
> 06C7: ۇ ARABIC LETTER U
> 06C8: ۈ ARABIC LETTER YU
> 06C9: ۉ ARABIC LETTER KIRGHIZ YU
> 06CA: ۊ ARABIC LETTER WAW WITH TWO DOTS ABOVE
> 06CB: ۋ ARABIC LETTER VE
> 06CD: ۍ ARABIC LETTER YEH WITH TAIL
> 06CE: ێ ARABIC LETTER YEH WITH SMALL V
> 06CF: ۏ ARABIC LETTER WAW WITH DOT ABOVE
> 06D0: ې ARABIC LETTER E
> 06D1: ۑ ARABIC LETTER YEH WITH THREE DOTS BELOW
> 06D5: ە ARABIC LETTER AE
> 06D6: ۖ ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
> 06D7: ۗ ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
> 06D8: ۘ ARABIC SMALL HIGH MEEM INITIAL FORM
> 06D9: ۙ ARABIC SMALL HIGH LAM ALEF
> 06DA: ۚ ARABIC SMALL HIGH JEEM
> 06DB: ۛ ARABIC SMALL HIGH THREE DOTS
> 06DC: ۜ ARABIC SMALL HIGH SEEN
> 06DD: ARABIC END OF AYAH
> 06DE: ۞ ARABIC START OF RUB EL HIZB
> 06DF: ۟ ARABIC SMALL HIGH ROUNDED ZERO
> 06E0: ۠ ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
> 06E1: ۡ ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
> 06E2: ۢ ARABIC SMALL HIGH MEEM ISOLATED FORM
> 06E3: ۣ ARABIC SMALL LOW SEEN
> 06E4: ۤ ARABIC SMALL HIGH MADDA
> 06E5: ۥ ARABIC SMALL WAW
> 06E6: ۦ ARABIC SMALL YEH
> 06E7: ۧ ARABIC SMALL HIGH YEH
> 06E8: ۨ ARABIC SMALL HIGH NOON
> 06E9: ۩ ARABIC PLACE OF SAJDAH
> 06EA: ۪ ARABIC EMPTY CENTRE LOW STOP
> 06EB: ۫ ARABIC EMPTY CENTRE HIGH STOP
> 06EC: ۬ ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
> 06ED: ۭ ARABIC SMALL LOW MEEM
> 06EE: ۮ ARABIC LETTER DAL WITH INVERTED V
> 06EF: ۯ ARABIC LETTER REH WITH INVERTED V
> 06FA: ۺ ARABIC LETTER SHEEN WITH DOT BELOW
> 06FB: ۻ ARABIC LETTER DAD WITH DOT BELOW
> 06FC: ۼ ARABIC LETTER GHAIN WITH DOT BELOW
> 06FD: ۽ ARABIC SIGN SINDHI AMPERSAND
> 06FE: ۾ ARABIC SIGN SINDHI POSTPOSITION MEN
>
>
> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
>
> http://www.w3.org/People/Ishida/
> http://www.w3.org/International/
> http://people.w3.org/rishida/blog/
> http://www.flickr.com/photos/ishida/
>
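On the ligatures in the FDF2 to FDFB range that the quoted list marks as "decomposed before conversion to punycode": these code points carry Unicode compatibility decompositions, so one plausible reading is that the decomposition could be done with compatibility normalization. That reading is an assumption on my part; the message does not say how the decomposition would be performed. A small Python sketch, with a hypothetical function name, restricted to just that range so other characters are left untouched:

```python
import unicodedata


def decompose_ligatures(label: str) -> str:
    """Expand only the ligatures U+FDF2..U+FDFB via compatibility decomposition."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if 0xFDF2 <= ord(ch) <= 0xFDFB else ch
        for ch in label
    )


if __name__ == "__main__":
    expanded = decompose_ligatures("\uFDF2")  # ARABIC LIGATURE ALLAH ISOLATED FORM
    # Show the resulting code points rather than relying on font rendering.
    print([f"{ord(c):04X}" for c in expanded])
```

Some of the longer ligatures expand to sequences that contain spaces, which the proposal would then remove along with the other optional characters.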
Received on Thursday, 2 August 2007 13:02:47 UTC