Re: URIs and i18n

> I think you should register two domain names, one in native
> script and one in ascii only, if you want your URIs to be used
> internationally as well as by your home market/user base.

In the case of domain name registrations, please keep in mind that a 
native script registration is stored inside domain name registry systems in 
an ASCII-compatible (Punycode) form, and is shown in a user-unfriendly 
format with a prefix of <xn-->.
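
As a rough illustration, Python's built-in "idna" codec (which follows the 
older IDNA 2003 rules) shows what the stored form looks like.  The domain 
bücher.example and the two helper names are made up for this example, so 
treat it as a sketch and not as how any particular registry does it:

```python
def to_ascii(domain: str) -> str:
    """Convert a native-script domain to its ASCII (xn--) form."""
    # The stdlib "idna" codec encodes each dot-separated label separately.
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain back to native script."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(to_ascii("bücher.example"))           # xn--bcher-kva.example
    print(to_unicode("xn--bcher-kva.example"))  # bücher.example
```

Only the non-ASCII label changes; the plain ASCII label passes through 
untouched.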

Registering two domain names, one in the native script and another in ASCII 
only, is a useful trick; it will become even more useful when domain names 
move from <native script label>.<ASCII domain> to <native script 
label>.<native script domain> (otherwise called IDN TLDs).

Use of localized labels in email poses greater challenges, since automatic 
downgrades to ASCII are required when the original string is sent in a local 
script.  That's a discussion for another day.

One of the biggest concerns is script mixing, where ASCII and several 
local scripts get intermingled in an IRI.  In my opinion, this is quite a 
bad thing: it leads to a great deal of user confusion and creates potential 
for phishing.  It's one of the biggest things that should be explicitly 
restricted (a few languages exist where script mixing is required, but these 
are finite and definable as exceptions).
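
To make the idea concrete, here is a minimal Python sketch of such a check.  
It is only an approximation: Python's standard library has no script 
property, so the script is guessed from the first word of each character's 
Unicode name, and the function names and the ALLOWED_COMBINATIONS list are 
hypothetical placeholders for whatever exception list a real policy would 
define.

```python
import unicodedata

# Hypothetical exception list: script combinations that a policy might
# accept because the language genuinely needs them (e.g. Japanese).
ALLOWED_COMBINATIONS = [frozenset({"CJK", "HIRAGANA", "KATAKANA"})]


def scripts_in_label(label: str) -> set:
    """Guess the scripts used by the letters of a label.

    Heuristic: take the first word of the Unicode character name,
    e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC". Digits, hyphens and
    other non-letters are ignored.
    """
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_suspicious(label: str) -> bool:
    """True if the label mixes scripts outside the allowed combinations."""
    scripts = scripts_in_label(label)
    if len(scripts) <= 1:
        return False
    return not any(scripts <= combo for combo in ALLOWED_COMBINATIONS)


if __name__ == "__main__":
    print(is_suspicious("paypal"))       # all Latin
    print(is_suspicious("p\u0430ypal"))  # Cyrillic "a" hidden among Latin
    print(is_suspicious("日本語のテスト"))  # allowed Japanese combination
```

The second label looks identical to the first on screen, which is exactly 
the phishing risk; the third shows how a defined exception can pass.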


----- Original Message ----- 
From: "Richard Ishida" <>
To: <>
Cc: "'Jean-Guilhem Rouel'" <>
Sent: Tuesday, July 29, 2008 7:45 AM
Subject: RE: URIs and i18n

> I think it is important to note that Jean-Gui plans (if I correctly
> understood his plans from an off-line discussion I had with him) to allow
> people to enter their name into his application in native script, but will
> provide an additional field for them to input an ascii-only version of the
> name that will be used in the URIs.
> Jean-Gui, perhaps it would help to give some examples of how the URIs
> (specifically) will be used.  My understanding is that people speaking
> various different languages will be using your application and will need
> to read or type a URI on a regular basis.  This would cause difficulties
> for the use of IRIs - a person who is not able to type/read Japanese will
> have difficulty dealing with a URI such as
> One question in my mind is why people should be reading/writing the URI,
> rather than using the application's power to automatically
> obtain/present/package the needed information.
> However, even if it were possible for the application to hide the URIs and
> present people's names to users, many people would still struggle with
> information that only presented a person's name in their native script,
> and would want to see the name in a script that they recognize.
> Transcribing on a language by language basis into every possible/sensible
> script would be a very tall order, and one that I think is not needed -
> especially for technology related applications. I think ascii characters
> are a reasonable choice for a single, universal script.
> The question then becomes how to achieve an ascii version of a person's
> name.  Again, I think it is a tough challenge to achieve this
> computationally, when each language differs in the way you transcribe to
> ascii, and there are usually multiple possible transcription methods for a
> given language.  A person may also have an idiosyncratic way of spelling
> their name, which should be allowed for.  I also think that just stripping
> accents can lead to clashes of previously differentiated names, to
> unfortunate spellings, and so on - not to mention the fact that it doesn't
> help at all for Arabic, Chinese, Armenian, etc, etc.
> So I think that asking people, when they supply their name, to provide it
> both in native format and in their preferred ascii-only spelling is
> probably the best way. Then both forms of the name should be available
> whenever a name is used.  This means that if people are expected to
> read/write URIs, the IRI and the ascii-only URIs should be equivalent.
> Note that this can also be extended to postal addresses, company names,
> etc.  People should be able to choose to look up, say, a Russian company
> address in either local (Cyrillic) or international (Latin) formats, so
> you need to collect and store both forms.
> This also ties in with what I've been saying for years now about the
> general use of IRIs.  I think you should register two domain names, one in
> native script and one in ascii only, if you want your URIs to be used
> internationally as well as by your home market/user base.  I certainly
> encourage the use of IRIs for local use, but there needs to be an
> alternative for others if your URI is exposed to other
> cultural/linguistic groups.
> RI
> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
>> -----Original Message-----
>> From: [mailto:www-international-
>>] On Behalf Of Jean-Guilhem Rouel
>> Sent: 28 July 2008 19:03
>> To:
>> Subject: URIs and i18n
>> Hi,
>> I am currently writing a webapp which has URIs of the form
>> where fran-ois is a first
>> name, berl-and a family name and '-' replaces non-ASCII characters. So
>> fran-ois.berl-and could represent someone named François Berléand.
>> Now, I would like to have more "beautiful" URIs, like
>> I am wondering if there's a standard or something defining how to
>> "translate" non-ASCII characters to ASCII ones, be they French special
>> chars, Japanese ones or anything else.
>> If not, is it wise to try to do such a transcription? I don't know if
>> that makes a difference but the tool will be targeted at
>> English-speaking people (but the names can be from any culture).
>> Another possibility would be to use IRIs, but most people would end up
>> having difficulties typing them and that would make them harder to
>> remember.
>> Finally, I could let users choose their ASCII-only URI. I think that's
>> what I'm going to do as it's easier for me and the least likely to
>> offend people, but I would have liked to get your feedback on this topic.
>> Thanks,
>> Jean-Gui

Received on Tuesday, 29 July 2008 15:56:25 UTC