RE: URIs and i18n

From: Phillips, Addison <addison@amazon.com>
Date: Mon, 28 Jul 2008 11:46:01 -0700
To: "www-international@w3.org" <www-international@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA014A0F5D9F@EX-SEA5-D.ant.amazon.com>
This is a pretty common problem. URIs are supposed to be human-friendly, so you'd expect to render names and such accurately. No one wants their ID to be some randomly assigned 19-digit hex number. And many technologies now include RESTful interfaces that rely on transmitting recognizable strings in the URI.

Another issue is search: search engines pluck words from URIs, but only if they recognize them. In many cases, search engine optimizations do "strip accents" to improve search accuracy.

On the other hand...

Removing accents is asking people to misspell their names. In some languages the "misspelling" isn't too awful and users kind of understand it. But as you move around the world, you find increasingly that it is annoying or just not possible. You can strip accents in French perhaps, but Cyrillic? Arabic? Greek? These languages just aren't addressed and users have to pick an ASCII alias. Even in "Latin-script" languages, you'll find that some languages, names, or ethnic groups are seriously disadvantaged by the "remove accents" approach.
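
To make the limits of "strip accents" concrete, here is a minimal Python sketch. The helper names strip_accents and ascii_or_none are hypothetical and not taken from any standard. It decomposes each character and drops the combining marks, which works for French but leaves Cyrillic, Japanese, and even some Latin letters (such as the Polish Ł, which has no decomposition) untouched.

```python
import unicodedata

def strip_accents(name: str) -> str:
    """Decompose to NFD and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def ascii_or_none(name: str):
    """Return the accent-stripped form only if it is pure ASCII, else None."""
    stripped = strip_accents(name)
    return stripped if stripped.isascii() else None

print(ascii_or_none("François Berléand"))  # accents removed, ASCII result
print(ascii_or_none("Łukasz"))             # None: Ł has no decomposition
print(ascii_or_none("佐藤龍一"))             # None: nothing to strip
```

A None result is the case where the user would have to be asked for an ASCII alias rather than handed a mangled one.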

So I tend to think that the best choice is to use IRIs as a *starting* point. Modern browsers support them in the address bar and people who share a language or writing system can almost always deal with the "typing problem": if you're French, typing François is no big deal (you might even have a little trouble remembering NOT to type the c-cedilla).
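
For illustration, an IRI path can be carried over the wire as a plain URI by percent-encoding its UTF-8 bytes. The sketch below uses Python's urllib.parse.quote; the function name iri_path_to_uri_path is hypothetical, and normalizing to NFC first is an assumption about what a site would want so that two ways of typing the same name yield the same URI.

```python
import unicodedata
from urllib.parse import quote, unquote

def iri_path_to_uri_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of an IRI path (hypothetical helper)."""
    # Assumption: normalize to NFC so precomposed and decomposed input match.
    normalized = unicodedata.normalize("NFC", path)
    # quote() leaves ASCII letters, digits, "_.-~" and "/" as they are.
    return quote(normalized, safe="/")

uri_path = iri_path_to_uri_path("/users/françois.berléand")
print(uri_path)           # /users/fran%C3%A7ois.berl%C3%A9and
print(unquote(uri_path))  # back to the readable form
```

The browser shows the readable form in the address bar, while the encoded form is what the server actually receives.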

If you're German, it might be a bigger issue, since you don't have that key on your keyboard. So you can provide an algorithm for "mutating" the ID. But you might want to allow users to select their own ASCII-only ID (perhaps offering some starting suggestions). So a Japanese person might want to be known as "佐藤龍一" but to non-Japanese colleagues as "Ryuichi Sato". Allowing users to choose here makes people happier.

And, of course, you have to think of abuse. You are running a namespace, so you'll want to do things to prevent people from doing bad things such as registering a name that looks identical to another name and thus impersonating someone. For example, using wide-ASCII compatibility characters in Unicode to register a name that is "taken" in plain ASCII. Or like this Cyrillic name: ОВАМА. There is a Unicode Technical Report on this (UTR#39 http://www.unicode.org/reports/tr39/) and the IDNA RFCs deal with this issue also. It's something to ponder before opening up your namespace.
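
As a rough sketch of the two spoofing cases just mentioned, the Python below folds compatibility characters with NFKC before comparing names and makes a crude guess at the scripts a name uses. The function names are hypothetical, and guessing the script from the first word of the Unicode character name is only a heuristic, since Python's standard library does not expose the Script property. Note that the all-Cyrillic example is not caught by normalization at all; detecting it properly needs confusables data of the kind described in the Unicode report cited above, which this sketch does not implement.

```python
import unicodedata

def canonical_key(name: str) -> str:
    """Fold compatibility forms (e.g. fullwidth letters) and case."""
    return unicodedata.normalize("NFKC", name).casefold()

def scripts_used(name: str) -> set:
    """Crude script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in name if ch.isalpha()}

def is_taken(candidate: str, registered: set) -> bool:
    """True if the candidate collides with a registered name after folding."""
    return canonical_key(candidate) in {canonical_key(r) for r in registered}

registered = {"obama"}
fullwidth = "\uff2f\uff22\uff21\uff2d\uff21"  # wide-ASCII "OBAMA"
cyrillic = "\u041e\u0412\u0410\u041c\u0410"   # Cyrillic look-alike

print(is_taken(fullwidth, registered))  # True: NFKC folds it to ASCII
print(is_taken(cyrillic, registered))   # False: normalization cannot help
print(scripts_used(cyrillic))           # {'CYRILLIC'}
```

A registration policy could, for example, reject names that mix scripts and send single-script look-alikes to a stricter confusables check.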

Hope that helps,

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org] On Behalf Of Richard Ishida
> Sent: Monday, July 28, 2008 11:18 AM
> To: www-international@w3.org
> Subject: RE: URIs and i18n
>
>
> > Another possibility would be to use IRIs, but most people would end-up
> > having difficulties typing them and that would make them harder to
> > remember.
>
> Well, people who don't write or speak the language of the person
> identified, that is.  (And maybe we should exclude a bunch of people
> like those on this list.)
>
> This is an interesting question.  I guess Jean-Gui is looking for URIs
> that can be read/written/recognized/remembered by an international
> audience, which is why he feels he needs to use ASCII only, rather than
> IRIs.
>
> Any thoughts on the matter?
>
> RI
>
> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
>
> http://www.w3.org/International/
> http://rishida.net/
>
>
>
> > -----Original Message-----
> > From: www-international-request@w3.org
> > [mailto:www-international-request@w3.org] On Behalf Of Jean-Guilhem Rouel
> > Sent: 28 July 2008 19:03
> > To: www-international@w3.org
> > Subject: URIs and i18n
> >
> >
> > Hi,
> >
> > I am currently writing a webapp which has URIs of the form
> > http://example.org/users/fran-ois.berl-and where fran-ois is a first
> > name, berl-and a family name and '-' replaces non-ASCII characters. So
> > fran-ois.berl-and could represent someone named François Berléand.
> >
> > Now, I would like to have more "beautiful" URIs, like
> > http://example.org/users/francois.berleand.
> >
> > I am wondering if there's a standard or something defining how to
> > "translate" non-ASCII characters to ASCII ones, be they French special
> > chars, Japanese ones or anything else.
> >
> > If not, is it wise to try to do such a transcription? I don't know if
> > that makes a difference but the tool will be targeted at
> > English-speaking people (but the names can be from any culture).
> >
> > Another possibility would be to use IRIs, but most people would end-up
> > having difficulties typing them and that would make them harder to
> > remember.
> >
> > Finally, I could let users choose their ASCII-only URI. I think that's
> > what I'm going to do as it's easier for me and the least likely to
> > offend people, but I would have liked to get your feedback on this
> > topic.
> >
> > Thanks,
> > Jean-Gui
>
>

Received on Monday, 28 July 2008 18:46:45 GMT
