Re: Language ranges with more than two sub-tag from Norbert Lindenberg on 2013-03-05 (www-international@w3.org from January to March 2013)

From: Norbert Lindenberg <w3@norbertlindenberg.com>
Date: Mon, 4 Mar 2013 22:42:00 -0800
To: Marcos Caceres <w3c@marcosc.com>
Cc: Norbert Lindenberg <w3@norbertlindenberg.com>, www-international <www-international@w3.org>, Addison Phillips <addison@lab126.com>
Message-Id: <DFA82AFE-96A5-4DA1-ACB7-D17B50E5E5A9@norbertlindenberg.com>
Hi Marcos,

You shouldn't just be looking at current use of language tags with more than language and country, but also at use cases where developers use private hacks for which BCP 47 offers clean, interoperable alternatives. (It could be that these developers don't know about BCP 47 yet, or that they're using old infrastructure that doesn't support it yet.)

Examples:

- The web site of the Hong Kong government, http://www.gov.hk, provides content in both traditional and simplified Chinese - note the 繁體 and 简体 in the upper right corner. People used to use zh-HK (or zh-TW) to mean traditional Chinese, but that obviously doesn't work here anymore. The web site uses its own non-standard tags, tc and sc. Instead, they should be using the BCP 47 tags zh-Hant and zh-Hans (or zh-Hant-HK and zh-Hans-HK).

- The web site of Sinovision, a Chinese-language broadcaster in the US, http://www.sinovision.net. If you click on its 繁体 link, you get to a page with "big5" in its URL. Obviously, big5 is not a language - they should be using zh-Hant.

- Facebook locale IDs [1]: Their basic format is "ll_CC", with language code and country code. But then they have ar_AR and es_LA for generic Arabic and Spanish. Country code AR is actually Argentina, and LA is Laos, but who cares. BCP 47 would let them use simply ar and es, or es-419 if they mean Latin American Spanish (UN M.49 codes for multi-country regions are another addition in BCP 47). Facebook also made up countries PI (Piratistan?), UD (Upside Downia?), EO (Esperantia?), as well as a locale fb_LT (Facebookian as used in Lithuania? No, leetspeak). All these language variants could be expressed cleanly in BCP 47 using variants or private-use subtags, without the need for made-up countries.
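
As a rough illustration of the examples above, the ad-hoc identifiers could be translated to BCP 47 tags with a small table. This is only a sketch: the names LEGACY_TO_BCP47 and toBcp47 are hypothetical, and the en-x-leet tag for fb_LT is an assumed private-use spelling chosen for illustration, not something defined by Facebook or by BCP 47.

```typescript
// Hypothetical table mapping the non-standard identifiers mentioned above
// to the BCP 47 tags suggested in this message.
const LEGACY_TO_BCP47 = new Map<string, string>([
  ["tc", "zh-Hant"],      // gov.hk "traditional Chinese"
  ["sc", "zh-Hans"],      // gov.hk "simplified Chinese"
  ["big5", "zh-Hant"],    // an encoding name used as if it were a language
  ["ar_AR", "ar"],        // generic Arabic, not Argentina
  ["es_LA", "es-419"],    // Latin American Spanish (UN M.49 region 419), not Laos
  ["fb_LT", "en-x-leet"], // assumption: leetspeak expressed as a private-use subtag
]);

// Returns a BCP 47 tag for a legacy identifier. Unknown identifiers in the
// "ll_CC" form are only converted to use hyphens; they are not validated.
function toBcp47(legacyId: string): string {
  const mapped = LEGACY_TO_BCP47.get(legacyId);
  if (mapped !== undefined) {
    return mapped;
  }
  return legacyId.replace(/_/g, "-");
}

console.log(toBcp47("es_LA")); // es-419
console.log(toBcp47("pt_BR")); // pt-BR
```

A table like this would only be a migration aid; the cleaner fix is for the sites to emit the BCP 47 tags directly.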


A few comments on the internationalization model for apps:

- The use cases on your page [2] generally seem a bit low level. I think you need to look at the interactions between users (especially multilingual ones), applications (monolingual and multilingual), user agent (generally set to one language), libraries such as the ECMAScript Internationalization API (multilingual), and web services (monolingual or multilingual, sometimes in multiple tiers). How should all these components interact so as to enable a consistent experience within an application in one of the user's preferred languages?

- I'd advise against the concept of "unlocalized" content in an application. Typically this really means content in the original language of the developer or author, but such content does have a language. Not knowing which language that is often causes problems, for example, when screen readers try to read the text to a blind user - correct pronunciation depends on knowing the language. Your design should really push developers towards declaring the language(s) they use. (Obviously, there are strings that are designed to be language independent, such as ISO 8601 date strings or BCP 47 language tags, but these are generally intended for internal use within software, not for display/reading to users).

- I'd also advise against the assumption that users will typically use applications in the same language as they've set the OS to use. Many users are multilingual, and will use applications in multiple languages - the more so if their primary language is not one of the most commonly supported ones. They may also use a secondary language for the OS if their primary language is not supported. Output of the ECMAScript Internationalization API should generally match the language of the application, not the OS or user agent, so there needs to be a way to identify the language that the application has chosen for its user interface.

- When interpreting BCP 47 language tags, be careful in dealing with the extensions. In the main part of a language tag, later subtags are interpreted as specializations of earlier ones: E.g., zh-Hans-CN is Chinese written in simplified script as used in China. The commonly used Lookup algorithm can therefore chop off subtags from the end as a fallback strategy. This does not work for the Unicode extension: its keys define separate dimensions, and whether, say, a particular collation is supported has nothing to do with the calendar. That's why the ResolveLocale operation in the ECMAScript Internationalization spec treats the Unicode extension separately and with a different algorithm from the main part of the language tag. Also, there are some keys that really should not be part of the locale: Which currency to use is a business decision and should never be left to the locale. The time zone usually should depend on where the user is or where an event occurs, and not be derived from the locale.
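
To make the last point concrete, here is a minimal sketch of the two different treatments: truncation-based fallback (in the spirit of the Lookup algorithm in RFC 4647, section 3.4) for the main part of the tag, and independent per-key handling for the Unicode extension. It is not the ResolveLocale operation from the ECMAScript Internationalization spec; the function names, the supported-locale list, and the per-key support data are all assumptions made up for illustration.

```typescript
interface SplitTag {
  base: string;                  // tag without the Unicode extension, lower-cased
  keywords: Map<string, string>; // e.g. "co" -> "pinyin", "ca" -> "chinese"
}

// Separates the "-u-" extension from the rest of the tag. Attributes that
// precede the first key are ignored in this sketch.
function splitUnicodeExtension(tag: string): SplitTag {
  const base: string[] = [];
  const keywords = new Map<string, string>();
  let inUnicodeExt = false;
  let inPrivateUse = false;
  let key: string | null = null;
  for (const s of tag.toLowerCase().split("-")) {
    if (inPrivateUse) { base.push(s); continue; }
    if (s.length === 1) {
      // A singleton starts a new extension, or private use for "x".
      inUnicodeExt = s === "u";
      inPrivateUse = s === "x";
      key = null;
      if (!inUnicodeExt) base.push(s);
      continue;
    }
    if (!inUnicodeExt) { base.push(s); continue; }
    if (s.length === 2) {
      key = s;
      keywords.set(key, "");
    } else if (key !== null) {
      const prev = keywords.get(key) ?? "";
      keywords.set(key, prev === "" ? s : prev + "-" + s);
    }
  }
  return { base: base.join("-"), keywords };
}

// Fallback for the main part of the tag: chop subtags off the end until a
// supported tag matches. The Unicode extension is removed first.
function lookup(priorityList: string[], supported: string[], defaultTag: string): string {
  const byLower = new Map<string, string>();
  for (const t of supported) byLower.set(t.toLowerCase(), t);
  for (const range of priorityList) {
    if (range === "*") continue; // wildcard ranges are skipped in this sketch
    let parts = splitUnicodeExtension(range).base.split("-");
    while (parts.length > 0) {
      const hit = byLower.get(parts.join("-"));
      if (hit !== undefined) return hit;
      parts = parts.slice(0, -1);
      // A single-character subtag left at the end is removed as well.
      const last = parts[parts.length - 1];
      if (last !== undefined && last.length === 1) parts = parts.slice(0, -1);
    }
  }
  return defaultTag;
}

// Each extension key is a separate dimension, so each is checked on its own:
// an unsupported calendar does not affect whether the collation is kept.
function filterKeywords(
  requested: Map<string, string>,
  supportedValues: Map<string, string[]>
): Map<string, string> {
  const kept = new Map<string, string>();
  requested.forEach((value, key) => {
    const allowed = supportedValues.get(key);
    if (allowed !== undefined && allowed.indexOf(value) !== -1) kept.set(key, value);
  });
  return kept;
}

// Hypothetical application data.
const appLocales = ["en", "de", "zh-Hans", "zh-Hant"];
const requestedTag = "zh-Hans-CN-u-co-pinyin-ca-chinese";
console.log(lookup([requestedTag], appLocales, "en"));                // zh-Hans
console.log(splitUnicodeExtension(requestedTag).keywords.get("co")); // pinyin
```

The tag that comes out of such a resolution step is what the application would then pass explicitly to the Intl constructors, rather than relying on the OS or user agent default.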


[1] https://www.facebook.com/translations/FacebookLocales.xml, referenced from
https://developers.facebook.com/docs/internationalization/
[2] https://gist.github.com/marcoscaceres/5055717

Cheers,
Norbert


On Mar 1, 2013, at 10:03 , Phillips, Addison wrote:

>> 
>> I don't know if anyone here can help me, but what I'd really like to find is data
>> that shows what Accept-Language: values are being transferred over the wire. I
>> know that this will not be completely representative [x]. But, if I can show that
>> at least some people are, by default or not, being excluded, it could weigh
>> heavily towards swaying browser vendors.
>> 
>> [x] http://www.w3.org/International/questions/qa-accept-lang-locales
> 
> This article was written in *2003*, which is *before* the current BCP 47 came into use (2006). I'm appalled to find it still out there and I'm sure we'll revise it presently ;-).
> 
> If you want data indicating that "some users send three (or more) subtag language tags", all you need to do is open up Internet Explorer's options panel to language and scroll through the list. IE has generated A-L with three-subtag tags for a while now (not the Chinese ones, but quite a list of other languages).
> 
> Also, you should note that several locale systems infer additional subtags (particularly the Chinese ones) when they are not provided. 
> 
> Various browser vendors are slower to adopt additional language subtag combinations. However, a consistent set of locale identifiers and locale identification rules is called for here. A fallback system that just blithely skips over information that users have provided in their language ranges (particularly script information!) is profoundly unhelpful. Implementing the BCP 47 Lookup algorithm is only a minor elaboration on that.
> 
> However, I'll point out that JavaScript's Intl extension also allows for another model for matching a language priority list (which is what Accept-Language is). Most implementers who have worked with it consider it an improvement, and that's what I would recommend for Sysapps.
> 
> (In another message on this thread you said:)
> 
>> 2) there are two implementations of the JSON i18n model (Google packaged
>> web apps and Mozilla's packaged web apps), so it's kinda already a de facto
>> standard. The model used by Google and Mozilla is what the SysApps WG is
>> trying to standardise on (hopefully without breaking existing content).
> 
> There is an implementation of Widget spec that I am aware of: Amazon Kindle/Kindle Fire uses it. Although it is not a perfect model, it would be painful to change it.
> 
>> 
>> Agreed. It would certainly make sense to align where possible. However, I'll
>> need to ask Norbert for guidance on this, as I haven't fully grokked [3] yet.
>> 
> A better starting point for me to have given would probably have been:
> 
>   http://norbertlindenberg.com/2012/12/ecmascript-internationalization-api/index.html
> 
> ... where Norbert has put some text in English that is more grokable than the specification.
> 
> Addison
Received on Tuesday, 5 March 2013 06:42:34 UTC