Re: Language ranges with more than two sub-tag from Marcos Caceres on 2013-03-12 (www-international@w3.org from January to March 2013)

From: Marcos Caceres <w3c@marcosc.com>
Date: Tue, 12 Mar 2013 10:07:56 +0000
To: Norbert Lindenberg <w3@norbertlindenberg.com>
Cc: Addison Phillips <addison@lab126.com>, www-international <www-international@w3.org>
Message-ID: <142AB2D6328D46A890E55C65E995DA68@marcosc.com>
(sending this again, as it didn't end up in the archive… apologies if it does eventually show up here)


On Tuesday, 5 March 2013 at 12:36, Marcos Caceres wrote:

> Hi Norbert,  
>  
> On Tuesday, 5 March 2013 at 06:42, Norbert Lindenberg wrote:
>  
> > Hi Marcos,
> >  
> > You shouldn't just be looking at current use of language tags with more than language and country, but also at use cases where developers use private hacks for which BCP 47 offers clean, interoperable alternatives. (It could be that these developers don't know about BCP 47 yet or that they're using old infrastructure that doesn't support it yet).
>  
> Agreed.  
> > Examples:
> >  
> > - The web site of the Hong Kong government, http://www.gov.hk, provides content in both traditional and simplified Chinese - note the 繁體 and 简体 in the upper right corner. People used to use zh-HK (or zh-TW) to mean traditional Chinese, but that obviously doesn't work here anymore. The web site uses its own non-standard tags, tc and sc. Instead, they should be using the BCP 47 tags zh-Hant and zh-Hans (or zh-Hant-HK and zh-Hans-HK).
> >  
> > - The web site of Sinovision, a Chinese-language broadcaster in the US, http://www.sinovision.net. If you click on its 繁体 link, you get to a page with "big5" in its URL. Obviously, big5 is not a language - they should be using zh-Hant.
>  
> Right, but in the above cases no automated content negotiation is taking place. The user explicitly makes a choice there to view the content in a particular language or script by clicking a link. So, how the developer chooses to represent the organisation of their content through the URL doesn't affect end-users (though it may raise the eyebrows of i18n folks :)). If, on the other hand, the web application was expecting an HTTP "Accept-Language: big5" to perform content negotiation, then it would be a much bigger issue.
>  
> Hypothetically speaking, sinovision.net (http://sinovision.net) could have checked the Accept-Language to see which of zh-Hant or zh-Hans the user prefers and served them the preferred content (while also providing a link to swap to the other script choice). The resulting URL, although a little weird in the choice to use the name of the encoding "/big5/", doesn't really affect the end-user's ability to access the content.  
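A minimal sketch of the hypothetical negotiation described above, assuming a JavaScript runtime that provides `Intl.Locale` (which postdates this thread). The function names `parseAcceptLanguage` and `preferredChineseScript` are made up for illustration and are not taken from any real site or library.

```javascript
// Parse an Accept-Language header into language ranges, highest q-value first.
function parseAcceptLanguage(header) {
  return header
    .split(",")
    .map((part) => {
      const [tag, ...params] = part.trim().split(";");
      const qParam = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      const q = qParam === undefined ? 1 : Number.parseFloat(qParam.slice(2));
      return { tag: tag.trim(), q: Number.isNaN(q) ? 0 : q };
    })
    .filter((entry) => entry.tag !== "" && entry.q > 0)
    .sort((a, b) => b.q - a.q) // stable sort keeps header order for equal q
    .map((entry) => entry.tag);
}

// Return "zh-Hant" or "zh-Hans" for the most preferred Chinese range,
// or null when the header contains no Chinese range at all.
function preferredChineseScript(header) {
  for (const tag of parseAcceptLanguage(header)) {
    let locale;
    try {
      // maximize() adds the likely script, so a tag without one still resolves.
      locale = new Intl.Locale(tag).maximize();
    } catch (err) {
      continue; // "*" and malformed ranges are skipped
    }
    if (locale.language === "zh" && (locale.script === "Hant" || locale.script === "Hans")) {
      return `zh-${locale.script}`;
    }
  }
  return null;
}
```

A server could call `preferredChineseScript(request.headers["accept-language"])` once, serve the matching script, and still offer the manual 繁体/简体 link as an override.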
>  
> The case I am interested in is one that relates explicitly to content negotiation occurring between the user agent's default language and resources whose selection occurs as part of an automated process.  
> > - Facebook locale IDs [1]: Their basic format is "ll_CC", with language code and country code. But then they have ar_AR and es_LA for generic Arabic and Spanish. Country code AR is actually Argentina, and LA is Laos, but who cares. BCP 47 would let them use simply ar and es, or es-419 if they mean Latin American Spanish (UN M.49 codes for multi-country regions are another addition in BCP 47). Facebook also made up countries PI (Piratistan?), UD (Upside Downia?), EO (Esperantia?), as well as a locale fb_LT (Facebookian as used in Lithuania? No, leetspeak). All these language variants could be expressed cleanly in BCP 47 using variants or private-use subtags, without the need for made-up countries.
>  
>  
>  
> Understood. It's likely they just didn't know about BCP 47 and the nice organisational structures it provides. Language tags are extremely powerful, but can be tricky to grok (especially in a world where developers are constantly exposed to subtly different variants - like in Google's packaged apps, where they use an "_" instead of a "-" as the sub-tag separator).
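To make the "ll_CC" versus BCP 47 difference concrete, here is a rough sketch of a converter. The `ar` and `es-419` mappings are the ones Norbert suggests; the private-use tag for `fb_LT` and the names `OVERRIDES` and `toBcp47` are invented purely for illustration and are not anything Facebook or Google actually use.

```javascript
// Overrides for identifiers whose "country" part is not a real region.
const OVERRIDES = new Map([
  ["ar_AR", "ar"],        // generic Arabic (from the thread)
  ["es_LA", "es-419"],    // Latin American Spanish via UN M.49 (from the thread)
  ["fb_LT", "en-x-leet"], // hypothetical private-use tag for leetspeak
]);

// Convert an underscore-separated locale ID into a canonical BCP 47 tag.
function toBcp47(localeId) {
  const tag = OVERRIDES.get(localeId) ?? localeId.replace(/_/g, "-");
  // getCanonicalLocales fixes casing and throws RangeError on malformed tags.
  return Intl.getCanonicalLocales(tag)[0];
}
```

Anything not in the override table simply has its separator swapped and its casing normalised, so `en_US` comes out as `en-US`.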
>  
> I guess what the above shows is that this is currently really hard for developers to understand - even really good ones, as I assume the ones at Facebook are. <More education needed here> :)
> > A few comments on the internationalization model for apps:
> >  
> > - The use cases on your page [2] generally seem a bit low level. I think you need to look at the interactions between users (especially multilingual ones), applications (monolingual and multilingual), user agent (generally set to one language), libraries such as the ECMAScript Internationalization API (multilingual), and web services (monolingual or multilingual, sometimes in multiple tiers). How should all these components interact so as to enable a consistent experience within an application in one of the user's preferred languages?
>  
> Agreed. This is exactly what I'm trying to do :) However, if I can't convince implementers that there are users out there using browsers whose language preferences have more than two sub-tags (e.g. [1]), this is going to be a struggle. If there is commitment from browser vendors to implement the ECMAScript Internationalization API, then that's a lot of infrastructure that can be leveraged to provide technical answers to the questions above.
>  
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=846269
>  
> > - I'd advise against the concept of "unlocalized" content in an application.
> True. This is a misnomer on my part - agree completely with what you say below.  
> > Typically this really means content in the original language of the developer or author, but such content does have a language. Not knowing which language that is often causes problems, for example, when screen readers try to read the text to a blind user - correct pronunciation depends on knowing the language. Your design should really push developers towards declaring the language(s) they use. (Obviously, there are strings that are designed to be language independent, such as ISO 8601 date strings or BCP 47 language tags, but these are generally intended for internal use within software, not for display/reading to users).
>  
>  
>  
> Agreed. Although FirefoxOS does not require the "default_locale" to be declared for an application to run, Mozilla does force the "default_locale" to be present in the manifest for when applications are listed in their store for some of the reasons above.  
>  
> Forcing developers to declare the language at the UA level (by some draconian method)  
> > - I'd also advise against the assumption that users will typically use applications in the same language as they've set the OS to use.
>  
>  
>  
> Right. This is not an assumption I am making.  
> > Many users are multilingual, and will use applications in multiple languages - the more so if their primary language is not one of the most commonly supported ones. They may also use a secondary language for the OS if their primary language is not supported. Output of the ECMAScript Internationalization API should generally match the language of the application, not the OS or user agent, so there needs to be a way to identify the language that the application has chosen for its user interface.
>  
>  
>  
> This is kinda problematic given FxOS's current i18n model in that there may be more than one language used per application (or, as already discussed, there may not be a default language identified at all). Consider:
>  
> {
>   "name": "The foo app!",
>   "locales": {
>     "en-US": {"description": "The foo app…."},
>     "en": {"developer": "me"}
>   },
>   "developer": "我",
>   "default_locale": "zh-Hans"
> }
>  
> In the above case, the app's name is chosen from the "zh-Hans" localised content, while the description is only chosen for users of "en-US", and the developer can be in either language. This mixing and matching yields interesting results and, when used well, can avoid needing to repeat some data.
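The per-field mixing can be sketched roughly as follows. This is only an approximation of the behaviour described above, not FxOS's actual implementation: `fallbackChain` and `resolveField` are hypothetical helpers, and locale keys are matched exactly as written in the manifest.

```javascript
const manifest = {
  name: "The foo app!",
  locales: {
    "en-US": { description: "The foo app…." },
    en: { developer: "me" },
  },
  developer: "我",
  default_locale: "zh-Hans",
};

// Yield a tag and its progressively truncated forms, e.g. "en-US" then "en".
function* fallbackChain(tag) {
  const subtags = tag.split("-");
  while (subtags.length > 0) {
    yield subtags.join("-");
    subtags.pop();
    // Also drop a dangling singleton such as "u" or "x".
    if (subtags.length > 0 && subtags[subtags.length - 1].length === 1) subtags.pop();
  }
}

// Pick one field: first localised value found for the user's languages,
// otherwise the top-level value (which is in default_locale).
function resolveField(manifest, userLanguages, field) {
  for (const language of userLanguages) {
    for (const candidate of fallbackChain(language)) {
      const localized = manifest.locales?.[candidate];
      if (localized && Object.prototype.hasOwnProperty.call(localized, field)) {
        return localized[field];
      }
    }
  }
  return manifest[field];
}
```

For a user with `["en-US"]`, the description comes from "en-US", the developer from "en", and the name from the top-level "zh-Hans" content, which is exactly the mix described above.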
>  
> Given that, what could be exposed through an interface to the JS environment is the list of languages that were used in the selection process, in order. So, given a UA language preference of:
>  
> "en-US, af, en-AU, jp"  
>  
> And an application manifest's content being available in:
>  
> "en, zh-Hans, jp" - where jp is the default locale.  
>  
> Then the resulting "application.lang", after lookup is applied, would be ["en","jp"] (representing the language tags from which content could have *potentially* been chosen, in order - and always includes the default locale). I say potentially because the author may have provided a complete localisation of all the content for each matching language tag; hence, "jp" localised content might not have been used at all.
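One way the "application.lang" list could be computed is sketched below, using the tags exactly as written in the example. `lookupChain` and `negotiatedLanguages` are hypothetical names, and the truncation step is a simplified take on the usual Lookup fallback rather than a complete implementation of it.

```javascript
// Return a tag followed by its truncated forms, e.g. ["en-us", "en"].
function lookupChain(tag) {
  const subtags = tag.split("-");
  const chain = [];
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags.pop();
    // Also drop a dangling singleton such as "u" or "x".
    if (subtags.length > 0 && subtags[subtags.length - 1].length === 1) subtags.pop();
  }
  return chain;
}

// Ordered list of available tags that content could have been chosen from.
// The default locale is always included, at the end if nothing matched it.
function negotiatedLanguages(preferences, available, defaultLocale) {
  const byLowerCase = new Map(available.map((tag) => [tag.toLowerCase(), tag]));
  const result = [];
  for (const preference of preferences) {
    for (const candidate of lookupChain(preference.toLowerCase())) {
      const match = byLowerCase.get(candidate);
      if (match !== undefined) {
        if (!result.includes(match)) result.push(match);
        break; // the first (most specific) match wins for this preference
      }
    }
  }
  if (!result.includes(defaultLocale)) result.push(defaultLocale);
  return result;
}

const lang = negotiatedLanguages(
  ["en-US", "af", "en-AU", "jp"],
  ["en", "zh-Hans", "jp"],
  "jp"
);
console.log(lang); // [ 'en', 'jp' ]
```

Here "en-US" and "en-AU" both fall back to "en", "af" matches nothing, and "jp" matches directly, giving the ["en","jp"] result from the example.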
>  
>  
> > - When interpreting BCP 47 language tags, be careful in dealing with the extensions. In the main part of a language tag, later subtags are interpreted as specializations of earlier ones: E.g., zh-Hans-CN is Chinese written in simplified script as used in China. The commonly used Lookup algorithm can therefore chop off subtags from the end as a fallback strategy. This does not work for the Unicode extension: its keys define separate dimensions, and whether, say, a particular collation is supported has nothing to do with the calendar. That's why the ResolveLocale operation in the ECMAScript Internationalization spec treats the Unicode extension separately and with a different algorithm from the main part of the language tag.  
> When I specify the i18n model, I'm hoping to be able to defer to the ECMAScript Internationalization spec's abstract operations to handle both canonicalisation and ResolveLocale (and hopefully a few other things).
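To illustrate why the Unicode extension cannot simply be chopped off along with the other subtags, here is a small sketch that separates the two parts first. It assumes a runtime with `Intl.Locale` (newer than this thread) and is not the spec's ResolveLocale operation; `splitLocale` and `baseFallbacks` are hypothetical helper names.

```javascript
// Separate the main part of a tag from its Unicode extension keywords.
function splitLocale(tag) {
  const locale = new Intl.Locale(tag);
  return {
    base: locale.baseName, // language, script, region, variants only
    keywords: {
      calendar: locale.calendar,   // "ca" key
      collation: locale.collation, // "co" key
    },
  };
}

// Truncation fallback applied to the main part only. The keywords are left
// alone, since each one is an independent dimension to be checked separately.
function baseFallbacks(tag) {
  const subtags = new Intl.Locale(tag).baseName.split("-");
  const chain = [];
  while (subtags.length > 0) {
    chain.push(subtags.join("-"));
    subtags.pop();
  }
  return chain;
}
```

For "zh-Hans-CN-u-co-pinyin-ca-chinese", the fallback chain covers only zh-Hans-CN, zh-Hans and zh, while the collation and calendar requests can each be honoured or ignored on their own.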
> > Also, there are some keys that really should not be part of the locale: Which currency to use is a business decision and should never be left to the locale. The time zone usually should depend on where the user is or where an event occurs, and not be derived from the locale.
>  
>  
>  
> Right. The above are in-app decisions. Undoubtedly, this requires a good understanding of localising software.  
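As a sketch of what "in-app decisions" could look like with the ECMAScript Internationalization API, the currency and time zone below are passed in explicitly rather than derived from the language tag. The function name `makeFormatters` and its parameter names are assumptions made for this example.

```javascript
// Build formatters where only the presentation follows the UI language.
// Currency is a business decision; the time zone follows the event or user.
function makeFormatters(uiLanguage, businessCurrency, eventTimeZone) {
  return {
    price: new Intl.NumberFormat(uiLanguage, {
      style: "currency",
      currency: businessCurrency,
    }),
    when: new Intl.DateTimeFormat(uiLanguage, {
      dateStyle: "medium",
      timeStyle: "short",
      timeZone: eventTimeZone,
    }),
  };
}

// A German-language UI selling in US dollars for an event held in Tokyo.
const formatters = makeFormatters("de-DE", "USD", "Asia/Tokyo");
```

Calling `formatters.price.format(1234.5)` then renders a dollar amount with German number conventions, and `formatters.when.format(date)` renders Tokyo local time, regardless of what region the language tag mentions.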
>  
> Kind regards,
> Marcos
Received on Tuesday, 12 March 2013 10:08:31 UTC