RE: [IndexedDB] Spec changes for international language support from Pablo Castro on 2011-02-24 (public-webapps@w3.org from January to March 2011)

From: Pablo Castro <Pablo.Castro@microsoft.com>
Date: Thu, 24 Feb 2011 02:46:44 +0000
To: Jungshik Shin (신정식, 申政湜) <jshin@chromium.org>, Bjoern Hoehrmann <derhoermi@gmx.net>
CC: public-webapps WG <public-webapps@w3.org>
Message-ID: <F108E2F6BA743C4696146F0B7111C261072034@TK5EX14MBXC245.redmond.corp.microsoft.co>


From: jungshik@google.com [mailto:jungshik@google.com] On Behalf Of Jungshik Shin (신정식, 申政湜)
Sent: Tuesday, February 22, 2011 2:08 PM


>> On Fri, Feb 18, 2011 at 2:34 AM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
>> * Pablo Castro wrote:
>> >We discussed international language support last time at the TPAC and I
>> >said I'd propose spec text for it. Please find the patch below, the
>> >changes mirror exactly the proposal described in the bug we have for
>> >tracking this: http://www.w3.org/Bugs/Public/show_bug.cgi?id=9903

>> You should anticipate objections to that; collation is not a property of
>> language, for instance, for de-de you typically have dictionary sorting
>> and phone book sorting (and of course you have "de-de", "de-ch", and so
>> on, so "de" alone would be rather meaningless). So far the W3C and the
>> IETF have used resource identifiers to specify collations (see XPath 2.0
>> and RFC 4790) where the IETF allows shorthands like "i;ascii-casemap".
>>
>> I agree that simply specifying that 'language' be used without saying what it means is not sufficient. However, your examples (German phonebook vs. dictionary) can be covered with the language identifier framework laid out in BCP47 (with the 'u' extension).

Fair enough. I'll adjust this part of the write up to discuss this in terms of "collation identifier" or "language identifier".

>> I do understand that Microsoft uses an extension of language tags for
>> the `CultureInfo` in the .NET Framework, where, say, `de-DE_phoneb` is
>> used to refer to German phone book sorting, but BCP 47 does not allow
>> for that, 
>>
>> There's a way to specify alternate sorting orders (e.g. German phonebook, Chinese pinyin, stroke count, radical-stroke count order, etc.) under the BCP 47 framework because it has a mechanism for defining an extension and registering it. The Unicode Consortium uses that mechanism to define the 'u' extension and a set of subtags that can be used with 'u'.
>> For instance, German phonebook sorting can be identified with 'de-DE-u-co-phonebk'. See 
>>
>> https://tools.ietf.org/html/bcp47

>> https://tools.ietf.org/html/rfc6067

>> http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers

>>
>> Also, see Bug 9903 comment 6 by Mark Davis for more examples. Well, I'm just copying his comment directly here:
>>
>>
>> To add to what Jungshik said, BCP47 defines standard extensions. The extension
>> defined by the Unicode consortium
>> (http://cldr.unicode.org/index/bcp47-extension) provides for fine-grained
>> specifications of collation behavior.
>> Examples for German:
>> de-u-co-phonebk // phonebook order
>> de-u-kn-true // numeric sorting, e.g. Tom2 comes before Tom12
>> de-u-ks-level1 // ignore accents, case differences
>> de-u-ks-level2 // ignore case differences
>> de-u-ks-level1-kc-true // ignore accents, but not case
>> These can be combined, such as:
>> de-u-co-phonebk-kn-true-ks-level1-kc-true
>> 
>> neither could you devise a language tag to define something
>> like "i;ascii-casemap" (which simply defines A-Z = a-z).
>>
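
To make the "A-Z = a-z" remark concrete, here is a minimal sketch of a comparator in the spirit of "i;ascii-casemap". The names foldAscii and asciiCasemapCompare are hypothetical, and the sketch compares UTF-16 code units where RFC 4790 works on octets, so treat it as an approximation only.

```typescript
// Hypothetical helpers, loosely modelled on "i;ascii-casemap" (RFC 4790):
// fold a-z to A-Z, then compare position by position.
// Simplification: this compares UTF-16 code units, not octets.
function foldAscii(unit: number): number {
  // Only the 26 ASCII lowercase letters are folded; everything else is untouched.
  return unit >= 0x61 && unit <= 0x7a ? unit - 0x20 : unit;
}

function asciiCasemapCompare(a: string, b: string): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    const diff = foldAscii(a.charCodeAt(i)) - foldAscii(b.charCodeAt(i));
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  // All shared positions match: the shorter string sorts first.
  if (a.length === b.length) return 0;
  return a.length < b.length ? -1 : 1;
}

console.log(asciiCasemapCompare("Hello", "hELLO")); // ASCII case is ignored
console.log(asciiCasemapCompare("ä", "Ä"));         // non-ASCII case is not
```

Nothing here depends on a language, which is the point of the objection: a rule like this has no natural language tag.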

I'm not sure how specific we want to get on this. In particular, would it be better if we specified it all the way (including which extensions UAs need to support), or if we used BCP47 as the starting point and allowed UAs to support additional extensions as needed?
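
As an illustration of the "which extensions does a UA support" question, the sketch below passes two of the tags quoted above to Intl.Collator. This is an assumption for illustration: Intl.Collator comes from the ECMAScript Internationalization API as later standardized, not from the IndexedDB proposal, and whether "phonebk" is honoured depends on the runtime's locale data.

```typescript
// Illustrative only: Intl.Collator is not part of the IndexedDB proposal.
const names: string[] = ["Tom12", "Tom2", "Tom1"];

// "kn-true" requests numeric ordering, so digit runs compare by value.
const numericCollator = new Intl.Collator("de-u-kn-true");
const sortedNumeric: string[] = names.slice().sort(numericCollator.compare);

// "co-phonebk" requests German phonebook order.
const phonebookCollator = new Intl.Collator("de-u-co-phonebk");

// resolvedOptions() reports what the implementation actually applied, which
// may differ from the request if an extension is unsupported.
const phonebookResolved = phonebookCollator.resolvedOptions();

console.log(sortedNumeric.join(", "));
console.log(phonebookResolved.locale, phonebookResolved.collation);
console.log(numericCollator.resolvedOptions().numeric);
```

A design along these lines lets callers detect a silently ignored extension instead of assuming every UA supports the same set.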

>> I would expect that if browsers offer collations, there would be an in-
>> terface for that so you can use them in other places, as such it might
>> be wiser to accept something other than a language identifier string. 
>>
>> There's an ongoing effort to expose a 'rich' set of I18N APIs to client-side development using JavaScript ( http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api : the API used to be much more extensive than it is now, but it has been scaled down significantly to get more browsers on board in its first iteration). There we're likely to use BCP 47 with the 'u' extension (see above). So, I think it'd be better if IndexedDB matched what ECMAScript plans to do.

This is interesting. Do you know how far along this is?


>> I also note that collation often involves equivalence testing, but it
>> is not clear from your proposal whether that is the case here. It might
>> also be a good idea to clearly spell out interoperability expectations;
>> if two implementations support some collation, will they behave the same
>> for any and all inputs as far as collation is concerned, or should one
>> be prepared for slight differences among implementations?

I think it's more practical to assume that users should be prepared for slight differences among implementations.
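
To show why equivalence testing matters for keys, here is a rough sketch, again assuming Intl.Collator as a stand-in for whatever collation interface ends up being used. The helper distinctUnder is hypothetical. It counts how many keys stay distinct under a given collator, which is what a unique index would care about.

```typescript
// Sketch only: Intl.Collator and its "sensitivity" option are assumptions
// here, standing in for the collation behaviour the spec would define.
const baseCollator = new Intl.Collator("de", { sensitivity: "base" });
const variantCollator = new Intl.Collator("de", { sensitivity: "variant" });

// Hypothetical helper: keep only the first key of each equivalence class,
// i.e. drop any key that compares equal (0) to one already kept.
function distinctUnder(collator: Intl.Collator, keys: string[]): string[] {
  const kept: string[] = [];
  for (const key of keys) {
    if (!kept.some((existing) => collator.compare(existing, key) === 0)) {
      kept.push(key);
    }
  }
  return kept;
}

const keys: string[] = ["a", "A", "ä", "b"];
console.log(distinctUnder(baseCollator, keys).length);    // accents and case ignored
console.log(distinctUnder(variantCollator, keys).length); // all differences count
```

If two implementations disagree on whether a pair of keys compares equal, the same data could pass a uniqueness check in one and fail it in the other.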

Thanks
-pablo

Received on Thursday, 24 February 2011 02:47:18 UTC