Re: [IndexedDB] Spec changes for international language support

2011/2/23 Pablo Castro <Pablo.Castro@microsoft.com>:
>
> From: jungshik@google.com [mailto:jungshik@google.com] On Behalf Of Jungshik Shin
> Sent: Tuesday, February 22, 2011 2:08 PM
>
>
>>> On Fri, Feb 18, 2011 at 2:34 AM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
>>> * Pablo Castro wrote:
>>> >We discussed international language support last time at the TPAC and I
>>> >said I'd propose spec text for it. Please find the patch below, the
>>> >changes mirror exactly the proposal described in the bug we have for
>>> >tracking this: http://www.w3.org/Bugs/Public/show_bug.cgi?id=9903
>>> You should anticipate objections to that; collation is not a property of
>>> language, for instance, for de-de you typically have dictionary sorting
>>> and phone book sorting (and of course you have "de-de", "de-ch", and so
>>> on, so "de" alone would be rather meaningless). So far the W3C and the
>>> IETF have used resource identifiers to specify collations (see XPath 2.0
>>> and RFC 4790) where the IETF allows shorthands like "i;ascii-casemap".
>>>
>>> I agree that simply specifying that 'language' be used without saying what it means is not sufficient. However, your examples (German phonebook vs dictionary) can be covered with the language identifier framework laid out in BCP47 (with the 'u' extension).
>
> Fair enough. I'll adjust this part of the write up to discuss this in terms of "collation identifier" or "language identifier".
>
>>> I do understand that Microsoft uses an extension of language tags for
>>> the `CultureInfo` in the .NET Framework, where, say, `de-DE_phoneb` is
>>> used to refer to German phone book sorting, but BCP 47 does not allow
>>> for that,
>>>
>>> There's a way to specify alternate sorting orders (e.g. German phonebook, Chinese pinyin, stroke count, radical-stroke count order, etc) under the BCP 47 framework because it has a mechanism for defining an extension and registering it. The Unicode consortium uses that mechanism to define the 'u' extension and a set of subtags that can be used with 'u'.
>>> For instance, German phonebook sorting can be identified with 'de-DE-u-co-phonebk'. See
>>>
>>> https://tools.ietf.org/html/bcp47
>>> https://tools.ietf.org/html/rfc6067
>>> http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers
>>>
>>> Also, see Bug 9903 comment 6 by Mark Davis for more examples. Well, I'm just copying his comment directly here:
>>>
>>>
>>> To add to what Jungshik said, BCP47 defines standard extensions. The extension
>>> defined by the Unicode consortium
>>> (http://cldr.unicode.org/index/bcp47-extension) provides for fine-grained
>>> specifications of collation behavior.
>>> Examples for German:
>>> de-u-co-phonebk // phonebook order
>>> de-u-kn-true // numeric sorting, eg Tom2 comes before Tom12
>>> de-u-ks-level1 // ignore accents, case differences
>>> de-u-ks-level2 // ignore case differences
>>> de-u-ks-level1-kc-true // ignore accents, but not case
>>> These can be combined, such as:
>>> de-u-co-phonebk-kn-true-ks-level1-kc-true
>>>
>>> neither could you devise a language tag to define something
>>> like "i;ascii-casemap" (which simply defines A-Z = a-z).
>>>
>
> I'm not sure how specific we want to get into this. In particular, would it be better if we specified it all the way (including which extensions UAs need to support) or if we used BCP47 as the starting point and allowed UAs to support additional extensions as needed?

I think for now we should allow implementations to support additional
collations in addition to whatever set we specify. It seems to me
that this is an area that is heavily in flux and I'd hate to paint
ourselves into a corner.
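
To make the identifiers discussed above a bit more concrete, here is a
rough sketch of how the 'u' extension keys behave when handed to a
collator. It assumes the ECMAScript Internationalization API
(Intl.Collator, as later standardized in ECMA-402), which is not part
of the IndexedDB proposal in this thread; the helper name sortWith is
made up for the example. Note that this API exposes collation strength
through a 'sensitivity' option rather than the 'ks' key, so that part
is only an analogue of the 'ks-level1' example.

```typescript
// Sketch only: uses Intl.Collator (ECMA-402), not any IndexedDB API.
// `sortWith` is a hypothetical helper introduced for this example.
function sortWith(
  locale: string,
  values: string[],
  options?: Intl.CollatorOptions
): string[] {
  const collator = new Intl.Collator(locale, options);
  // Copy first so the caller's array is left untouched.
  return values.slice().sort(collator.compare);
}

const names = ["Mufti", "Müller", "Mueller"];

// Default German collation vs. phonebook collation selected via 'co'.
// In phonebook order "ü" is meant to sort like "ue", so "Müller" should
// land next to "Mueller" when the runtime ships that collation data.
const dictionaryOrder = sortWith("de", names);
const phonebookOrder = sortWith("de-u-co-phonebk", names);

// Numeric ordering via 'kn': runs of digits are compared by value.
const numericOrder = sortWith("de-u-kn-true", ["Tom12", "Tom2"]);

// Analogue of 'ks-level1': only base letters matter, so case and
// accent differences are ignored.
const baseOnly = new Intl.Collator("de", { sensitivity: "base" });
const ignoresCaseAndAccents = baseOnly.compare("a", "Ä") === 0;

console.log(dictionaryOrder, phonebookOrder, numericOrder, ignoresCaseAndAccents);
```

Whether a given tag is honored depends on the collation data the
implementation ships, which is one more reason to let UAs support
more than whatever minimum set we specify.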

>>> I would expect that if browsers offer collations, there would be an
>>> interface for that so you can use them in other places, as such it might
>>> be wiser to accept something other than a language identifier string.
>>>
>>> There's an on-going effort to expose a 'rich' set of I18N APIs to client-side development using JavaScript ( http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api : The API used to be much more extensive than now, but has been scaled down significantly to get more browsers on board in its 1st iteration). There we're likely to use BCP 47 with the 'u' extension (see above). So, I think it'd be better if IndexedDB matches what ECMAScript plans to do.
>
> This is interesting, do you know how far along this is?

And does someone have a link to drafts?

I suspect we don't want to wait for this work to finish, but we should
definitely track it and seek inspiration. And there are probably
people there that can review whatever we're doing.

>>> I also note that collation often involves equivalence testing, but it
>>> is not clear from your proposal whether that is the case here. It might
>>> also be a good idea to clearly spell out interoperability expectations;
>>> if two implementations support some collation, will they behave the same
>>> for any and all inputs as far as collation is concerned, or should one
>>> be prepared for slight differences among implementations?
>
> I think it's more practical to assume that users should be prepared for slight differences among implementations.

Don't the various specs define a strict sorting order for the
collations that they do define? If so it seems like implementations
should get things in the same order. Modulo bugs of course, but that
shouldn't affect the spec.

But in general I wouldn't be surprised if there are quality of
implementation issues for a while on this, which will lead to
differences in implementation behavior. I think we should defer to
other specs for the specification of the actual ordering, though, and
treat any differences as bugs in the implementations of those specs.

All in all, is there anything preventing adding the API Pablo suggests
in this thread to the IndexedDB spec drafts?

/ Jonas

Received on Tuesday, 8 March 2011 21:11:33 UTC