- From: Aryeh Gregor <Simetrical+w3c@gmail.com>
- Date: Sun, 1 May 2011 18:35:07 -0400
- To: Jonas Sicking <jonas@sicking.cc>
- Cc: Keean Schupke <keean@fry-it.com>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
On Fri, Apr 29, 2011 at 3:32 PM, Jonas Sicking <jonas@sicking.cc> wrote: > I agree that we will eventually want to standardize the set of allowed > collations. Similarly to how we'll want to standardize on one set of > charset encodings supported. However I don't think we, in this spec > community, have enough experience to come up with a good such set. So > it's something that I think we should postpone for now. As I > understand it there is work going on in this area in other groups, so > hopefully we can lean on that work eventually. (Disclaimer: I never really tried to figure out how IndexedDB works, and I haven't seen the past discussion on this topic. However, I know a decent amount about database collations in practice from my work with MediaWiki, which included adding collation support to category pages last summer on a contract with Wikimedia. Maybe everything I'm saying has already been brought up before and/or everyone knows it and/or it's wrong, in which case I apologize in advance.) The Unicode Collation Algorithm is the standard here: http://www.unicode.org/reports/tr10/ It's pretty stable (I think), and out of the box it provides *vastly* better sorting than binary sort. Binary sort doesn't even work for English unless you normalize case and avoid punctuation marks, and it's basically useless for most non-English languages. Some type of UCA support in browsers would be the way to go here. UCA doesn't work perfectly for all locales, though, because different locales sort the same strings differently (French handling of accents, etc.). The standard database of locale-specific collations is CLDR: http://cldr.unicode.org/ CLDR tends to have several new releases per year. For instance, 1.9.1 was released this March, three versions were released last year, and five were released in 2009. Just looking at the release notes, it seems that most if not all of these releases update collation details. Because of how collations are actually used in databases, any change to the collation version will require rebuilding any index that uses that collation. I don't think it's a good idea for browsers to try packaging such rapidly-changing locale data. If everyone had Chrome's release and support schedule, it might work okay -- if you figured out a way to handle updates gracefully -- but in practice, authors deal with a wide range of browser ages. It's not good if every user has a different implementation of each collation. Nor if browsers just use a frozen and obsolescent collation version. I also don't know how realistic implementers would find it to ship collation support for every language CLDR supports -- the CLDR download is a few megabytes zipped, but I don't know how much of that browsers would need to ship to support all its tailorings. The general solution here would be to allow the creation of indexes based on a user-supplied function. I.e., the user-supplied function would (in SQL terms) take the row's data as input, and output some binary string. That string would be used as the key in the index, instead of any of the column values for the row. PostgreSQL allows this, or so I've heard. Then you could implement UCA (optionally with CLDR tailorings) or any other collation algorithm you liked in JavaScript. Of course, we can't expect authors to reimplement the UCA if they want to get decent sorting. It would make sense for browsers to expose some default sort functions, but I'm not familiar enough with UCA or CLDR to say which ones would be best in practice. It might make sense to expose some medium-level primitives that would allow authors to easily overlay tailoring on the basic UCA algorithm, or something. Or maybe it would really make sense to expose all of CLDR's tailored collations. I'm not familiar enough with the specs to say. But for the sake of flexibility, allowing indexes based on user-defined functions is the way to go. (They're useful for things other than collations, too.) The proposed ECMAScript LocaleInfo.Collator looks like it doesn't currently support this use-case, since it provides only sort functions and not sortkey generation functions: http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api If browsers do provide sortkey generation functions based on UCA, some versioning mechanism will need to be used, particularly if it supports tailored sortkeys. FWIW, MySQL provides some built-in collation support, but MediaWiki doesn't use it, because it supports too few languages and is too inflexible. MediaWiki's stock localization has 99% support for the 500 most-used messages in 175 different languages, and the couple dozen locales that MySQL supports aren't acceptable for us. Instead, we store everything with a binary collation, and are moving to a system where we compute the UCA sortkeys ourselves and put them in their own column, which we use for sorting. MediaWiki's i18n people can be reached in #mediawiki-i18n on freenode or the Mediawiki-i18n list <https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n> -- some of them know a fair bit about things like CLDR and would be happy to provide advice, if expertise is needed. (But I imagine people at the Unicode Consortium would be the best ones to ask!)
Received on Sunday, 1 May 2011 22:35:55 UTC