Re: [IndexedDB] Closing on bug 9903 (collations)

From: Aryeh Gregor <Simetrical+w3c@gmail.com> · Date: Sun, 1 May 2011 18:35:07 -0400

On Fri, Apr 29, 2011 at 3:32 PM, Jonas Sicking <jonas@sicking.cc> wrote:
> I agree that we will eventually want to standardize the set of allowed
> collations. Similarly to how we'll want to standardize on one set of
> charset encodings supported. However I don't think we, in this spec
> community, have enough experience to come up with a good such set. So
> it's something that I think we should postpone for now. As I
> understand it there is work going on in this area in other groups, so
> hopefully we can lean on that work eventually.

(Disclaimer: I never really tried to figure out how IndexedDB works,
and I haven't seen the past discussion on this topic.  However, I know
a decent amount about database collations in practice from my work
with MediaWiki, which included adding collation support to category
pages last summer on a contract with Wikimedia.  Maybe everything I'm
saying has already been brought up before and/or everyone knows it
and/or it's wrong, in which case I apologize in advance.)

The Unicode Collation Algorithm is the standard here:

http://www.unicode.org/reports/tr10/

It's pretty stable (I think), and out of the box it provides *vastly*
better sorting than binary sort.  Binary sort doesn't even work for
English unless you normalize case and avoid punctuation marks, and
it's basically useless for most non-English languages.  Some type of
UCA support in browsers would be the way to go here.
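
To make the binary-sort problem concrete, here is a small sketch in
Python (used only for illustration; the word list and function names
are made up for this example).  Python compares strings by code point,
which gives the same order as a binary sort of UTF-8 bytes.

```python
# Illustrative word list; "\u00e9clair" is "eclair" with an acute accent.
words = ["apple", "Zebra", "banana", "eel", "\u00e9clair"]

def binary_sort(strings):
    """Sort by raw code point, as a binary collation effectively does."""
    return sorted(strings)

def case_folded_sort(strings):
    """Fold case before comparing; accents are still compared raw."""
    return sorted(strings, key=str.casefold)

# Uppercase letters sort before all lowercase ones, and the accented
# letter sorts after every unaccented ASCII letter.
print(binary_sort(words))       # ['Zebra', 'apple', 'banana', 'eel', 'éclair']

# Case folding fixes "Zebra" but the accented word is still last.
print(case_folded_sort(words))  # ['apple', 'banana', 'eel', 'Zebra', 'éclair']
```

Normalizing case is enough to rescue the English words, but the
accented word still lands after "Zebra", which is the kind of result
UCA is designed to avoid.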

UCA doesn't work perfectly for all locales, though, because different
locales sort the same strings differently (French handling of accents,
etc.).  The standard database of locale-specific collations is CLDR:

http://cldr.unicode.org/

CLDR tends to have several new releases per year.  For instance, 1.9.1
was released this March, three versions were released last year, and
five were released in 2009.  Just looking at the release notes, it
seems that most if not all of these releases update collation details.
Because of how collations are actually used in databases, any change
to the collation version will require rebuilding any index that uses
that collation.

I don't think it's a good idea for browsers to try packaging such
rapidly-changing locale data.  If everyone had Chrome's release and
support schedule, it might work okay -- if you figured out a way to
handle updates gracefully -- but in practice, authors deal with a wide
range of browser ages.  It's not good if every user has a different
implementation of each collation, nor if browsers just ship a frozen
and obsolescent collation version.  I also don't know how realistic
implementers would find it to ship collation support for every
language CLDR supports -- the CLDR download is a few megabytes zipped,
but I don't know how much of that browsers would need to ship to
support all its tailorings.

The general solution here would be to allow the creation of indexes
based on a user-supplied function.  I.e., the user-supplied function
would (in SQL terms) take the row's data as input, and output some
binary string.  That string would be used as the key in the index,
instead of any of the column values for the row.  PostgreSQL allows
this, or so I've heard.  Then you could implement UCA (optionally with
CLDR tailorings) or any other collation algorithm you liked in
JavaScript.
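
A minimal sketch of an index keyed on a user-supplied function, in
Python for illustration only.  FunctionIndex and toy_sortkey are
hypothetical names, not part of IndexedDB or any database, and
toy_sortkey (strip accents, fold case) is a crude stand-in for a real
collation, not an implementation of UCA.

```python
import bisect
import unicodedata

def toy_sortkey(title: str) -> bytes:
    """Hypothetical key function: strip accents, fold case, emit bytes."""
    decomposed = unicodedata.normalize("NFKD", title)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold().encode("utf-8")

class FunctionIndex:
    """Keeps (key, row_id) pairs sorted by the bytes the key function returns."""

    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.entries = []

    def add(self, row_id, row):
        # The index stores the function's output, not any column value.
        bisect.insort(self.entries, (self.key_fn(row), row_id))

    def row_ids_in_order(self):
        return [row_id for _, row_id in self.entries]

rows = {
    1: {"title": "Zebra"},
    2: {"title": "\u00e9clair"},
    3: {"title": "apple"},
    4: {"title": "Eel"},
}
index = FunctionIndex(lambda row: toy_sortkey(row["title"]))
for row_id, row in rows.items():
    index.add(row_id, row)

ordered_titles = [rows[i]["title"] for i in index.row_ids_in_order()]
print(ordered_titles)  # ['apple', 'éclair', 'Eel', 'Zebra']
```

The index itself only ever compares byte strings, so any collation
that can be expressed as a sortkey function can be plugged in without
the storage layer knowing anything about locales.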

Of course, we can't expect authors to reimplement the UCA if they want
to get decent sorting.  It would make sense for browsers to expose
some default sort functions, but I'm not familiar enough with UCA or
CLDR to say which ones would be best in practice.  It might make sense
to expose some medium-level primitives that would allow authors to
easily overlay tailoring on the basic UCA algorithm, or something.  Or
maybe it would really make sense to expose all of CLDR's tailored
collations.  I'm not familiar enough with the specs to say.  But for
the sake of flexibility, allowing indexes based on user-defined
functions is the way to go.  (They're useful for things other than
collations, too.)

The proposed ECMAScript LocaleInfo.Collator looks like it doesn't
currently support this use-case, since it provides only sort functions
and not sortkey generation functions:
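
The difference between the two kinds of function can be sketched as
follows, in Python for illustration.  Both functions are hypothetical
and use simple case folding rather than the proposed Collator API or
UCA.

```python
import functools

def compare_casefolded(a: str, b: str) -> int:
    """Comparator style: needs both strings at once, returns -1, 0 or 1."""
    fa, fb = a.casefold(), b.casefold()
    return (fa > fb) - (fa < fb)

def sortkey_casefolded(s: str) -> bytes:
    """Sortkey style: one string in, bytes out; bytewise order of the
    keys matches the order the comparator above would give."""
    return s.casefold().encode("utf-8")

titles = ["banana", "Apple", "cherry"]

# A comparator can sort values that are all in memory together.
by_comparator = sorted(titles, key=functools.cmp_to_key(compare_casefolded))

# A sortkey can be computed once per value and stored as the index key.
stored_keys = {title: sortkey_casefolded(title) for title in titles}
by_sortkey = sorted(titles, key=stored_keys.get)

print(by_comparator)  # ['Apple', 'banana', 'cherry']
print(by_sortkey)     # ['Apple', 'banana', 'cherry']
```

Both give the same order, but only the sortkey form produces something
that can be written into an index and compared later as plain bytes.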

http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api

If browsers do provide sortkey generation functions based on UCA, some
versioning mechanism will need to be used, particularly if they support
tailored sortkeys.
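
One possible shape for such a versioning mechanism, sketched in Python
for illustration.  Everything here is an assumption: the VersionedIndex
type, the version labels "v1" and "v2", and the two toy key functions
are invented for the example and do not correspond to real UCA or CLDR
versions.

```python
import unicodedata
from dataclasses import dataclass, field

@dataclass
class VersionedIndex:
    collation_version: str                       # version the keys were built with
    entries: list = field(default_factory=list)  # sorted (sortkey, row_id) pairs

def build_index(rows, sortkey_fn, version):
    entries = sorted((sortkey_fn(value), row_id) for row_id, value in rows.items())
    return VersionedIndex(collation_version=version, entries=entries)

def open_index(index, rows, sortkey_fn, current_version):
    """Reuse the stored index only if its keys match the current version."""
    if index.collation_version == current_version:
        return index, False
    # Stored keys came from a different collation, so rebuild them all.
    return build_index(rows, sortkey_fn, current_version), True

def key_v1(s):
    """Hypothetical old collation: case folding only."""
    return s.casefold().encode("utf-8")

def key_v2(s):
    """Hypothetical new collation: also ignores accents."""
    decomposed = unicodedata.normalize("NFKD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold().encode("utf-8")

rows = {1: "eel", 2: "\u00e9clair"}
old_index = build_index(rows, key_v1, "v1")
same_index, rebuilt_same = open_index(old_index, rows, key_v1, "v1")
new_index, rebuilt_new = open_index(old_index, rows, key_v2, "v2")

print([row_id for _, row_id in old_index.entries])  # [1, 2]
print([row_id for _, row_id in new_index.entries])  # [2, 1]
```

The order of the two rows flips between versions, which is why keys
built under one version cannot be mixed with keys from another and the
whole index has to be rebuilt.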

FWIW, MySQL provides some built-in collation support, but MediaWiki
doesn't use it, because it supports too few languages and is too
inflexible.  MediaWiki's stock localization has 99% support for the
500 most-used messages in 175 different languages, and the couple
dozen locales that MySQL supports aren't acceptable for us.  Instead,
we store everything with a binary collation, and are moving to a
system where we compute the UCA sortkeys ourselves and put them in
their own column, which we use for sorting.  MediaWiki's i18n people
can be reached in #mediawiki-i18n on freenode or the Mediawiki-i18n
list <https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n> --
some of them know a fair bit about things like CLDR and would be happy
to provide advice, if expertise is needed.  (But I imagine people at
the Unicode Consortium would be the best ones to ask!)
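
The sortkey-column approach described above can be sketched with
Python's built-in sqlite3 module, purely for illustration.  The table
and column names are hypothetical and are not MediaWiki's actual
schema, SQLite stands in for MySQL, and toy_sortkey is a crude
accent-stripping, case-folding stand-in for real UCA sortkeys.

```python
import sqlite3
import unicodedata

def toy_sortkey(title: str) -> bytes:
    """Hypothetical stand-in for a UCA sortkey: strip accents, fold case."""
    decomposed = unicodedata.normalize("NFKD", title)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold().encode("utf-8")

conn = sqlite3.connect(":memory:")
# The title keeps the default binary collation; the sortkey is a BLOB
# computed by the application and indexed separately.
conn.execute("CREATE TABLE page (title TEXT NOT NULL, sortkey BLOB NOT NULL)")
conn.execute("CREATE INDEX page_sortkey ON page (sortkey)")

titles = ["Zebra", "\u00e9clair", "apple", "Eel"]
conn.executemany(
    "INSERT INTO page (title, sortkey) VALUES (?, ?)",
    [(title, toy_sortkey(title)) for title in titles],
)

by_title = [row[0] for row in conn.execute("SELECT title FROM page ORDER BY title")]
by_sortkey = [row[0] for row in conn.execute("SELECT title FROM page ORDER BY sortkey")]

print(by_title)    # ['Eel', 'Zebra', 'apple', 'éclair']
print(by_sortkey)  # ['apple', 'éclair', 'Eel', 'Zebra']
```

The database only ever does bytewise comparison; all the collation
knowledge lives in the function that fills the sortkey column.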