Re: [IndexedDB] Closing on bug 9903 (collations)

From: Keean Schupke <keean@fry-it.com> · Date: Mon, 2 May 2011 10:49:09 +0100

On Sunday, 1 May 2011, Aryeh Gregor <Simetrical+w3c@gmail.com> wrote:
> On Fri, Apr 29, 2011 at 3:32 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> I agree that we will eventually want to standardize the set of allowed
>> collations. Similarly to how we'll want to standardize on one set of
>> charset encodings supported. However I don't think we, in this spec
>> community, have enough experience to come up with a good such set. So
>> it's something that I think we should postpone for now. As I
>> understand it there is work going on in this area in other groups, so
>> hopefully we can lean on that work eventually.
>
> (Disclaimer: I never really tried to figure out how IndexedDB works,
> and I haven't seen the past discussion on this topic.  However, I know
> a decent amount about database collations in practice from my work
> with MediaWiki, which included adding collation support to category
> pages last summer on a contract with Wikimedia.  Maybe everything I'm
> saying has already been brought up before and/or everyone knows it
> and/or it's wrong, in which case I apologize in advance.)
>
> The Unicode Collation Algorithm is the standard here:
>
> http://www.unicode.org/reports/tr10/
>
> It's pretty stable (I think), and out of the box it provides *vastly*
> better sorting than binary sort.  Binary sort doesn't even work for
> English unless you normalize case and avoid punctuation marks, and
> it's basically useless for most non-English languages.  Some type of
> UCA support in browsers would be the way to go here.
>
> UCA doesn't work perfectly for all locales, though, because different
> locales sort the same strings differently (French handling of accents,
> etc.).  The standard database of locale-specific collations is CLDR:
>
> http://cldr.unicode.org/
>
> CLDR tends to have several new releases per year.  For instance, 1.9.1
> was released this March, three versions were released last year, and
> five were released in 2009.  Just looking at the release notes, it
> seems that most if not all of these releases update collation details.
>  Because of how collations are actually used in databases, any change
> to the collation version will require rebuilding any index that uses
> that collation.
>
> I don't think it's a good idea for browsers to try packaging such
> rapidly-changing locale data.  If everyone had Chrome's release and
> support schedule, it might work okay -- if you figured out a way to
> handle updates gracefully -- but in practice, authors deal with a wide
> range of browser ages.  It's not good if every user has a different
> implementation of each collation.  Nor if browsers just use a frozen
> and obsolescent collation version.  I also don't know how realistic
> implementers would find it to ship collation support for every
> language CLDR supports -- the CLDR download is a few megabytes zipped,
> but I don't know how much of that browsers would need to ship to
> support all its tailorings.
>
> The general solution here would be to allow the creation of indexes
> based on a user-supplied function.  I.e., the user-supplied function
> would (in SQL terms) take the row's data as input, and output some
> binary string.  That string would be used as the key in the index,
> instead of any of the column values for the row.  PostgreSQL allows
> this, or so I've heard.  Then you could implement UCA (optionally with
> CLDR tailorings) or any other collation algorithm you liked in
> JavaScript.
>
> Of course, we can't expect authors to reimplement the UCA if they want
> to get decent sorting.  It would make sense for browsers to expose
> some default sort functions, but I'm not familiar enough with UCA or
> CLDR to say which ones would be best in practice.  It might make sense
> to expose some medium-level primitives that would allow authors to
> easily overlay tailoring on the basic UCA algorithm, or something.  Or
> maybe it would really make sense to expose all of CLDR's tailored
> collations.  I'm not familiar enough with the specs to say.  But for
> the sake of flexibility, allowing indexes based on user-defined
> functions is the way to go.  (They're useful for things other than
> collations, too.)
>
> The proposed ECMAScript LocaleInfo.Collator looks like it doesn't
> currently support this use-case, since it provides only sort functions
> and not sortkey generation functions:
>
> http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api
>
> If browsers do provide sortkey generation functions based on UCA, some
> versioning mechanism will need to be used, particularly if it supports
> tailored sortkeys.
>
>
> FWIW, MySQL provides some built-in collation support, but MediaWiki
> doesn't use it, because it supports too few languages and is too
> inflexible.  MediaWiki's stock localization has 99% support for the
> 500 most-used messages in 175 different languages, and the couple
> dozen locales that MySQL supports aren't acceptable for us.  Instead,
> we store everything with a binary collation, and are moving to a
> system where we compute the UCA sortkeys ourselves and put them in
> their own column, which we use for sorting.  MediaWiki's i18n people
> can be reached in #mediawiki-i18n on freenode or the Mediawiki-i18n
> list <https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n> --
> some of them know a fair bit about things like CLDR and would be happy
> to provide advice, if expertise is needed.  (But I imagine people at
> the Unicode Consortium would be the best ones to ask!)
>

I agree with this. My thoughts are, we need a binary sort mode as not
all data is text. A binary sort that operates on arrays of bytes,
comparing bytes first to last (like a simple ASCII string compare but
with all values 0 - 255 being valid, using a length and not zero
terminated). You can then provide a mapping function from any ordering
to this standard representation (this is not length preserving) which
allows for fast sorting and searching. (fast because you convert the
query string to the standard representation, and then use the standard
binary comparison which is simple and fast). The index only needs to
store the standard representation.

Cheers,
Keean.