Re: [IndexedDB] Languages for collation from Jeremy Orlow on 2010-08-12 (public-webapps@w3.org from July to September 2010)

From: Jeremy Orlow <jorlow@chromium.org>
Date: Thu, 12 Aug 2010 10:17:33 +0100
To: Pablo Castro <Pablo.Castro@microsoft.com>
Cc: Mikeal Rogers <mikeal.rogers@gmail.com>, public-webapps WG <public-webapps@w3.org>
Message-ID: <AANLkTin_8UpVSevdQrPZsDGq91RXnzy5zfyoWzhGn4ho@mail.gmail.com>

On Thu, Aug 12, 2010 at 8:28 AM, Pablo Castro <Pablo.Castro@microsoft.com> wrote:

>
> From: Mikeal Rogers [mailto:mikeal.rogers@gmail.com]
> Sent: Wednesday, August 11, 2010 11:35 PM
>
> >> Why not just use the unicode collation algorithm?
> >>
> >> Then you won't have to hint the locale.
>
> Unless I'm missing something, the UCA defines the general algorithm for
> collating strings but you still need to know the language in order to sort
> strings properly in that language. For example, in Spanish the letters "c"
> and "h" together (e.g. in "chau" (bye)) sort as a single letter, causing
> the expected sort order to be different from English where they are always
> two independent letters (e.g. so "chau" comes before "cuando" (when) when
> sorted in English, but after when sorted in Spanish).
>
> >>
> >> http://en.wikipedia.org/wiki/Unicode_collation_algorithm
> >>
> >> CouchDB uses some definitions around sorting complex types like arrays
> and objects but when it comes down to sorting strings it just defaults to
> the unicode collation algorithm and all the locales are happy.
> >>
> >> -Mikeal
> >>
> >> On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro <
> Pablo.Castro@microsoft.com> wrote:
> >> We had some discussions about collation algorithms and such in the past,
> but I don't think we have settled on the language aspect of it. In order to
> have stores and indexes sort character-based keys in a way that is
> consistent with users' expectations we'll have to take indication in the API
> of what language we should use to collate strings.
> >>
> >> Trying to take a minimalist approach, we could add an optional parameter
> on the database open call that indicates the language to use (e.g. "en" or
> "en-GB", etc.). If the language is not specified and the database does not
> exist, then we can use the current browser/OS language to create the
> database. If not specified and the database already exists, then use the one
> that's already there (this accommodates the fact that a user may be able to
> change their default language in the browser/OS after the database has been
> created using the default). If the language is specified and the database
> already exists and the specified language is not the one the database has
> then we'll throw an exception (same behavior as with "description", although
> we have that one in flight right now as well).
> >>
> >> We should probably also add a read-only attribute to the database object
> that exposes the language.
>
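To make the proposed rules concrete, here is a rough model of them in Python. Everything in it is hypothetical: open_database, LanguageMismatchError, the in-memory _databases dict and the default_language parameter (standing in for the current browser/OS language) are made up for illustration and are not part of the IndexedDB draft.

```python
class LanguageMismatchError(Exception):
    """Hypothetical error for opening a database with a conflicting language."""


# Hypothetical stand-in for persistent storage: database name -> language.
_databases = {}


def open_database(name, language=None, default_language="en"):
    """Return the collation language in effect after applying the proposed rules.

    default_language is an assumption standing in for the browser/OS language.
    """
    existing = _databases.get(name)
    if existing is None:
        # New database: use the requested language, else the current default.
        _databases[name] = language if language is not None else default_language
    elif language is not None and language != existing:
        # Existing database with a different explicit language: reject.
        raise LanguageMismatchError(
            f"{name!r} was created with {existing!r}, not {language!r}"
        )
    # Existing database and no (or a matching) language: keep what is stored.
    return _databases[name]
```

The value returned here plays the role of the read-only language attribute on the database object. Note that a later change of default_language has no effect once the database exists, which is the case the proposal calls out.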

I think we should first break down the use cases and look at how many of
them just need _a_ sort order, for how many a per-database sort order is
OK, and how many would need something finer grained (like a per-key
ordering).

Are there work-arounds for getting a UCA-ordered data structure to hold
data in another language's order?  For example, I could imagine it'd be possible
to do some sort of encode step on the data before insertion (and decode on
removal) that would make UCA work.  I have no idea, but if such algorithms
existed and were well understood, then it'd definitely make me lean towards
punting language specification to v2.
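As a minimal sketch of what such an encode step might look like, assuming the traditional Spanish ordering Pablo describes ("ch" and "ll" as single letters) and lowercase input only: each letter is mapped to a symbol whose plain ordering matches the desired alphabet order. The names ALPHABET, encode and decode are hypothetical, and this is a toy, not an established algorithm. The symbols are digits followed by ASCII letters, which sort in that order by code point and, as far as I know, under default UCA as well, though only the code point case is exercised here.

```python
import string

# Hypothetical alphabet: traditional Spanish order with "ch" and "ll"
# treated as single letters. Lowercase only; anything else raises KeyError.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
DIGRAPHS = {"ch", "ll"}

# One output symbol per letter, chosen so symbol order equals alphabet order.
SYMBOLS = (string.digits + string.ascii_lowercase)[:len(ALPHABET)]
TO_SYMBOL = dict(zip(ALPHABET, SYMBOLS))
FROM_SYMBOL = dict(zip(SYMBOLS, ALPHABET))


def encode(word):
    """Encode a word so that plain string order matches the alphabet above."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            # Consume the digraph as one collation element.
            out.append(TO_SYMBOL[pair])
            i += 2
        else:
            out.append(TO_SYMBOL[word[i]])
            i += 1
    return "".join(out)


def decode(encoded):
    """Reverse encode(): map each symbol back to its letter or digraph."""
    return "".join(FROM_SYMBOL[s] for s in encoded)


words = ["cuando", "chau", "cama"]
plain_order = sorted(words)                # code point order
spanish_order = sorted(words, key=encode)  # order of the encoded keys
```

With this, the encoded form would be what gets stored as the key and decode would run on the way out. It obviously does nothing for case, accents or mixed-language data, which is where I'd expect the real difficulty to be.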

J


> >>
> >> If this works for folks I can write a proposal for the specific changes
> to the spec.
> >>
> >> Thanks
> >> -pablo
>
>
>
>

Received on Thursday, 12 August 2010 09:18:23 UTC