- From: Jeremy Orlow <jorlow@chromium.org>
- Date: Thu, 12 Aug 2010 10:17:33 +0100
- To: Pablo Castro <Pablo.Castro@microsoft.com>
- Cc: Mikeal Rogers <mikeal.rogers@gmail.com>, public-webapps WG <public-webapps@w3.org>
- Message-ID: <AANLkTin_8UpVSevdQrPZsDGq91RXnzy5zfyoWzhGn4ho@mail.gmail.com>
On Thu, Aug 12, 2010 at 8:28 AM, Pablo Castro <Pablo.Castro@microsoft.com> wrote:

> From: Mikeal Rogers [mailto:mikeal.rogers@gmail.com]
> Sent: Wednesday, August 11, 2010 11:35 PM
>
> >> Why not just use the unicode collation algorithm?
> >>
> >> Then you won't have to hint the locale.
>
> Unless I'm missing something, the UCA defines the general algorithm for
> collating strings but you still need to know the language in order to sort
> strings properly in that language. For example, in Spanish the letters "c"
> and "h" together (e.g. in "chau" (bye)) sort as a single letter, causing
> the expected sort order to be different from English where they are always
> two independent letters (e.g. so "chau" comes before "cuando" (when) when
> sorted in English, but after when sorted in Spanish).
>
> >>
> >> http://en.wikipedia.org/wiki/Unicode_collation_algorithm
> >>
> >> CouchDB uses some definitions around sorting complex types like arrays
> >> and objects but when it comes down to sorting strings it just defaults to
> >> the unicode collation algorithm and all the locales are happy.
> >>
> >> -Mikeal
> >>
> >> On Wed, Aug 11, 2010 at 11:28 PM, Pablo Castro <Pablo.Castro@microsoft.com> wrote:
> >> We had some discussions about collation algorithms and such in the past,
> >> but I don't think we have settled on the language aspect of it. In order to
> >> have stores and indexes sort character-based keys in a way that is
> >> consistent with users' expectations we'll have to take indication in the API
> >> of what language we should use to collate strings.
> >>
> >> Trying to take a minimalist approach, we could add an optional parameter
> >> on the database open call that indicates the language to use (e.g. "en" or
> >> "en-UK", etc.). If the language is not specified and the database does not
> >> exist, then we can use the current browser/OS language to create the
> >> database. If not specified and the database already exists, then use the one
> >> that's already there (this accommodates the fact that a user may be able to
> >> change their default language in the browser/OS after the database has been
> >> created using the default). If the language is specified and the database
> >> already exists and the specified language is not the one the database has
> >> then we'll throw an exception (same behavior as with "description", although
> >> we have that one in flight right now as well).
> >>
> >> We should probably also add a read-only attribute to the database object
> >> that exposes the language.

I think we should first break down the use cases and look at how many of
them just need _a_ sort order, for how many a per-database sort order is
OK, and how many would need something finer grained (like a per-key
ordering).

Are there work-arounds for getting a UCA-ordered data structure to hold
data in another language's order? For example, I could imagine it'd be
possible to do some sort of encode step on the data before insertion (and
decode on removal) that would make UCA work. I have no idea, but if such
algorithms existed and were well understood, then it'd definitely make me
lean towards punting language specification to v2.

J

> >>
> >> If this works for folks I can write a proposal for the specific changes
> >> to the spec.
> >>
> >> Thanks
> >> -pablo
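
To make the encode/decode idea above a bit more concrete, here is a rough sketch using the Spanish "ch" example from earlier in the thread. Everything in it is an assumption for illustration only: the names `encode_key`, `decode_key` and `CH_MARKER` are made up, it handles only lowercase "ch", it assumes the marker character never appears in real keys, and it relies on plain code point ordering rather than a real UCA implementation. It is not a claim that such a scheme works in general.

```python
# Hypothetical sketch: none of these names come from the thread or any spec.
# Assumption: the store sorts keys by plain code point order (a stand-in for
# a language-neutral ordering), and CH_MARKER never occurs in real keys.
CH_MARKER = "\U0010FFFF"  # highest code point, so it sorts after any letter


def encode_key(key: str) -> str:
    """Rewrite lowercase 'ch' so it sorts after every other 'c...' sequence."""
    return key.replace("ch", "c" + CH_MARKER)


def decode_key(stored: str) -> str:
    """Undo encode_key when reading a key back out of the store."""
    return stored.replace("c" + CH_MARKER, "ch")


words = ["cuando", "chau", "dia", "casa"]

# Order without any encoding: "chau" sorts before "cuando".
plain_order = sorted(words)

# Order after encoding on insert and decoding on read:
# "chau" now sorts after "cuando", as in the Spanish example.
spanish_order = [decode_key(k) for k in sorted(encode_key(w) for w in words)]

print(plain_order)
print(spanish_order)
```

Even in this toy form the open questions show up quickly: the encoding depends on the language, so the application still has to know which language it is targeting, and it would need a separate rule for every tailoring that language requires.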
Received on Thursday, 12 August 2010 09:18:23 UTC