RE: [IndexedDB] Closing on bug 9903 (collations) from Pablo Castro on 2011-06-17 (public-webapps@w3.org from April to June 2011)

From: Pablo Castro <Pablo.Castro@microsoft.com>
Date: Fri, 17 Jun 2011 18:43:51 +0000
To: Keean Schupke <keean@fry-it.com>
CC: Aryeh Gregor <Simetrical+w3c@gmail.com>, Jonas Sicking <jonas@sicking.cc>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <F108E2F6BA743C4696146F0B7111C26129597F@TK5EX14MBXC244.redmond.corp.microsoft.co>
From: keean.schupke@googlemail.com [mailto:keean.schupke@googlemail.com] On Behalf Of Keean Schupke
Sent: Tuesday, May 31, 2011 11:51 PM

>> On 1 June 2011 01:37, Pablo Castro <Pablo.Castro@microsoft.com> wrote:
>>
>> -----Original Message-----
>> From: simetrical@gmail.com [mailto:simetrical@gmail.com] On Behalf Of Aryeh Gregor
>> Sent: Tuesday, May 31, 2011 3:49 PM
>>
>> >> On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
>> >> <Pablo.Castro@microsoft.com> wrote:
>> >> > No, that was poor wording on my part, I keep using "locale" in the wrong context. I meant to have the API take a proper collation identifier. The identifier can be as specific as the caller wants it to be. The implementation could choose to not honor some specific detail if it can't handle it (to the extent that doing so is allowed by the specification of collation names), or fail because it considers that not handling a particular aspect of the collation identifier would severely deviate from the caller's expectations.
>> >>
>> >> I'm not sure I understand you.  My personal opinion is that there
>> >> should be no undefined behavior here.  If authors are allowed to pass
>> >> collation identifiers, the spec needs to say exactly how they're to be
>> >> interpreted, so the same identifier passed to two different browsers
>> >> will result in the same collation, i.e., the same strings need to sort
>> >> the same cross-browser.  Having only binary collation is better than
>> >> having non-binary collations but not defining them, IMO.
>> I thought BCP47 allowed implementations to drop subtags if needed. I just re-read the spec, and it seems to allow that only in constrained cases where the whole name can't fit in your buffer (which wouldn't apply to the context discussed here). My first instinct is that full consistency in collation is quite a bit to guarantee, but that seems to be what the spec is shooting for.
>>
>> >> > Given the amount of debate on this, could we at least agree that we can do binary for v1? We can then have an open item for v2 on taking collation names and sort according to UCA or taking callbacks and such.
>> >>
>> >> I'm okay with supporting only binary to start with.
>> Great. I'll still wait a bit to see what other folks think, and then update the bug one way or the other.
>>
>> Thanks
>> -pablo
>>
>> The discussion sounds like it is headed in the right direction. Are there any issues with non-Unicode encodings that need to be dealt with (HTTP headers default to ISO-8859, I think)? Would people be expected to convert into UTF-16 strings on read, or use typed arrays?

I asked around here, and folks pointed out that the JavaScript spec seems to describe exactly what we need. In [1], section 11.8.5, the relevant fragment, starting at step 4, reads:

Else, both px and py are Strings
    a. If py is a prefix of px, return false. (A String value p is a prefix of String value q if q can be the result of concatenating p and some other String r. Note that any String is a prefix of itself, because r may be the empty String.)
    b. If px is a prefix of py, return true.
    c. Let k be the smallest nonnegative integer such that the character at position k within px is different from the character at position k within py. (There must be such a k, for neither String is a prefix of the other.)
    d. Let m be the integer that is the code unit value for the character at position k within px.
    e. Let n be the integer that is the code unit value for the character at position k within py.
    f. If m < n, return true. Otherwise, return false.
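
As a rough sketch of steps a through f (not spec text; the function name lessThanByCodeUnits is made up here purely for illustration), the comparison could be written like this:

```typescript
// Hypothetical helper, not part of any spec or API: returns true when px
// sorts before py under the string rules quoted above.
function lessThanByCodeUnits(px: string, py: string): boolean {
  // a. py is a prefix of px (this also covers px === py): not less than.
  if (px.slice(0, py.length) === py) return false;
  // b. px is a prefix of py: less than.
  if (py.slice(0, px.length) === px) return true;
  // c. Find the smallest k at which the two strings differ. Such a k must
  //    exist because neither string is a prefix of the other.
  let k = 0;
  while (px.charCodeAt(k) === py.charCodeAt(k)) k++;
  // d, e. Take the UTF-16 code unit values at position k.
  const m = px.charCodeAt(k);
  const n = py.charCodeAt(k);
  // f. Compare them as plain integers.
  return m < n;
}
```

For any two strings this should agree with what the built-in `<` operator already returns, which is the point of reusing the definition.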

The same section also carries a note:

NOTE 2 The comparison of Strings uses a simple lexicographic ordering on sequences of code unit values. There is no attempt to use the more complex, semantically oriented definitions of character or string equality and collating order defined in the Unicode specification. Therefore String values that are canonically equal according to the Unicode standard could test as unequal. In effect this algorithm assumes that both Strings are already in normalised form. Also, note that for strings containing supplementary characters, lexicographic ordering on sequences of UTF-16 code unit values differs from that on sequences of code point values.
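
Two consequences of that note can be checked with the built-in `<` and `===` operators on strings. This is only an illustration, and the sample strings are my own picks rather than anything taken from the spec:

```typescript
// U+FF5E (FULLWIDTH TILDE) fits in a single UTF-16 code unit, 0xFF5E.
const bmp: string = "\uFF5E";
// U+1F600 is a supplementary character, stored as the surrogate pair
// 0xD83D 0xDE00.
const supplementary: string = "\uD83D\uDE00";

// By code unit, 0xD83D < 0xFF5E, so the supplementary character sorts first.
const codeUnitOrder = supplementary < bmp; // true
// By code point the order is reversed, since U+1F600 > U+FF5E.
const cpSupplementary: number = 0x1f600;
const cpBmp: number = 0xff5e;
const codePointOrder = cpSupplementary < cpBmp; // false

// Canonically equivalent strings are not treated as equal.
const precomposed: string = "\u00E9"; // e with acute as one code point
const decomposed: string = "e\u0301"; // e followed by COMBINING ACUTE ACCENT
const treatedAsEqual = precomposed === decomposed; // false
const decomposedFirst = decomposed < precomposed; // true, since 0x0065 < 0x00E9
```

So a key order built on this definition would be fully specified, but callers who care about canonical equivalence would have to normalise strings themselves before storing them.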

This is very much in line with what we've been discussing, and it has the added benefit of matching JavaScript's own string ordering.

So it looks like we could reference (or inline) this in the spec and have a fully specified order for keys with string content.

Thoughts? 

Thanks
-pablo

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf
Received on Friday, 17 June 2011 18:44:21 UTC