Re: [IndexedDB] Closing on bug 9903 (collations) from Jonas Sicking on 2011-05-06 (public-webapps@w3.org from April to June 2011)

From: Jonas Sicking <jonas@sicking.cc>
Date: Thu, 5 May 2011 19:00:29 -0700
To: Keean Schupke <keean@fry-it.com>
Cc: Aryeh Gregor <Simetrical+w3c@gmail.com>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <BANLkTimpX5sCta4uNOBDbOeeHJee8Bn+xA@mail.gmail.com>
On Wed, May 4, 2011 at 11:12 PM, Keean Schupke <keean@fry-it.com> wrote:
> On 5 May 2011 00:33, Aryeh Gregor <Simetrical+w3c@gmail.com> wrote:
>>
>> On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> > I don't think we should do callbacks for the first version of
>> > javascript. It gets very messy since we can't rely on that the script
>> > function will be returning stable values.
>>
>> The worst that would happen if it didn't return stable values is that
>> sorting would return unpredictable results.
>
> Worst is an infinite loop - no return.
>
>>
>> > So the choice here really is between only supporting some form of
>> > binary sorting, or supporting a built-in set of collations. Anything
>> > else will have to wait for version 2 in my opinion.
>>
>> I think it would be a mistake to try supporting a limited set of
>> natural-language collations.  Binary collation is fine for a first
>> version.  MySQL only supported binary collation up through version 4,
>> for instance.
>
> A good point about MySQL.
>
>>
>> On Wed, May 4, 2011 at 3:49 AM, Keean Schupke <keean@fry-it.com> wrote:
>> > I thought only the app that created the db could open it (for security
>> > reasons)... so it becomes the app's responsibility to do version
>> > control.
>> > The comparison function is not going to change by itself - someone has
>> > to go
>> > into the code and change it, when they do that they should up the
>> > revision
>> > of the database, if that change is incompatible.
>>
>> Why should we let such a pitfall exist if we can just store the
>> function and avoid the issue?
>
> I don't see it as a pitfall, it is an has the advantage of transparency.
>
>>
>> > There is exactly the same problem with object properties. If the app
>> > changes
>> > to expect a new property on all objects stored, then the app has to
>> > correctly deal with the update.
>>
>> If a requested property doesn't exist, I assume the API will fail
>> immediately with a clear error code.  It will not fail silently and
>> mysteriously with no error code.  (Again, I haven't looked at it
>> closely, or tried to use it.)
>
> What if the new version uses the same property name for a different thing?
> For example in V1 'Employer' is a string name, and in V2 'Employer' is a
> reference to another object. You may say 'you should change the column
> name'? Right thats just the same as me saying you should change the DB
> version number when you change the collation algorithm. Its the same thing.
> People seem to be making a big fuss about having a non-persisted collation
> function defined in user code, when many many things require the code to
> have the correct model of the data stored in the database to work properly.
> It seems illogical to make a special case for this function, and not do
> anything about all the other cases. IMHO either the database should have a
> stored schema, or it should not. If IndexedDB is going the direction of not
> having a stored schema, then the designers should have the confidence in
> their decision to stick with it and at least produce something with a
> consistent approach to the problem.
>
>>
>> > 2) making things easy for the user - for me a simpler more predictable
>> > API
>> > is better for the user. Having a function stored inside the database is
>> > bad,
>> > because you cannot see what function might be stored in there...
>>
>> We could let you query the stored function.
>
> Why would you need to read it. Every time you open the database you would
> need to check the function is the one you expect. The code would have to
> contain the function so it can compare it with the one in the DB and update
> it if necessary. If the code contains the function there are two copies of
> the function, one in the database and one in the code? which one is correct?
> which one is it using? So sometimes you will write the new function to the
> database, and sometimes you will not? More paths to test in code coverage,
> more complexity. Its simpler to just always set the function when opening
> the database.
>
>>
>> > it might be
>> > a function from a previous version of the code and cause all sorts of
>> > strange bugs (which will only affect certain users with a certain
>> > version of
>> > the function stored in their DB).
>>
>> It will cause *much* less strange bugs than if you have one index that
>> used two different collations, which is the alternative possibility.
>> If the function is stored, the worst case will be that the collation
>> function is out of date.  In practice, authors will mostly want to use
>> established collation functions like UCA and won't mind if they're out
>> of date.  They'll also only very rarely have occasion to deliberately
>> change the function.
>
> As I said, you will end up querying the function to see if it is the one you
> want to use, if you do that you may as well set it every time.
> Thinking about this a bit more. If you change the collation function you
> need to re-sort the index to make sure it will work (and avoid those strange
> bugs). Storing the function in the DB enables you to compare the function
> and only change it when you need to, thus optimising the number of re-sorts.
> That is the _only_ advantage to storing the function - as you still need to
> check the function stored is the one you expect to guarantee your code will
> run properly. So with a non-persisted function we need to sort every time we
> open to make sure the order is correct. However, if we attach a version
> number to the index, we can check the version number in out code to know if
> we need to resort the index. The simplest API for this would be:
> index.setCollation(1.1, my_collation_function);
> So the version number is checked against the index. If it is the same, the
> supplied collation function is used without re-sorting the index. If it is
> different the index order is checked/re-sorted. All you have to do is
> remember to up the version number. Local testing before rolling out the
> changes should catch failure to do so.

We have already decided that we don't want to take on the complexity
that comes with supporting changing collations on existing data. In
particular it becomes very unclear what to do with data that is no
longer unique under the new collation.

>> On Wed, May 4, 2011 at 4:01 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> > Browsers can certainly deal with this, and ensure that the only one
>> > suffering is the author of the buggy algorithm. However this comes at
>> > a cost in that the browser sorting algorithm can't go into infinite
>> > loops or crash even in the face of the most ridiculous comparison
>> > algorithm. In other words, the browser will likely have to use a
>> > slower sorting implementation in order to be robust.
>>
>> The browser will only run the function once every time the given field
>> changes, and change the value used in the index if it's different from
>> the current one.  The actual sorting will still be binary, just with a
>> user-provided key.  So there's no possibility of especially bad
>> effects if you're given a bad function.  You're only running it once
>> per value, so it's no worse than any other function that's run a bunch
>> of times.
>>
>> We aren't talking about a sort()-style comparison function that
>> returns -1 or 0 or 1.  We're talking about a function that takes a
>> string as input, and outputs a string to be used in the index as the
>> key for the object in question.  I guess you *could* also do it as a
>> comparison function too -- would probably be easier to write, but also
>> a lot easier to get badly wrong, and you'd have to do a bunch of
>> function calls on insert or update instead of just one.
>
> A comparison function would be a lot simpler for the user to write.

And a lot slower. For inserting N records in the database it'll take
in the order of N * log2(N) calls to the comparison function. For each
call you have to pay the penalty of crossing between languages as well
as rechecking all your state once you get back. You additionally have
to rely on users supplying the same collation function as well as
specifically signal to the API whenever

I think ultimately we simply seem to disagree here. I think that
supporting a standard set of collations is going to solve more than
80% of the use cases (which is a good rule of thumb for these things)
for version 1 as well as is easier on users and so something we'll
ultimately will want to add anyway. Thus adding it now won't be
painting us in a corner and it solves the majority of use cases.

If I understand you correctly you don't think that it solves the
majority of use cases and you think that it adds API which is bad and
that we should never add.

Is this a correct assessment?

/ Jonas
Received on Friday, 6 May 2011 02:01:27 UTC