Re: [IndexedDB] Closing on bug 9903 (collations) from Jonas Sicking on 2011-05-06 (public-webapps@w3.org from April to June 2011)

From: Jonas Sicking <jonas@sicking.cc>
Date: Fri, 6 May 2011 02:18:05 -0700
To: Keean Schupke <keean@fry-it.com>
Cc: Aryeh Gregor <Simetrical+w3c@gmail.com>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <BANLkTiktm2twfjs1b8ANm2TkKhhrq=W+xQ@mail.gmail.com>
On Thu, May 5, 2011 at 11:36 PM, Keean Schupke <keean@fry-it.com> wrote:
> On 6 May 2011 03:00, Jonas Sicking <jonas@sicking.cc> wrote:
>>
>> On Wed, May 4, 2011 at 11:12 PM, Keean Schupke <keean@fry-it.com> wrote:
>> > On 5 May 2011 00:33, Aryeh Gregor <Simetrical+w3c@gmail.com> wrote:
>> >>
>> >> On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> >> > I don't think we should do callbacks for the first version of
>> >> > javascript. It gets very messy since we can't rely on that the script
>> >> > function will be returning stable values.
>> >>
>> >> The worst that would happen if it didn't return stable values is that
>> >> sorting would return unpredictable results.
>> >
>> > Worst is an infinite loop - no return.
>> >
>> >>
>> >> > So the choice here really is between only supporting some form of
>> >> > binary sorting, or supporting a built-in set of collations. Anything
>> >> > else will have to wait for version 2 in my opinion.
>> >>
>> >> I think it would be a mistake to try supporting a limited set of
>> >> natural-language collations.  Binary collation is fine for a first
>> >> version.  MySQL only supported binary collation up through version 4,
>> >> for instance.
>> >
>> > A good point about MySQL.
>> >
>> >>
>> >> On Wed, May 4, 2011 at 3:49 AM, Keean Schupke <keean@fry-it.com> wrote:
>> >> > I thought only the app that created the db could open it (for
>> >> > security
>> >> > reasons)... so it becomes the app's responsibility to do version
>> >> > control.
>> >> > The comparison function is not going to change by itself - someone
>> >> > has
>> >> > to go
>> >> > into the code and change it, when they do that they should up the
>> >> > revision
>> >> > of the database, if that change is incompatible.
>> >>
>> >> Why should we let such a pitfall exist if we can just store the
>> >> function and avoid the issue?
>> >
>> > I don't see it as a pitfall, it is an has the advantage of transparency.
>> >
>> >>
>> >> > There is exactly the same problem with object properties. If the app
>> >> > changes
>> >> > to expect a new property on all objects stored, then the app has to
>> >> > correctly deal with the update.
>> >>
>> >> If a requested property doesn't exist, I assume the API will fail
>> >> immediately with a clear error code.  It will not fail silently and
>> >> mysteriously with no error code.  (Again, I haven't looked at it
>> >> closely, or tried to use it.)
>> >
>> > What if the new version uses the same property name for a different
>> > thing?
>> > For example in V1 'Employer' is a string name, and in V2 'Employer' is a
>> > reference to another object. You may say 'you should change the column
>> > name'? Right thats just the same as me saying you should change the DB
>> > version number when you change the collation algorithm. Its the same
>> > thing.
>> > People seem to be making a big fuss about having a non-persisted
>> > collation
>> > function defined in user code, when many many things require the code to
>> > have the correct model of the data stored in the database to work
>> > properly.
>> > It seems illogical to make a special case for this function, and not do
>> > anything about all the other cases. IMHO either the database should have
>> > a
>> > stored schema, or it should not. If IndexedDB is going the direction of
>> > not
>> > having a stored schema, then the designers should have the confidence in
>> > their decision to stick with it and at least produce something with a
>> > consistent approach to the problem.
>> >
>> >>
>> >> > 2) making things easy for the user - for me a simpler more
>> >> > predictable
>> >> > API
>> >> > is better for the user. Having a function stored inside the database
>> >> > is
>> >> > bad,
>> >> > because you cannot see what function might be stored in there...
>> >>
>> >> We could let you query the stored function.
>> >
>> > Why would you need to read it. Every time you open the database you
>> > would
>> > need to check the function is the one you expect. The code would have to
>> > contain the function so it can compare it with the one in the DB and
>> > update
>> > it if necessary. If the code contains the function there are two copies
>> > of
>> > the function, one in the database and one in the code? which one is
>> > correct?
>> > which one is it using? So sometimes you will write the new function to
>> > the
>> > database, and sometimes you will not? More paths to test in code
>> > coverage,
>> > more complexity. Its simpler to just always set the function when
>> > opening
>> > the database.
>> >
>> >>
>> >> > it might be
>> >> > a function from a previous version of the code and cause all sorts of
>> >> > strange bugs (which will only affect certain users with a certain
>> >> > version of
>> >> > the function stored in their DB).
>> >>
>> >> It will cause *much* less strange bugs than if you have one index that
>> >> used two different collations, which is the alternative possibility.
>> >> If the function is stored, the worst case will be that the collation
>> >> function is out of date.  In practice, authors will mostly want to use
>> >> established collation functions like UCA and won't mind if they're out
>> >> of date.  They'll also only very rarely have occasion to deliberately
>> >> change the function.
>> >
>> > As I said, you will end up querying the function to see if it is the one
>> > you
>> > want to use, if you do that you may as well set it every time.
>> > Thinking about this a bit more. If you change the collation function you
>> > need to re-sort the index to make sure it will work (and avoid those
>> > strange
>> > bugs). Storing the function in the DB enables you to compare the
>> > function
>> > and only change it when you need to, thus optimising the number of
>> > re-sorts.
>> > That is the _only_ advantage to storing the function - as you still need
>> > to
>> > check the function stored is the one you expect to guarantee your code
>> > will
>> > run properly. So with a non-persisted function we need to sort every
>> > time we
>> > open to make sure the order is correct. However, if we attach a version
>> > number to the index, we can check the version number in out code to know
>> > if
>> > we need to resort the index. The simplest API for this would be:
>> > index.setCollation(1.1, my_collation_function);
>> > So the version number is checked against the index. If it is the same,
>> > the
>> > supplied collation function is used without re-sorting the index. If it
>> > is
>> > different the index order is checked/re-sorted. All you have to do is
>> > remember to up the version number. Local testing before rolling out the
>> > changes should catch failure to do so.
>>
>> We have already decided that we don't want to take on the complexity
>> that comes with supporting changing collations on existing data. In
>> particular it becomes very unclear what to do with data that is no
>> longer unique under the new collation.
>>
>> >> On Wed, May 4, 2011 at 4:01 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> >> > Browsers can certainly deal with this, and ensure that the only one
>> >> > suffering is the author of the buggy algorithm. However this comes at
>> >> > a cost in that the browser sorting algorithm can't go into infinite
>> >> > loops or crash even in the face of the most ridiculous comparison
>> >> > algorithm. In other words, the browser will likely have to use a
>> >> > slower sorting implementation in order to be robust.
>> >>
>> >> The browser will only run the function once every time the given field
>> >> changes, and change the value used in the index if it's different from
>> >> the current one.  The actual sorting will still be binary, just with a
>> >> user-provided key.  So there's no possibility of especially bad
>> >> effects if you're given a bad function.  You're only running it once
>> >> per value, so it's no worse than any other function that's run a bunch
>> >> of times.
>> >>
>> >> We aren't talking about a sort()-style comparison function that
>> >> returns -1 or 0 or 1.  We're talking about a function that takes a
>> >> string as input, and outputs a string to be used in the index as the
>> >> key for the object in question.  I guess you *could* also do it as a
>> >> comparison function too -- would probably be easier to write, but also
>> >> a lot easier to get badly wrong, and you'd have to do a bunch of
>> >> function calls on insert or update instead of just one.
>> >
>> > A comparison function would be a lot simpler for the user to write.
>>
>> And a lot slower. For inserting N records in the database it'll take
>> in the order of N * log2(N) calls to the comparison function. For each
>> call you have to pay the penalty of crossing between languages as well
>> as rechecking all your state once you get back. You additionally have
>> to rely on users supplying the same collation function as well as
>> specifically signal to the API whenever
>>
>> I think ultimately we simply seem to disagree here. I think that
>> supporting a standard set of collations is going to solve more than
>> 80% of the use cases (which is a good rule of thumb for these things)
>> for version 1 as well as is easier on users and so something we'll
>> ultimately will want to add anyway. Thus adding it now won't be
>> painting us in a corner and it solves the majority of use cases.
>>
>> If I understand you correctly you don't think that it solves the
>> majority of use cases and you think that it adds API which is bad and
>> that we should never add.
>>
>> Is this a correct assessment?
>>
>> / Jonas
>
>
> I think it solves the majority of the use cases, but only if all browsers
> implement the same useful set of collations, and updates to that set are
> managed in a predictable / useful way across browsers in the future.  This
> still leaves the problem that some programs may not behave as intended by
> the author (after updating the collations) or the collations will not be up
> to date (with the latest CLDR). However in general I would be happy for
> something like the standard unicode sorting algorithm to be be pre-installed
> for the user.
> The second point is the API, I don't think its a bad API, but I do think its
> inelegant. Passing the index a sort-order mapping function (which was my
> original suggestion) or a comparison function (which I think may be easier
> for the average programmer to write), where this comparison function may be
> user supplied or provided by the browser. If you want to optimise for speed,
> observe that every function is unique for example:
> function a() {};
> function b() {};
> // a !== b
> So for the built in functions there only needs to be a pre-defined unique
> function object, and that unique ID can be used in the C++ code to directly
> use a C++ implementation of sort. So if you use the standard function there
> would be no call overhead - you only get the overhead if you use a user
> defined function. IMHO two different APIs is not a big problem, but why have
> two if one can do it all elegantly.
> So in summery, there are important concerns about managing updates to the
> collations going forward, then there are my personal feelings about what
> makes a good API.  I am prepared to justify my opinion on API design, and
> think its best to raise these issues, but I understand that other people may
> not share these views.

Yes, we could use a comparison function and supply a set of such
comparison functions from the browser and make sure to optimize the
case where the comparison function is one that is supplied by the
implementation.

That takes care of the performance problem, but only if you stick to
the feature set of the API that I'm proposing.

That still leaves the fact that the collation function has to be
provided every time the database is opened, which does not at all fit
with how the rest of the API works. And yes, I'm aware that you don't
like the way that the API works, but I think doing something inbetween
is the worst possible solution.

And it also doesn't handle the problem of what to do if someone does
provide a different collation function between open calls to the
database, which might have different ideas of which values compare
equal. Even if the author only makes use of the built-in collation
functions we still are left with the problem of what happens if that
function changes between open calls? Or worse, what happens if two
pages open the same database, but uses different collations in the two
pages?

So all in all, compared to what Pablo is proposing, your proposal only
adds support for a rare set of use cases, while not fitting with the
rest of the API, and introduces a whole set of edge cases that we need
to spend time defining and handling in the implementation.

Based on that, my conclusion is that we should go with what Pablo is
proposing. And I think we should do it for v1.

/ Jonas
Received on Friday, 6 May 2011 09:19:04 UTC