Re: [IndexedDB] Closing on bug 9903 (collations) from Keean Schupke on 2011-05-05 (public-webapps@w3.org from April to June 2011)

From: Keean Schupke <keean@fry-it.com>
Date: Thu, 5 May 2011 07:12:55 +0100
To: Aryeh Gregor <Simetrical+w3c@gmail.com>
Cc: Jonas Sicking <jonas@sicking.cc>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <BANLkTikf3-Y5aS7qoqak+sA_1hOa8O7eSA@mail.gmail.com>
On 5 May 2011 00:33, Aryeh Gregor <Simetrical+w3c@gmail.com> wrote:

> On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking <jonas@sicking.cc> wrote:
> > I don't think we should do callbacks for the first version of
> > javascript. It gets very messy since we can't rely on that the script
> > function will be returning stable values.
>
> The worst that would happen if it didn't return stable values is that
> sorting would return unpredictable results.
>

Worst is an infinite loop - no return.


>
> > So the choice here really is between only supporting some form of
> > binary sorting, or supporting a built-in set of collations. Anything
> > else will have to wait for version 2 in my opinion.
>
> I think it would be a mistake to try supporting a limited set of
> natural-language collations.  Binary collation is fine for a first
> version.  MySQL only supported binary collation up through version 4,
> for instance.
>

A good point about MySQL.


>
> On Wed, May 4, 2011 at 3:49 AM, Keean Schupke <keean@fry-it.com> wrote:
> > I thought only the app that created the db could open it (for security
> > reasons)... so it becomes the app's responsibility to do version control.
> > The comparison function is not going to change by itself - someone has to
> go
> > into the code and change it, when they do that they should up the
> revision
> > of the database, if that change is incompatible.
>
> Why should we let such a pitfall exist if we can just store the
> function and avoid the issue?
>

I don't see it as a pitfall, it is an has the advantage of transparency.


>
> > There is exactly the same problem with object properties. If the app
> changes
> > to expect a new property on all objects stored, then the app has to
> > correctly deal with the update.
>
> If a requested property doesn't exist, I assume the API will fail
> immediately with a clear error code.  It will not fail silently and
> mysteriously with no error code.  (Again, I haven't looked at it
> closely, or tried to use it.)
>

What if the new version uses the same property name for a different thing?
For example in V1 'Employer' is a string name, and in V2 'Employer' is a
reference to another object. You may say 'you should change the column
name'? Right thats just the same as me saying you should change the DB
version number when you change the collation algorithm. Its the same thing.
People seem to be making a big fuss about having a non-persisted collation
function defined in user code, when many many things require the code to
have the correct model of the data stored in the database to work properly.
It seems illogical to make a special case for this function, and not do
anything about all the other cases. IMHO either the database should have a
stored schema, or it should not. If IndexedDB is going the direction of not
having a stored schema, then the designers should have the confidence in
their decision to stick with it and at least produce something with a
consistent approach to the problem.


>
> > 2) making things easy for the user - for me a simpler more predictable
> API
> > is better for the user. Having a function stored inside the database is
> bad,
> > because you cannot see what function might be stored in there...
>
> We could let you query the stored function.
>

Why would you need to read it. Every time you open the database you would
need to check the function is the one you expect. The code would have to
contain the function so it can compare it with the one in the DB and update
it if necessary. If the code contains the function there are two copies of
the function, one in the database and one in the code? which one is correct?
which one is it using? So sometimes you will write the new function to the
database, and sometimes you will not? More paths to test in code coverage,
more complexity. Its simpler to just always set the function when opening
the database.


>
> > it might be
> > a function from a previous version of the code and cause all sorts of
> > strange bugs (which will only affect certain users with a certain version
> of
> > the function stored in their DB).
>
> It will cause *much* less strange bugs than if you have one index that
> used two different collations, which is the alternative possibility.
> If the function is stored, the worst case will be that the collation
> function is out of date.  In practice, authors will mostly want to use
> established collation functions like UCA and won't mind if they're out
> of date.  They'll also only very rarely have occasion to deliberately
> change the function.
>

As I said, you will end up querying the function to see if it is the one you
want to use, if you do that you may as well set it every time.

Thinking about this a bit more. If you change the collation function you
need to re-sort the index to make sure it will work (and avoid those strange
bugs). Storing the function in the DB enables you to compare the function
and only change it when you need to, thus optimising the number of re-sorts.
That is the _only_ advantage to storing the function - as you still need to
check the function stored is the one you expect to guarantee your code will
run properly. So with a non-persisted function we need to sort every time we
open to make sure the order is correct. However, if we attach a version
number to the index, we can check the version number in out code to know if
we need to resort the index. The simplest API for this would be:

index.setCollation(1.1, my_collation_function);

So the version number is checked against the index. If it is the same, the
supplied collation function is used without re-sorting the index. If it is
different the index order is checked/re-sorted. All you have to do is
remember to up the version number. Local testing before rolling out the
changes should catch failure to do so.


>
> On Wed, May 4, 2011 at 4:01 PM, Jonas Sicking <jonas@sicking.cc> wrote:
> > Browsers can certainly deal with this, and ensure that the only one
> > suffering is the author of the buggy algorithm. However this comes at
> > a cost in that the browser sorting algorithm can't go into infinite
> > loops or crash even in the face of the most ridiculous comparison
> > algorithm. In other words, the browser will likely have to use a
> > slower sorting implementation in order to be robust.
>
> The browser will only run the function once every time the given field
> changes, and change the value used in the index if it's different from
> the current one.  The actual sorting will still be binary, just with a
> user-provided key.  So there's no possibility of especially bad
> effects if you're given a bad function.  You're only running it once
> per value, so it's no worse than any other function that's run a bunch
> of times.
>
> We aren't talking about a sort()-style comparison function that
> returns -1 or 0 or 1.  We're talking about a function that takes a
> string as input, and outputs a string to be used in the index as the
> key for the object in question.  I guess you *could* also do it as a
> comparison function too -- would probably be easier to write, but also
> a lot easier to get badly wrong, and you'd have to do a bunch of
> function calls on insert or update instead of just one.
>

A comparison function would be a lot simpler for the user to write.


>
> > Additionally, there is a significant cost involved in transitioning
> > between the C++ code implementing the sorting algorithm, and the
> > javascript implemented callback. That is on top of the cost of
> > implementing the comparison function in javascript. Even in the best
> > JITs, there is a significant overhead to both these parts.
>
> It would only have to be run once per row (object?) modified.  Not run
> at all for reads.  Would that really be so bad?  Also, most authors
> would be content with built-in CLDR-based sort functions, which could
> be C++.
>

Maybe IO times will dominate, and the cost of the call to JavaScript even on
every comparison which should be called only log2(number of keys  per BTree
node) for each node, will not be on the critical timing path.


Cheers,
Keean.
Received on Thursday, 5 May 2011 06:13:24 UTC