Re: [IndexedDB] Closing on bug 9903 (collations) from Keean Schupke on 2011-05-06 (public-webapps@w3.org from April to June 2011)

From: Keean Schupke <keean@fry-it.com>
Date: Fri, 6 May 2011 12:09:38 +0100
To: Jonas Sicking <jonas@sicking.cc>
Cc: Aryeh Gregor <Simetrical+w3c@gmail.com>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <BANLkTikhAdFLX5KrKJa1HoriMWDSKDFR0w@mail.gmail.com>
On 6 May 2011 10:18, Jonas Sicking <jonas@sicking.cc> wrote:

> On Thu, May 5, 2011 at 11:36 PM, Keean Schupke <keean@fry-it.com> wrote:
> > On 6 May 2011 03:00, Jonas Sicking <jonas@sicking.cc> wrote:
> >>
> >> On Wed, May 4, 2011 at 11:12 PM, Keean Schupke <keean@fry-it.com>
> wrote:
> >> > On 5 May 2011 00:33, Aryeh Gregor <Simetrical+w3c@gmail.com> wrote:
> >> >>
> >> >> On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking <jonas@sicking.cc>
> wrote:
> >> >> > I don't think we should do callbacks for the first version of
> >> >> > javascript. It gets very messy since we can't rely on that the
> script
> >> >> > function will be returning stable values.
> >> >>
> >> >> The worst that would happen if it didn't return stable values is that
> >> >> sorting would return unpredictable results.
> >> >
> >> > Worst is an infinite loop - no return.
> >> >
> >> >>
> >> >> > So the choice here really is between only supporting some form of
> >> >> > binary sorting, or supporting a built-in set of collations.
> Anything
> >> >> > else will have to wait for version 2 in my opinion.
> >> >>
> >> >> I think it would be a mistake to try supporting a limited set of
> >> >> natural-language collations.  Binary collation is fine for a first
> >> >> version.  MySQL only supported binary collation up through version 4,
> >> >> for instance.
> >> >
> >> > A good point about MySQL.
> >> >
> >> >>
> >> >> On Wed, May 4, 2011 at 3:49 AM, Keean Schupke <keean@fry-it.com>
> wrote:
> >> >> > I thought only the app that created the db could open it (for
> >> >> > security
> >> >> > reasons)... so it becomes the app's responsibility to do version
> >> >> > control.
> >> >> > The comparison function is not going to change by itself - someone
> >> >> > has
> >> >> > to go
> >> >> > into the code and change it, when they do that they should up the
> >> >> > revision
> >> >> > of the database, if that change is incompatible.
> >> >>
> >> >> Why should we let such a pitfall exist if we can just store the
> >> >> function and avoid the issue?
> >> >
> >> > I don't see it as a pitfall; it has the advantage of
> transparency.
> >> >
> >> >>
> >> >> > There is exactly the same problem with object properties. If the
> app
> >> >> > changes
> >> >> > to expect a new property on all objects stored, then the app has to
> >> >> > correctly deal with the update.
> >> >>
> >> >> If a requested property doesn't exist, I assume the API will fail
> >> >> immediately with a clear error code.  It will not fail silently and
> >> >> mysteriously with no error code.  (Again, I haven't looked at it
> >> >> closely, or tried to use it.)
> >> >
> >> > What if the new version uses the same property name for a different
> >> > thing?
> >> > For example in V1 'Employer' is a string name, and in V2 'Employer' is
> a
> >> > reference to another object. You may say 'you should change the column
> >> > name'? Right, that's just the same as me saying you should change the DB
> >> > version number when you change the collation algorithm. It's the same
> >> > thing.
> >> > People seem to be making a big fuss about having a non-persisted
> >> > collation
> >> > function defined in user code, when many many things require the code
> to
> >> > have the correct model of the data stored in the database to work
> >> > properly.
> >> > It seems illogical to make a special case for this function, and not
> do
> >> > anything about all the other cases. IMHO either the database should
> have
> >> > a
> >> > stored schema, or it should not. If IndexedDB is going the direction
> of
> >> > not
> >> > having a stored schema, then the designers should have the confidence
> in
> >> > their decision to stick with it and at least produce something with a
> >> > consistent approach to the problem.
> >> >
> >> >>
> >> >> > 2) making things easy for the user - for me a simpler more
> >> >> > predictable
> >> >> > API
> >> >> > is better for the user. Having a function stored inside the
> database
> >> >> > is
> >> >> > bad,
> >> >> > because you cannot see what function might be stored in there...
> >> >>
> >> >> We could let you query the stored function.
> >> >
> >> > Why would you need to read it. Every time you open the database you
> >> > would
> >> > need to check the function is the one you expect. The code would have
> to
> >> > contain the function so it can compare it with the one in the DB and
> >> > update
> >> > it if necessary. If the code contains the function there are two
> copies
> >> > of
> >> > the function, one in the database and one in the code? which one is
> >> > correct?
> >> > which one is it using? So sometimes you will write the new function to
> >> > the
> >> > database, and sometimes you will not? More paths to test in code
> >> > coverage,
> >> > more complexity. It's simpler to just always set the function when
> >> > opening
> >> > the database.
> >> >
> >> >>
> >> >> > it might be
> >> >> > a function from a previous version of the code and cause all sorts
> of
> >> >> > strange bugs (which will only affect certain users with a certain
> >> >> > version of
> >> >> > the function stored in their DB).
> >> >>
> >> >> It will cause *much* less strange bugs than if you have one index
> that
> >> >> used two different collations, which is the alternative possibility.
> >> >> If the function is stored, the worst case will be that the collation
> >> >> function is out of date.  In practice, authors will mostly want to
> use
> >> >> established collation functions like UCA and won't mind if they're
> out
> >> >> of date.  They'll also only very rarely have occasion to deliberately
> >> >> change the function.
> >> >
> >> > As I said, you will end up querying the function to see if it is the
> one
> >> > you
> >> > want to use, if you do that you may as well set it every time.
> >> > Thinking about this a bit more. If you change the collation function
> you
> >> > need to re-sort the index to make sure it will work (and avoid those
> >> > strange
> >> > bugs). Storing the function in the DB enables you to compare the
> >> > function
> >> > and only change it when you need to, thus optimising the number of
> >> > re-sorts.
> >> > That is the _only_ advantage to storing the function - as you still
> need
> >> > to
> >> > check the function stored is the one you expect to guarantee your code
> >> > will
> >> > run properly. So with a non-persisted function we need to sort every
> >> > time we
> >> > open to make sure the order is correct. However, if we attach a
> version
> >> > number to the index, we can check the version number in our code to
> know
> >> > if
> >> > we need to resort the index. The simplest API for this would be:
> >> > index.setCollation(1.1, my_collation_function);
> >> > So the version number is checked against the index. If it is the same,
> >> > the
> >> > supplied collation function is used without re-sorting the index. If
> it
> >> > is
> >> > different the index order is checked/re-sorted. All you have to do is
> >> > remember to up the version number. Local testing before rolling out
> the
> >> > changes should catch failure to do so.
> >>
> >> We have already decided that we don't want to take on the complexity
> >> that comes with supporting changing collations on existing data. In
> >> particular it becomes very unclear what to do with data that is no
> >> longer unique under the new collation.
> >>
> >> >> On Wed, May 4, 2011 at 4:01 PM, Jonas Sicking <jonas@sicking.cc>
> wrote:
> >> >> > Browsers can certainly deal with this, and ensure that the only one
> >> >> > suffering is the author of the buggy algorithm. However this comes
> at
> >> >> > a cost in that the browser sorting algorithm can't go into infinite
> >> >> > loops or crash even in the face of the most ridiculous comparison
> >> >> > algorithm. In other words, the browser will likely have to use a
> >> >> > slower sorting implementation in order to be robust.
> >> >>
> >> >> The browser will only run the function once every time the given
> field
> >> >> changes, and change the value used in the index if it's different
> from
> >> >> the current one.  The actual sorting will still be binary, just with
> a
> >> >> user-provided key.  So there's no possibility of especially bad
> >> >> effects if you're given a bad function.  You're only running it once
> >> >> per value, so it's no worse than any other function that's run a
> bunch
> >> >> of times.
> >> >>
> >> >> We aren't talking about a sort()-style comparison function that
> >> >> returns -1 or 0 or 1.  We're talking about a function that takes a
> >> >> string as input, and outputs a string to be used in the index as the
> >> >> key for the object in question.  I guess you *could* also do it as a
> >> >> comparison function too -- would probably be easier to write, but
> also
> >> >> a lot easier to get badly wrong, and you'd have to do a bunch of
> >> >> function calls on insert or update instead of just one.
> >> >
> >> > A comparison function would be a lot simpler for the user to write.
> >>
> >> And a lot slower. For inserting N records in the database it'll take
> >> in the order of N * log2(N) calls to the comparison function. For each
> >> call you have to pay the penalty of crossing between languages as well
> >> as rechecking all your state once you get back. You additionally have
> >> to rely on users supplying the same collation function as well as
> >> specifically signal to the API whenever
> >>
> >> I think ultimately we simply seem to disagree here. I think that
> >> supporting a standard set of collations is going to solve more than
> >> 80% of the use cases (which is a good rule of thumb for these things)
> >> for version 1 as well as is easier on users and so something we'll
> >> ultimately want to add anyway. Thus adding it now won't be
> >> painting us in a corner and it solves the majority of use cases.
> >>
> >> If I understand you correctly you don't think that it solves the
> >> majority of use cases and you think that it adds API which is bad and
> >> that we should never add.
> >>
> >> Is this a correct assessment?
> >>
> >> / Jonas
> >
> >
> > I think it solves the majority of the use cases, but only if all browsers
> > implement the same useful set of collations, and updates to that set are
> > managed in a predictable / useful way across browsers in the future.
>  This
> > still leaves the problem that some programs may not behave as intended by
> > the author (after updating the collations) or the collations will not be
> up
> > to date (with the latest CLDR). However in general I would be happy for
> > something like the standard Unicode sorting algorithm to be
> pre-installed
> > for the user.
> > The second point is the API, I don't think it's a bad API, but I do think
> it's
> > inelegant. Passing the index a sort-order mapping function (which was my
> > original suggestion) or a comparison function (which I think may be
> easier
> > for the average programmer to write), where this comparison function may
> be
> > user supplied or provided by the browser. If you want to optimise for
> speed,
> > observe that every function is unique for example:
> > function a() {};
> > function b() {};
> > // a !== b
> > So for the built in functions there only needs to be a pre-defined unique
> > function object, and that unique ID can be used in the C++ code to
> directly
> > use a C++ implementation of sort. So if you use the standard function
> there
> > would be no call overhead - you only get the overhead if you use a user
> > defined function. IMHO two different APIs is not a big problem, but why
> have
> > two if one can do it all elegantly.
> > So in summary, there are important concerns about managing updates to the
> > collations going forward, then there are my personal feelings about what
> > makes a good API.  I am prepared to justify my opinion on API design, and
> > think it's best to raise these issues, but I understand that other people
> may
> > not share these views.
>
> Yes, we could use a comparison function and supply a set of such
> comparison functions from the browser and make sure to optimize the
> case where the comparison function is one that is supplied by the
> implementation.
>
> That takes care of the performance problem, but only if you stick to
> the feature set of the API that I'm proposing.
>
> That still leaves the fact that the collation function has to be
> provided every time the database is opened, which does not at all fit
> with how the rest of the API works. And yes, I'm aware that you don't
> like the way that the API works, but I think doing something in between
> is the worst possible solution.
>

So there is now a schema that has to be managed. To be able to work with my
relational library's declarative style, it must be possible to query all the
stored properties. Whilst not as clean IMHO as having it stateless, this
would still let me do what I need in the library layer.


>
> And it also doesn't handle the problem of what to do if someone does
> provide a different collation function between open calls to the
> database, which might have different ideas of which values compare
> equal. Even if the author only makes use of the built-in collation
> functions we still are left with the problem of what happens if that
> function changes between open calls? Or worse, what happens if two
> pages open the same database, but use different collations in the two
> pages?
>

I would recommend modular programming: provide a single JavaScript file,
included by every page, that does the open.
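
As a rough sketch of that suggestion, assuming the non-persisted collation
idea discussed in this thread: one shared file owns both the comparison
function and the open call, so two pages cannot open the same database with
different collations. The `backend.open` signature, its `collation` option,
and the names `openAppDatabase` and `compareCaseInsensitive` are all
hypothetical; nothing like this is in the IndexedDB draft.

```javascript
// db-open.js: hypothetical shared module included by every page.
// Pages call openAppDatabase() and never pass their own collation.

// Comparison function in the sort() style: returns -1, 0 or 1.
function compareCaseInsensitive(a, b) {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  if (x < y) return -1;
  if (x > y) return 1;
  return 0;
}

// `backend` is an assumption: any object exposing open(name, options).
// It stands in for an API that would accept a non-persisted collation.
function openAppDatabase(backend) {
  return backend.open("app-db", { collation: compareCaseInsensitive });
}
```

Since the collation lives in exactly one place, changing it is a change to
this file alone, which is also the natural point at which to bump the
database version.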


>
> So all in all, compared to what Pablo is proposing, your proposal only
> adds support for a rare set of use cases, while not fitting with the
> rest of the API, and introduces a whole set of edge cases that we need
> to spend time defining and handling in the implementation.
>

My proposal allows libraries to make sure that collations offered by
different backends are the same. If WebSQL offers a collation order (from
SQLite) that IndexedDB does not support, the library can provide an
implementation.
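
A minimal sketch of such a library-provided collation, using the key-mapping
form described earlier in the thread (a function from string to string, with
the index itself still sorted in binary order). The names `sortKey`,
`compareBinary` and `sortByKey` are made up for illustration, and the folding
rule (strip combining marks, then lower-case) is only a stand-in for a real
collation such as UCA.

```javascript
// Map a string to the key that would be stored in the index.
// Assumption: accent- and case-insensitive ordering is what the library wants.
function sortKey(s) {
  return s
    .normalize("NFD")                 // split letters from combining marks
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase();
}

// Binary (code unit) comparison, as a backend without collations would do.
function compareBinary(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Compute each key once, then sort on the keys alone.
function sortByKey(values) {
  return values
    .map((value) => ({ key: sortKey(value), value }))
    .sort((p, q) => compareBinary(p.key, q.key))
    .map((entry) => entry.value);
}

const names = ["Zebra", "apple", "\u00C9mile"];
console.log(names.slice().sort(compareBinary)); // plain binary order
console.log(sortByKey(names));                  // order under the mapped keys
```

The mapping runs once per value, so the library pays one call per insert or
update. The cost is that values mapping to the same key become
indistinguishable to the index, which matters for unique indexes.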


>
> Based on that, my conclusion is that we should go with what Pablo is
> proposing. And I think we should do it for v1.


> / Jonas
>

I agree with that for v1,

with this requirement: all the stored properties are readable.

and with this request: the ability to supply a collation function (that is
not persisted) for library authors writing libraries like relationalDB on
top of IndexedDB. On that basis, all the arguments about users writing
pages are moot, as I think we need to support library authors too.


Cheers,
Keean.
Received on Friday, 6 May 2011 11:10:08 UTC