Re: [IndexedDB] Closing on bug 9903 (collations) from Jonas Sicking on 2011-05-06 (public-webapps@w3.org from April to June 2011)

From: Jonas Sicking <jonas@sicking.cc>
Date: Fri, 6 May 2011 05:09:40 -0700
To: Keean Schupke <keean@fry-it.com>
Cc: Aryeh Gregor <Simetrical+w3c@gmail.com>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <BANLkTikCxFh5JsdNrYRikK_xPO2dCcxkag@mail.gmail.com>
On Fri, May 6, 2011 at 4:09 AM, Keean Schupke <keean@fry-it.com> wrote:
> On 6 May 2011 10:18, Jonas Sicking <jonas@sicking.cc> wrote:
>>
>> On Thu, May 5, 2011 at 11:36 PM, Keean Schupke <keean@fry-it.com> wrote:
>> > On 6 May 2011 03:00, Jonas Sicking <jonas@sicking.cc> wrote:
>> >>
>> >> On Wed, May 4, 2011 at 11:12 PM, Keean Schupke <keean@fry-it.com>
>> >> wrote:
>> >> > On 5 May 2011 00:33, Aryeh Gregor <Simetrical+w3c@gmail.com> wrote:
>> >> >>
>> >> >> On Tue, May 3, 2011 at 7:57 PM, Jonas Sicking <jonas@sicking.cc>
>> >> >> wrote:
>> >> >> > I don't think we should do callbacks for the first version of
>> >> >> > javascript. It gets very messy since we can't rely on the
>> >> >> > script
>> >> >> > function returning stable values.
>> >> >>
>> >> >> The worst that would happen if it didn't return stable values is
>> >> >> that
>> >> >> sorting would return unpredictable results.
>> >> >
>> >> > Worst is an infinite loop - no return.
>> >> >
>> >> >>
>> >> >> > So the choice here really is between only supporting some form of
>> >> >> > binary sorting, or supporting a built-in set of collations.
>> >> >> > Anything
>> >> >> > else will have to wait for version 2 in my opinion.
>> >> >>
>> >> >> I think it would be a mistake to try supporting a limited set of
>> >> >> natural-language collations.  Binary collation is fine for a first
>> >> >> version.  MySQL only supported binary collation up through version
>> >> >> 4,
>> >> >> for instance.
>> >> >
>> >> > A good point about MySQL.
>> >> >
>> >> >>
>> >> >> On Wed, May 4, 2011 at 3:49 AM, Keean Schupke <keean@fry-it.com>
>> >> >> wrote:
>> >> >> > I thought only the app that created the db could open it (for
>> >> >> > security
>> >> >> > reasons)... so it becomes the app's responsibility to do version
>> >> >> > control.
>> >> >> > The comparison function is not going to change by itself - someone
>> >> >> > has
>> >> >> > to go
>> >> >> > into the code and change it, when they do that they should up the
>> >> >> > revision
>> >> >> > of the database, if that change is incompatible.
>> >> >>
>> >> >> Why should we let such a pitfall exist if we can just store the
>> >> >> function and avoid the issue?
>> >> >
>> >> > I don't see it as a pitfall, it has the advantage of
>> >> > transparency.
>> >> >
>> >> >>
>> >> >> > There is exactly the same problem with object properties. If the
>> >> >> > app
>> >> >> > changes
>> >> >> > to expect a new property on all objects stored, then the app has
>> >> >> > to
>> >> >> > correctly deal with the update.
>> >> >>
>> >> >> If a requested property doesn't exist, I assume the API will fail
>> >> >> immediately with a clear error code.  It will not fail silently and
>> >> >> mysteriously with no error code.  (Again, I haven't looked at it
>> >> >> closely, or tried to use it.)
>> >> >
>> >> > What if the new version uses the same property name for a different
>> >> > thing?
>> >> > For example in V1 'Employer' is a string name, and in V2 'Employer'
>> >> > is a
>> >> > reference to another object. You may say 'you should change the
>> >> > column
>> >> > name'? Right, that's just the same as me saying you should change the
>> >> > DB
>> >> > version number when you change the collation algorithm. It's the same
>> >> > thing.
>> >> > People seem to be making a big fuss about having a non-persisted
>> >> > collation
>> >> > function defined in user code, when many many things require the code
>> >> > to
>> >> > have the correct model of the data stored in the database to work
>> >> > properly.
>> >> > It seems illogical to make a special case for this function, and not
>> >> > do
>> >> > anything about all the other cases. IMHO either the database should
>> >> > have
>> >> > a
>> >> > stored schema, or it should not. If IndexedDB is going the direction
>> >> > of
>> >> > not
>> >> > having a stored schema, then the designers should have the confidence
>> >> > in
>> >> > their decision to stick with it and at least produce something with a
>> >> > consistent approach to the problem.
>> >> >
>> >> >>
>> >> >> > 2) making things easy for the user - for me a simpler more
>> >> >> > predictable
>> >> >> > API
>> >> >> > is better for the user. Having a function stored inside the
>> >> >> > database
>> >> >> > is
>> >> >> > bad,
>> >> >> > because you cannot see what function might be stored in there...
>> >> >>
>> >> >> We could let you query the stored function.
>> >> >
>> >> > Why would you need to read it? Every time you open the database you
>> >> > would
>> >> > need to check the function is the one you expect. The code would have
>> >> > to
>> >> > contain the function so it can compare it with the one in the DB and
>> >> > update
>> >> > it if necessary. If the code contains the function there are two
>> >> > copies
>> >> > of
>> >> > the function, one in the database and one in the code. Which one is
>> >> > correct?
>> >> > Which one is it using? So sometimes you will write the new function
>> >> > to
>> >> > the
>> >> > database, and sometimes you will not? More paths to test in code
>> >> > coverage,
>> >> > more complexity. It's simpler to just always set the function when
>> >> > opening
>> >> > the database.
>> >> >
>> >> >>
>> >> >> > it might be
>> >> >> > a function from a previous version of the code and cause all sorts
>> >> >> > of
>> >> >> > strange bugs (which will only affect certain users with a certain
>> >> >> > version of
>> >> >> > the function stored in their DB).
>> >> >>
>> >> >> It will cause *much* less strange bugs than if you have one index
>> >> >> that
>> >> >> used two different collations, which is the alternative possibility.
>> >> >> If the function is stored, the worst case will be that the collation
>> >> >> function is out of date.  In practice, authors will mostly want to
>> >> >> use
>> >> >> established collation functions like UCA and won't mind if they're
>> >> >> out
>> >> >> of date.  They'll also only very rarely have occasion to
>> >> >> deliberately
>> >> >> change the function.
>> >> >
>> >> > As I said, you will end up querying the function to see if it is the
>> >> > one
>> >> > you
>> >> > want to use, if you do that you may as well set it every time.
>> >> > Thinking about this a bit more. If you change the collation function
>> >> > you
>> >> > need to re-sort the index to make sure it will work (and avoid those
>> >> > strange
>> >> > bugs). Storing the function in the DB enables you to compare the
>> >> > function
>> >> > and only change it when you need to, thus optimising the number of
>> >> > re-sorts.
>> >> > That is the _only_ advantage to storing the function - as you still
>> >> > need
>> >> > to
>> >> > check the function stored is the one you expect to guarantee your
>> >> > code
>> >> > will
>> >> > run properly. So with a non-persisted function we need to sort every
>> >> > time we
>> >> > open to make sure the order is correct. However, if we attach a
>> >> > version
>> >> > number to the index, we can check the version number in our code to
>> >> > know
>> >> > if
>> >> > we need to resort the index. The simplest API for this would be:
>> >> > index.setCollation(1.1, my_collation_function);
>> >> > So the version number is checked against the index. If it is the
>> >> > same,
>> >> > the
>> >> > supplied collation function is used without re-sorting the index. If
>> >> > it
>> >> > is
>> >> > different the index order is checked/re-sorted. All you have to do is
>> >> > remember to up the version number. Local testing before rolling out
>> >> > the
>> >> > changes should catch failure to do so.
>> >>
>> >> We have already decided that we don't want to take on the complexity
>> >> that comes with supporting changing collations on existing data. In
>> >> particular it becomes very unclear what to do with data that is no
>> >> longer unique under the new collation.
>> >>
>> >> >> On Wed, May 4, 2011 at 4:01 PM, Jonas Sicking <jonas@sicking.cc>
>> >> >> wrote:
>> >> >> > Browsers can certainly deal with this, and ensure that the only
>> >> >> > one
>> >> >> > suffering is the author of the buggy algorithm. However this comes
>> >> >> > at
>> >> >> > a cost in that the browser sorting algorithm can't go into
>> >> >> > infinite
>> >> >> > loops or crash even in the face of the most ridiculous comparison
>> >> >> > algorithm. In other words, the browser will likely have to use a
>> >> >> > slower sorting implementation in order to be robust.
>> >> >>
>> >> >> The browser will only run the function once every time the given
>> >> >> field
>> >> >> changes, and change the value used in the index if it's different
>> >> >> from
>> >> >> the current one.  The actual sorting will still be binary, just with
>> >> >> a
>> >> >> user-provided key.  So there's no possibility of especially bad
>> >> >> effects if you're given a bad function.  You're only running it once
>> >> >> per value, so it's no worse than any other function that's run a
>> >> >> bunch
>> >> >> of times.
>> >> >>
>> >> >> We aren't talking about a sort()-style comparison function that
>> >> >> returns -1 or 0 or 1.  We're talking about a function that takes a
>> >> >> string as input, and outputs a string to be used in the index as the
>> >> >> key for the object in question.  I guess you *could* also do it as a
>> >> >> comparison function too -- would probably be easier to write, but
>> >> >> also
>> >> >> a lot easier to get badly wrong, and you'd have to do a bunch of
>> >> >> function calls on insert or update instead of just one.
>> >> >
>> >> > A comparison function would be a lot simpler for the user to write.
>> >>
>> >> And a lot slower. For inserting N records in the database it'll take
>> >> in the order of N * log2(N) calls to the comparison function. For each
>> >> call you have to pay the penalty of crossing between languages as well
>> >> as rechecking all your state once you get back. You additionally have
>> >> to rely on users supplying the same collation function as well as
>> >> specifically signal to the API whenever
>> >>
>> >> I think ultimately we simply seem to disagree here. I think that
>> >> supporting a standard set of collations is going to solve more than
>> >> 80% of the use cases (which is a good rule of thumb for these things)
>> >> for version 1, is easier on users, and is something we'll
>> >> ultimately want to add anyway. Thus adding it now won't be
>> >> painting us into a corner and it solves the majority of use cases.
>> >>
>> >> If I understand you correctly you don't think that it solves the
>> >> majority of use cases and you think that it adds API which is bad and
>> >> that we should never add.
>> >>
>> >> Is this a correct assessment?
>> >>
>> >> / Jonas
>> >
>> >
>> > I think it solves the majority of the use cases, but only if all
>> > browsers
>> > implement the same useful set of collations, and updates to that set are
>> > managed in a predictable / useful way across browsers in the future.
>> >  This
>> > still leaves the problem that some programs may not behave as intended
>> > by
>> > the author (after updating the collations) or the collations will not be
>> > up
>> > to date (with the latest CLDR). However in general I would be happy for
>> > something like the standard Unicode sorting algorithm to be
>> > pre-installed
>> > for the user.
>> > The second point is the API, I don't think it's a bad API, but I do think
>> > it's
>> > inelegant. Passing the index a sort-order mapping function (which was my
>> > original suggestion) or a comparison function (which I think may be
>> > easier
>> > for the average programmer to write), where this comparison function may
>> > be
>> > user supplied or provided by the browser. If you want to optimise for
>> > speed,
>> > observe that every function is unique for example:
>> > function a() {};
>> > function b() {};
>> > // a !== b
>> > So for the built in functions there only needs to be a pre-defined
>> > unique
>> > function object, and that unique ID can be used in the C++ code to
>> > directly
>> > use a C++ implementation of sort. So if you use the standard function
>> > there
>> > would be no call overhead - you only get the overhead if you use a user
>> > defined function. IMHO two different APIs is not a big problem, but why
>> > have
>> > two if one can do it all elegantly?
>> > So in summary, there are important concerns about managing updates to
>> > the
>> > collations going forward, then there are my personal feelings about what
>> > makes a good API.  I am prepared to justify my opinion on API design,
>> > and
>> > think its best to raise these issues, but I understand that other people
>> > may
>> > not share these views.
>>
>> Yes, we could use a comparison function and supply a set of such
>> comparison functions from the browser and make sure to optimize the
>> case where the comparison function is one that is supplied by the
>> implementation.
>>
>> That takes care of the performance problem, but only if you stick to
>> the feature set of the API that I'm proposing.
>>
>> That still leaves the fact that the collation function has to be
>> provided every time the database is opened, which does not at all fit
>> with how the rest of the API works. And yes, I'm aware that you don't
>> like the way that the API works, but I think doing something in between
>> is the worst possible solution.
>
> So there is now a schema that has to be managed. To be able to work with my
> relational library's declarative style, it must be possible to query all the
> stored properties. Whilst not as clean IMHO as having it stateless, this
> would still let me do what I need in the library layer.

Of course. As you've surely noticed, that is already the case with the
APIs that are in the spec.

>> And it also doesn't handle the problem of what to do if someone does
>> provide a different collation function between open calls to the
>> database, which might have different ideas of which values compare
>> equal. Even if the author only makes use of the built-in collation
>> functions we still are left with the problem of what happens if that
>> function changes between open calls? Or worse, what happens if two
>> pages open the same database, but use different collations in the two
>> pages?
>
> I would recommend using modular programming and providing a javascript file
> to include which does the open for each page.

Sure, we can recommend that for web developers. However, the questions I
posed were about what IndexedDB implementations should do, i.e. what
behavior in these scenarios would we define in the spec?
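
To make the scenario concrete, here is a minimal sketch of what can go
wrong when one sorted index is read with a collation other than the one
it was built with. It assumes an index is just a sorted array searched
by binary search. The names indexLookup, binaryCompare and
caseInsensitiveCompare are hypothetical helpers for illustration, not
IndexedDB API.

```javascript
// Hypothetical helpers for illustration only; none of this is IndexedDB API.

// Two comparators that disagree on ordering and on which values are equal.
const binaryCompare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const caseInsensitiveCompare = (a, b) =>
  binaryCompare(a.toLowerCase(), b.toLowerCase());

// Plain binary search over an array assumed to be sorted by `compare`.
function indexLookup(sortedKeys, key, compare) {
  let lo = 0;
  let hi = sortedKeys.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const c = compare(sortedKeys[mid], key);
    if (c === 0) return mid;
    if (c < 0) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1; // not found
}

// "Page one" builds the index with the case-insensitive comparator.
const keys = ["apple", "Banana", "cherry", "Date", "elderberry"];
const index = [...keys].sort(caseInsensitiveCompare);

// "Page two" searches the same index with the binary comparator.
const foundWithSameCollation =
  indexLookup(index, "Banana", caseInsensitiveCompare); // position 1
const foundWithOtherCollation =
  indexLookup(index, "Banana", binaryCompare); // -1: the key is missed

// The two comparators also disagree about uniqueness.
const equalInsensitive = caseInsensitiveCompare("Date", "date") === 0; // true
const equalBinary = binaryCompare("Date", "date") === 0; // false
```

In this sketch the key is present but the second lookup does not find
it, and no error is raised. The uniqueness disagreement at the end is
the same problem noted earlier for changing collations on existing data.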

>> So all in all, compared to what Pablo is proposing, your proposal only
>> adds support for a rare set of use cases, while not fitting with the
>> rest of the API, and introduces a whole set of edge cases that we need
>> to spend time defining and handling in the implementation.
>
> My proposal allows libraries to make sure that collations offered by
> different backends are the same.   If WebSQL offers a collation order (from
> SQLite) that IndexedDB does not support, the library can provide an
> implementation.
>
>>
>> Based on that, my conclusion is that we should go with what Pablo is
>> proposing. And I think we should do it for v1.
>>
>> / Jonas
>
> I agree with that for v1,
> with this requirement: all the stored properties are readable.

Of course. I don't believe anyone has suggested anything else.

> and with this request: the ability to supply a collation function (that is
> not persisted) for library authors writing libraries like relationalDB over
> the top of IndexedDB. On that basis, all the arguments about users writing
> pages are moot, as I think we need to support library authors too.

This sounds like a feature we should look at for v2 for sure.
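
In the meantime, a library could probably approximate a custom collation
on top of plain binary ordering by using the key-mapping idea from
earlier in the thread: derive a sort key once per record and order by
that key. The following is only a rough sketch under that assumption.
sortKey and the record shape are hypothetical, and the accent and case
folding shown is a simplification, not UCA.

```javascript
// Hypothetical sketch; sortKey and the record shape are illustrative only.

// Maps a string to a key whose plain binary order approximates a
// case- and accent-insensitive order (a simplification, not UCA).
function sortKey(s) {
  return s
    .normalize("NFD")                 // split letters from combining marks
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase();
}

const binaryCompare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

const names = ["Zo\u00eb", "alice", "\u00c9mile", "bob", "\u00c9lodie", "Zoe"];

// Approach 1: one sortKey call per record, then pure binary ordering.
let keyCalls = 0;
const records = names.map((name) => {
  keyCalls++;
  return { name, key: sortKey(name) };
});
records.sort((a, b) => binaryCompare(a.key, b.key));
const orderedByKey = records.map((r) => r.name);

// Approach 2: a comparison callback invoked by the sort itself.
let compareCalls = 0;
const orderedByComparator = [...names].sort((a, b) => {
  compareCalls++;
  return binaryCompare(sortKey(a), sortKey(b));
});
```

Both approaches give the same order here, but the first calls user code
exactly once per record, while the second calls it once per comparison
the sort chooses to make. That is the cost difference behind the
N * log2(N) estimate above.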

/ Jonas
Received on Friday, 6 May 2011 12:10:38 UTC