Re: [IndexedDB] Closing on bug 9903 (collations) from Aryeh Gregor on 2011-05-05 (public-webapps@w3.org from April to June 2011)

From: Aryeh Gregor <Simetrical+w3c@gmail.com>
Date: Thu, 5 May 2011 19:22:27 -0400
To: Keean Schupke <keean@fry-it.com>
Cc: Jonas Sicking <jonas@sicking.cc>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <BANLkTik-b6=vqSA_agaFYbB=6ApYz7PMmA@mail.gmail.com>

On Thu, May 5, 2011 at 2:12 AM, Keean Schupke <keean@fry-it.com> wrote:
> What if the new version uses the same property name for a different thing?

Yes, obviously it's going to be possible for code changes to cause
hard-to-catch bugs due to not updating the database correctly.  But we
shouldn't add more cases where that's possible than necessary without
good reason.  Maybe there's good reason here, but the added potential
for error can't be neglected as a cost.

> Why would you need to read it. Every time you open the database you would
> need to check the function is the one you expect.

Not if you never intend to change it, or don't care if it's outdated.
I expect this to be the most common case.

Consider the case of someone using CLDR-tailored UCA and a new version
comes out.  You want to use the newest version for new indexes, if
multiple versions are available, but there's no pressing need to
automatically update existing indexes.  The old version is almost
certainly good enough, unless your users use obscure languages.  So in
my scheme, you can just update the function in your code and do
nothing else.  In your scheme, you'd have to either stick to the old
version across the board, or include both versions in your code
indefinitely and include out-of-band logic to choose between them, or
write a script that rebuilds the whole index on update (which would
take a long time for a large index).

> The code would have to
> contain the function so it can compare it with the one in the DB and update
> it if necessary. If the code contains the function there are two copies of
> the function, one in the database and one in the code? which one is correct?
> which one is it using? So sometimes you will write the new function to the
> database, and sometimes you will not? More paths to test in code coverage,
> more complexity. It's simpler to just always set the function when opening
> the database.

If the collation function is stored in the database, then I'd expect
setting the function to rebuild the index if the new and old functions
differ.  This could happen as a background operation, with the
existing index still usable (with the old collation function) in the
meantime.  So if you always wanted collations up-to-date, in my scheme
authors could just set the function every time they open the database,
as with your scheme.  But this could trigger a silent rebuild whenever
necessary, so the author doesn't have to worry about it.  In your
scheme, the author has to do the rebuild himself, and if he gets it
wrong, the index will be corrupted.
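
To make the stored-function scheme concrete, here is a minimal
in-memory sketch.  Everything in it is hypothetical: the class and
method names are not IndexedDB API, "same function" is assumed to mean
identical source text, and finishRebuild() stands in for whatever
background task a real engine would run.

```javascript
// Hypothetical in-memory model of an index that persists its collation
// (sortkey) function. None of these names are part of IndexedDB.
function buildEntries(values, sortKeyFn) {
  return values
    .map((value) => ({ key: sortKeyFn(value), value }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

class StoredCollationIndex {
  constructor(sortKeyFn, values = []) {
    this.sortKeyFn = sortKeyFn; // stands in for the stored function
    this.entries = buildEntries(values, sortKeyFn);
    this.pending = null; // replacement function awaiting a rebuild
  }

  // Assumption: two functions are "the same" if their source text matches.
  // Returns true if a rebuild was scheduled, false if nothing changed.
  setCollation(sortKeyFn) {
    if (sortKeyFn.toString() === this.sortKeyFn.toString()) return false;
    this.pending = sortKeyFn; // the old index stays usable meanwhile
    return true;
  }

  // Stand-in for the background rebuild; swaps in the new index when done.
  finishRebuild() {
    if (!this.pending) return;
    const values = this.entries.map((e) => e.value);
    this.entries = buildEntries(values, this.pending);
    this.sortKeyFn = this.pending;
    this.pending = null;
  }

  values() {
    return this.entries.map((e) => e.value);
  }
}

const byCodeUnit = (s) => s;
const caseInsensitive = (s) => s.toLowerCase();

const index = new StoredCollationIndex(byCodeUnit, ["b", "C", "a"]);
index.setCollation(caseInsensitive); // schedules a rebuild
console.log(index.values());         // still served in the old order
index.finishRebuild();
console.log(index.values());         // now in the new order
```

The point of the sketch is that the author never touches the index
order directly: setting an unchanged function is a no-op, and setting
a changed one can only lead to a consistent rebuilt index.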

So as I see it, my approach is easier to use across the board.  It
lets you not update collations on old tables without requiring you to
keep track of multiple collation function versions, and it also
potentially lets you update collations on old tables to the latest
versions with rebuilding done for you in the background.  Critically,
it does not let you change a sort function without rebuilding, since
that will always cause bugs and you never want to do it (to a first
approximation).

Of course, maybe an initial implementation wouldn't do rebuilds for
you, to keep it simple.  Then the collation function would be
immutable after index creation, so you'd still have to do rebuilds
yourself.  But it would still be easier and safer: the old index will
still work in the interim even if you don't have the old version of
your collation function around, and you can't mess up and get a
corrupted index.

> Thinking about this a bit more. If you change the collation function you
> need to re-sort the index to make sure it will work (and avoid those strange
> bugs). Storing the function in the DB enables you to compare the function
> and only change it when you need to, thus optimising the number of re-sorts.
> That is the _only_ advantage to storing the function - as you still need to
> check the function stored is the one you expect to guarantee your code will
> run properly. So with a non-persisted function we need to sort every time we
> open to make sure the order is correct.

And this is totally impractical for even moderately large datasets.  I
assume we want this to be usable for databases of, say, a gigabyte in
size.  You're not going to read, sort, and write a gigabyte on every
database open.

(My experience tends more toward multi-gigabyte databases or bigger,
including writing code for Wikipedia, which is multi-terabyte.  So
maybe I'm biased to think about scalability more than necessary for
IndexedDB, but re-sorting the index on every open still sounds really
impractical to me.)

> However, if we attach a version
> number to the index, we can check the version number in our code to know if
> we need to resort the index. The simplest API for this would be:
> index.setCollation(1.1, my_collation_function);
> So the version number is checked against the index. If it is the same, the
> supplied collation function is used without re-sorting the index. If it is
> different the index order is checked/re-sorted. All you have to do is
> remember to up the version number. Local testing before rolling out the
> changes should catch failure to do so.

So then what happens if it's not up-to-date?  You don't do any updates
to the object store until you've sorted and rewritten the index with
the new collation function?  Even on fairly modest data sets, that
could easily take ten seconds or more, which is not latency we want to
impose unnecessarily.  In contrast, if the collation function is
stored in the DB, the existing index will still work until a new index
is built.  It could even be done automatically, as I noted.
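
For comparison, the version-number proposal quoted above can be
modeled roughly as follows.  This is only a sketch under my own
assumptions: the class is hypothetical, only the version is treated as
persisted, and the re-sort is done synchronously to stand in for the
blocking step.

```javascript
// Hypothetical model of the quoted setCollation(version, fn) proposal.
// Not an IndexedDB API; only the version is persisted, not the function.
class VersionedIndex {
  constructor(version, compareFn, values) {
    this.version = version;
    this.compareFn = compareFn;
    this.values = [...values].sort(compareFn);
    this.resortCount = 0;
  }

  setCollation(version, compareFn) {
    this.compareFn = compareFn;
    if (version !== this.version) {
      // Blocking step: the whole index is re-sorted before further use.
      this.values.sort(compareFn);
      this.version = version;
      this.resortCount += 1;
    }
  }

  // True if the stored order agrees with the function currently in use.
  isConsistent() {
    return this.values.every(
      (v, i) => i === 0 || this.compareFn(this.values[i - 1], v) <= 0
    );
  }
}

const codeUnitOrder = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const ignoreCase = (a, b) => codeUnitOrder(a.toLowerCase(), b.toLowerCase());

const idx = new VersionedIndex(1.0, codeUnitOrder, ["b", "C", "a"]);
idx.setCollation(1.0, ignoreCase); // function changed, version not bumped
console.log(idx.isConsistent());   // stored order no longer matches
idx.setCollation(1.1, ignoreCase); // bumping the version forces a re-sort
console.log(idx.isConsistent());
```

In this model, forgetting to bump the version leaves the index ordered
by a function that is no longer in use, and bumping it makes every
caller wait for the full re-sort.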

> A comparison function would be a lot simpler for the user to write.

It depends.  For instance, for basic English collation, you can get
the sortkey by stripping punctuation and diacritics and lowercasing
everything.  That's easier to write as a sortkey-generating function
than a comparison function.  In fact, if you were writing a comparison
function, you'd probably do it by normalizing both values and then
doing a binary comparison -- basically generating a sortkey in the end
anyway.
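
As a rough illustration of that point, here is what such a sortkey
function might look like in JavaScript.  It is a sketch only: the
function names are mine, it assumes an engine with
String.prototype.normalize and Unicode property escapes, and it is
nowhere near a real UCA implementation.

```javascript
// Rough English-only sortkey: strip diacritics and punctuation, lowercase.
function sortKey(text) {
  return text
    .normalize("NFD")                 // split letters from combining marks
    .replace(/\p{M}/gu, "")           // drop the combining marks
    .replace(/[^\p{L}\p{N}\s]/gu, "") // drop punctuation and symbols
    .toLowerCase();
}

// A comparison function built the way described above: normalize both
// values, then do a plain binary comparison of the results.
function compare(a, b) {
  const ka = sortKey(a);
  const kb = sortKey(b);
  return ka < kb ? -1 : ka > kb ? 1 : 0;
}

console.log(sortKey("Café!"));
console.log(["Zebra", "apple", "Éclair"].sort(compare));
```

Note that compare() is just sortKey() applied twice plus a comparison,
which is the sense in which the comparison function ends up generating
a sortkey anyway.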

On the other hand, it's very easy to get comparison functions wrong,
for instance by not making them consistent from one call to the next.
That mistake has been made by organizations as large and professional
as Microsoft:

http://arstechnica.com/microsoft/news/2010/03/coding-error-leads-to-uneven-eu-browser-ballot-distribution.ars

You can't mess up a sortkey generation function in the same way.
Plus, you only have to evaluate it once per insert or update instead
of log(N) times.  But it is harder to understand, I grant.  And while
you can trivially rewrite a sortkey function as a comparison function,
the reverse isn't true at all.  So I'm not totally sure.
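
A small sketch of both claims, under my own assumptions: the helper
names are hypothetical, and a sorted array with binary-search insert
stands in for whatever tree structure a real index would use.

```javascript
// Wrap any sortkey function as a comparison function (the easy direction).
function comparatorFromSortKey(keyFn) {
  return (a, b) => {
    const ka = keyFn(a);
    const kb = keyFn(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  };
}

// Binary-search insert into a sorted array, counting comparator calls.
function insertWithComparator(sorted, value, compareFn) {
  let calls = 0;
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    calls += 1;
    if (compareFn(sorted[mid], value) <= 0) lo = mid + 1;
    else hi = mid;
  }
  sorted.splice(lo, 0, value);
  return calls;
}

const lowerKey = (s) => s.toLowerCase();
const compareLower = comparatorFromSortKey(lowerKey);

// 1024 already-sorted entries: "k0000" ... "k1023".
const existing = Array.from({ length: 1024 }, (_, i) =>
  "k" + String(i).padStart(4, "0")
);
const calls = insertWithComparator(existing, "K0500x", compareLower);
console.log(calls); // comparator calls needed for this one insert
```

With a stored sortkey, the same insert would call lowerKey once and
then compare plain strings; with a comparison function, user code runs
on every step of the search.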
Received on Thursday, 5 May 2011 23:23:18 UTC