Re: [IndexedDB] Closing on bug 9903 (collations) from Keean Schupke on 2011-05-06 (public-webapps@w3.org from April to June 2011)

From: Keean Schupke <keean@fry-it.com>
Date: Fri, 6 May 2011 08:07:10 +0100
To: Aryeh Gregor <Simetrical+w3c@gmail.com>
Cc: Jonas Sicking <jonas@sicking.cc>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <BANLkTikK5h-8u7kURDFPp_xTt05FKC_9mA@mail.gmail.com>
On 6 May 2011 00:22, Aryeh Gregor <Simetrical+w3c@gmail.com> wrote:

> On Thu, May 5, 2011 at 2:12 AM, Keean Schupke <keean@fry-it.com> wrote:
> > What if the new version uses the same property name for a different
> thing?
>
> Yes, obviously it's going to be possible for code changes to cause
> hard-to-catch bugs due to not updating the database correctly.  We
> don't have to add more cases where that's possible than necessary,
> without good reason.  Maybe there's good reason here, but the added
> potential for error can't be neglected as a cost.
>

I have seen many bugs in real databases due to stored procedures.


>
> > Why would you need to read it. Every time you open the database you would
> > need to check the function is the one you expect.
>
> Not if you never intend to change it, or don't care if it's outdated.
> I expect this to be the most common case.
>

People don't change the language setting in an application?


>
> Consider the case of someone using CLDR-tailored UCA and a new version
> comes out.  You want to use the newest version for new indexes, if
> multiple versions are available, but there's no pressing need to
> automatically update existing indexes.  The old version is almost
> certainly good enough, unless your users use obscure languages.  So in
> my scheme, you can just update the function in your code and do
> nothing else.  In your scheme, you'd have to either stick to the old
> version across the board, or include both versions in your code
> indefinitely and include out-of-band logic to choose between them, or
> write a script that rebuilds the whole index on update (which would
> take a long time for a large index).
>

At least then the logic to chose between collations is visible in the code,
rather than hidden. This is all about transparency and making sure the
programmer has control of what is happening, rather than locking them into
limiting patterns, and giving them the ability to see exactly what the code
will do by reading and code-reviewing it.

With a stored procedure, what happens when a function you call (that is not
stored) changes?

The only way to be sure is to run a validation check in the index (run from
beginning to end checking the order is consistent with the comparison
function). That is the same whether you use stores procedures or not.


>
> > The code would have to
> > contain the function so it can compare it with the one in the DB and
> update
> > it if necessary. If the code contains the function there are two copies
> of
> > the function, one in the database and one in the code? which one is
> correct?
> > which one is it using? So sometimes you will write the new function to
> the
> > database, and sometimes you will not? More paths to test in code
> coverage,
> > more complexity. Its simpler to just always set the function when opening
> > the database.
>
> If the collation function is stored in the database, then I'd expect
> setting the function to rebuild the index if the new and old functions
> differ.  This could happen as a background operation, with the
> existing index still usable (with the old collation function) in the
> meantime.  So if you always wanted collations up-to-date, in my scheme
> authors could just set the function every time they open the database,
> as with your scheme.  But this could trigger a silent rebuild whenever
> necessary, so the author doesn't have to worry about it.  In your
> scheme, the author has to do the rebuild himself, and if he gets it
> wrong, the index will be corrupted.
>
> So as I see it, my approach is easier to use across the board.  It
> lets you not update collations on old tables without requiring you to
> keep track of multiple collation function versions, and it also
> potentially lets you update collations on old tables to the latest
> versions with rebuilding done for you in the background.  Critically,
> it does not let you change a sort function without rebuilding, since
> that will always cause bugs and you never want to do it (to a first
> approximation).
>
> Of course, maybe an initial implementation wouldn't do rebuilds for
> you, to keep it simple.  Then the collation function would be
> immutable after index creation, so you'd still have to do rebuilds
> yourself.  But it would still be easier and safer: the old index will
> still work in the interim even if you don't have the old version of
> your collation function around, and you can't mess up and get a
> corrupted index.
>
> > Thinking about this a bit more. If you change the collation function you
> > need to re-sort the index to make sure it will work (and avoid those
> strange
> > bugs). Storing the function in the DB enables you to compare the function
> > and only change it when you need to, thus optimising the number of
> re-sorts.
> > That is the _only_ advantage to storing the function - as you still need
> to
> > check the function stored is the one you expect to guarantee your code
> will
> > run properly. So with a non-persisted function we need to sort every time
> we
> > open to make sure the order is correct.
>
> And this is totally impractical for even moderately large datasets.  I
> assume we want this to be usable for databases of, say, a gigabyte in
> size.  You're not going to read, sort, and write a gigabyte on every
> database open.
>

You just have to read-only scan the index once to validate the order (the
most usual case).


>
> (My experience tends more toward multi-gigabyte databases or bigger,
> including writing code for Wikipedia, which is multi-terabyte.  So
> maybe I'm biased to think about scalability more than necessary for
> IndexedDB, but resorting the index on every index still sounds really
> impractical to me.)
>

I have written a high bandwidth redundant multi-terabyte key-value store for
one of the large mobile telecoms companies, so I understand your concerns. I
wouldn't be using IndexedDB in a web-browser for this sort of application.
However an index re-scan to validate against creeping-corruption is not a
bad idea. This scan is sequential and read only so need not take that long.


>
> > However, if we attach a version
> > number to the index, we can check the version number in out code to know
> if
> > we need to resort the index. The simplest API for this would be:
> > index.setCollation(1.1, my_collation_function);
> > So the version number is checked against the index. If it is the same,
> the
> > supplied collation function is used without re-sorting the index. If it
> is
> > different the index order is checked/re-sorted. All you have to do is
> > remember to up the version number. Local testing before rolling out the
> > changes should catch failure to do so.
>
> So then what happens if it's not up-to-date?  You don't do any updates
> to the object store until you've sorted and rewritten the index with
> the new collation function?  Even on fairly modest data sets, that
> could easily take ten seconds or more, which is not latency we want to
> impose unnecessarily.  In contrast, if the collation function is
> stored in the DB, the existing index will still work until a new index
> is built.  It could even be done automatically, as I noted.
>

I still think the problems associated with stored procedures are worse.


>
> > A comparison function would be a lot simpler for the user to write.
>
> It depends.  For instance, for basic English collation, you can get
> the sortkey by stripping punctuation and diacritics and lowercasing
> everything.  That's easier to write as a sortkey-generating function
> than a comparison function.  In fact, if you were writing a comparison
> function, you'd probably do it by normalizing both values and then
> doing a binary comparison -- basically generating a sortkey in the end
> anyway.
>
> On the other hand, it's very easy to get comparison functions wrong,
> like by not making them stable.  That mistake has been made by
> organizations as large and professional as Microsoft:
>
>
> http://arstechnica.com/microsoft/news/2010/03/coding-error-leads-to-uneven-eu-browser-ballot-distribution.ars
>
> You can't mess up a sortkey generation function in the same way.
> Plus, you only have to evaluate it once per insert or update instead
> of log(N) times.  But it is harder to understand, I grant.  And while
> you can trivially rewrite a sortkey function as a comparison function,
> the reverse isn't true at all.  So I'm not totally sure.
>

I think you could mess up sortkey generation just as badly. A function that
maps all keys to the same sortkey is pretty broken. Sortkey generation is
not trival. Consider unicode UTF-8 strings where there are multi-byte
characters that need to be included in the order.


Cheers,
Keean.
Received on Friday, 6 May 2011 07:07:39 UTC