- From: Aryeh Gregor <Simetrical+w3c@gmail.com>
- Date: Thu, 5 May 2011 19:22:27 -0400
- To: Keean Schupke <keean@fry-it.com>
- Cc: Jonas Sicking <jonas@sicking.cc>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
On Thu, May 5, 2011 at 2:12 AM, Keean Schupke <keean@fry-it.com> wrote: > What if the new version uses the same property name for a different thing? Yes, obviously it's going to be possible for code changes to cause hard-to-catch bugs due to not updating the database correctly. We don't have to add more cases where that's possible than necessary, without good reason. Maybe there's good reason here, but the added potential for error can't be neglected as a cost. > Why would you need to read it. Every time you open the database you would > need to check the function is the one you expect. Not if you never intend to change it, or don't care if it's outdated. I expect this to be the most common case. Consider the case of someone using CLDR-tailored UCA and a new version comes out. You want to use the newest version for new indexes, if multiple versions are available, but there's no pressing need to automatically update existing indexes. The old version is almost certainly good enough, unless your users use obscure languages. So in my scheme, you can just update the function in your code and do nothing else. In your scheme, you'd have to either stick to the old version across the board, or include both versions in your code indefinitely and include out-of-band logic to choose between them, or write a script that rebuilds the whole index on update (which would take a long time for a large index). > The code would have to > contain the function so it can compare it with the one in the DB and update > it if necessary. If the code contains the function there are two copies of > the function, one in the database and one in the code? which one is correct? > which one is it using? So sometimes you will write the new function to the > database, and sometimes you will not? More paths to test in code coverage, > more complexity. Its simpler to just always set the function when opening > the database. If the collation function is stored in the database, then I'd expect setting the function to rebuild the index if the new and old functions differ. This could happen as a background operation, with the existing index still usable (with the old collation function) in the meantime. So if you always wanted collations up-to-date, in my scheme authors could just set the function every time they open the database, as with your scheme. But this could trigger a silent rebuild whenever necessary, so the author doesn't have to worry about it. In your scheme, the author has to do the rebuild himself, and if he gets it wrong, the index will be corrupted. So as I see it, my approach is easier to use across the board. It lets you not update collations on old tables without requiring you to keep track of multiple collation function versions, and it also potentially lets you update collations on old tables to the latest versions with rebuilding done for you in the background. Critically, it does not let you change a sort function without rebuilding, since that will always cause bugs and you never want to do it (to a first approximation). Of course, maybe an initial implementation wouldn't do rebuilds for you, to keep it simple. Then the collation function would be immutable after index creation, so you'd still have to do rebuilds yourself. But it would still be easier and safer: the old index will still work in the interim even if you don't have the old version of your collation function around, and you can't mess up and get a corrupted index. > Thinking about this a bit more. If you change the collation function you > need to re-sort the index to make sure it will work (and avoid those strange > bugs). Storing the function in the DB enables you to compare the function > and only change it when you need to, thus optimising the number of re-sorts. > That is the _only_ advantage to storing the function - as you still need to > check the function stored is the one you expect to guarantee your code will > run properly. So with a non-persisted function we need to sort every time we > open to make sure the order is correct. And this is totally impractical for even moderately large datasets. I assume we want this to be usable for databases of, say, a gigabyte in size. You're not going to read, sort, and write a gigabyte on every database open. (My experience tends more toward multi-gigabyte databases or bigger, including writing code for Wikipedia, which is multi-terabyte. So maybe I'm biased to think about scalability more than necessary for IndexedDB, but resorting the index on every index still sounds really impractical to me.) > However, if we attach a version > number to the index, we can check the version number in out code to know if > we need to resort the index. The simplest API for this would be: > index.setCollation(1.1, my_collation_function); > So the version number is checked against the index. If it is the same, the > supplied collation function is used without re-sorting the index. If it is > different the index order is checked/re-sorted. All you have to do is > remember to up the version number. Local testing before rolling out the > changes should catch failure to do so. So then what happens if it's not up-to-date? You don't do any updates to the object store until you've sorted and rewritten the index with the new collation function? Even on fairly modest data sets, that could easily take ten seconds or more, which is not latency we want to impose unnecessarily. In contrast, if the collation function is stored in the DB, the existing index will still work until a new index is built. It could even be done automatically, as I noted. > A comparison function would be a lot simpler for the user to write. It depends. For instance, for basic English collation, you can get the sortkey by stripping punctuation and diacritics and lowercasing everything. That's easier to write as a sortkey-generating function than a comparison function. In fact, if you were writing a comparison function, you'd probably do it by normalizing both values and then doing a binary comparison -- basically generating a sortkey in the end anyway. On the other hand, it's very easy to get comparison functions wrong, like by not making them stable. That mistake has been made by organizations as large and professional as Microsoft: http://arstechnica.com/microsoft/news/2010/03/coding-error-leads-to-uneven-eu-browser-ballot-distribution.ars You can't mess up a sortkey generation function in the same way. Plus, you only have to evaluate it once per insert or update instead of log(N) times. But it is harder to understand, I grant. And while you can trivially rewrite a sortkey function as a comparison function, the reverse isn't true at all. So I'm not totally sure.
Received on Thursday, 5 May 2011 23:23:18 UTC