Re: [IndexedDB] Current editor's draft from Jonas Sicking on 2010-07-15 (public-webapps@w3.org from July to September 2010)

From: Jonas Sicking <jonas@sicking.cc>
Date: Wed, 14 Jul 2010 17:42:54 -0700
To: Pablo Castro <Pablo.Castro@microsoft.com>
Cc: Andrei Popescu <andreip@google.com>, Nikunj Mehta <nikunj@o-micron.com>, public-webapps <public-webapps@w3.org>
Message-ID: <AANLkTinLIx_de3lJBngJ2mu2XVEOfIPGTCyWtYEX6x58@mail.gmail.com>
On Wed, Jul 14, 2010 at 5:03 PM, Pablo Castro
<Pablo.Castro@microsoft.com> wrote:
>
> From: Jonas Sicking [mailto:jonas@sicking.cc]
> Sent: Wednesday, July 14, 2010 12:07 AM
>
>>> > Dynamic transactions:
>>> > I see that most folks would like to see these going away. While I like the predictability and simplifications that we're able to make by using static scopes for transactions, I worry that we'll close the door for two scenarios: background tasks and query processors. Background tasks such as synchronization and post-processing of content would seem to be almost impossible with the static scope approach, mostly due to the granularity of the scope specification (whole stores). Are we okay with saying that you can't for example sync something in the background (e.g. in a worker) while your app is still working? Am I missing something that would enable this class of scenarios? Query processors are also tricky because you usually take the query specification in some form after the transaction started (especially if you want to execute multiple queries with later queries depending on the outcome of the previous ones). The background tasks issue in particular looks pretty painful to me if we don't have a way to achieve it without freezing the application while it happens.
>
>>> I don't understand enough of the details here to be able to make a
>>> decision. The use cases you are bringing up I definitely agree are
>>> important, but I would love to look at even a rough draft of what code
>>> you are expecting people will need to write.
>
> I'll try and hack up and example. In general any scenario that has a worker and the UI thread working on the same database will be quite a challenge, because the worker will have to a) split the work in small pieces, even if it was naturally a bigger chunk and b) consider interleaving implications with the UI thread, otherwise even when split in chunks you're not guaranteed that one of the two will starve the other one (the worker running on a tight loop will effectively always have an active transaction, it'll be just changing the actual transaction from time to time). This can certainly happen with dynamic transactions as well, the only difference is that since the locking granularity is different, it may be that what you're working on in the worker and in the UI threads is independent enough that they don't interfere too much, allowing for some more concurrency.

I think what I'm struggling with is how dynamic transactions will help
since they are still doing whole-objectStore locking. I'm also curious
how you envision people dealing with deadlock hazards. Nikunjs
examples in the beginning of this thread simply throw up their hands
and report an error if there was a deadlock. That is obviously not
good enough for an actual application.

So in short, looking forward to an example :)

>>> > Implicit commit:
>>> > Does this really work? I need to play with sample app code more, it may just be that I'm old-fashioned. For example, if I'm downloading a bunch of data form somewhere and pushing rows into the store within a transaction, wouldn't it be reasonable to do the whole thing in a transaction? In that case I'm likely to have to unwind while I wait for the next callback from XmlHttpRequest with the next chunk of data.
>
>>> You definitely want to do it in a transaction. In our proposal there
>>> is no way to even call .get or .put if you aren't inside a
>>> transaction. For the case you are describing, you'd download the data
>>> using XMLHttpRequest first. Once the data has been downloaded you
>>> start a transaction, parse the data, and make the desired
>>> modifications. Once that is done the transaction is automatically
>>> committed.
>>>
>>> The idea here is to avoid keeping transactions open for long periods
>>> of time, while at the same time making the API easier to work with.
>>> I'm very concerned that any API that requires people to do:
>>>
>>> startOperation();
>>>... do lots of stuff here ...
>>> endOperation();
>>>
>>> people will forget to do the endOperation call. This is especially
>>> true if the startOperation/endOperation calls are spread out over
>>> multiple different asynchronously called functions, which seems to be
>>> the use case you're concerned about above. One very easy way to
>>> "forget" to call endOperation is if something inbetween the two
>>> function calls throw an exception.
>
> Fair enough, maybe I need to think of this scenario differently, and if someone needs to download a bunch of data and then put it in the database atomically the right way is to download to work tables first over a long time and independent transactions, and then use a transaction only to move the data around into its final spot.

Yeah, I think that this is what we want to encourage.

>>> This will likely be extra bad for transactions where no write
>>> operations are done. In this case failure to call a 'commit()'
>>> function won't result in any broken behavior. The transaction will
>>> just sit open for a long time and eventually "rolled back", though
>>> since no changes were done, the rollback is transparent, and the only
>>> noticeable effect is that the application halts for a while while the
>>> transaction is waiting to time out.
>>>
>>> I should add that the WebSQLDatabase uses automatically committing
>>> transactions very similar to what we're proposing, and it seems to
>>> have worked fine there.
>
> I find this a bit scary, although it could be that I'm permanently tainted with traditional database stuff. Typical databases follow a presumed abort protocol, where if your code is interrupted by an exception, a process crash or whatever, you can always assume transactions will be rolled back if you didn't reach an explicit call to commit. The implicit commit here takes that away, and I'm not sure how safe that is.
>
> For example, if I don't have proper exception handling in place, an illegal call to some other non-indexeddb related API may throw an exception causing the whole thing to unwind, at which point nothing will be pending to do in the database and thus the currently active transaction will be committed.
>
> Using the same line of thought we used for READ_ONLY, forgetting to call commit() is easy to detect the first time you try out your code. Your changes will simply not stick. It's not as clear as the READ_ONLY example because there is no opportunity to throw an explicit exception with an explanation, but the data not being around will certainly prompt developers to look for the issue :)

Ah, I see where we are differing in thinking. My main concern has been
that of rollbacks, and associated dataloss, in the non-error case. For
example people forget to call commit() in some branch of their code,
thus causing dataloss when the transaction is rolled back.

Your concern seems to be that of lack of rollback in the error case,
for example when an exception is thrown and not caught somewhere in
the code. In this case you'd want to have the transaction rolled back.

One way to handle this is to try to detect unhandled errors and
implicitly roll back the transaction. Two situations where we could do
this is:
1. When an 'error' event is fired, but where .preventDefault() has is
not called by any handler. The result is that if an error is ever
fired, but no one explicitly handles it, we roll back the transaction.
See also below.
2. When a success handler is called, but the handler throws an exception.

The second is a bit of a problem from a spec point of view. I'm not
sure it is allowed by the DOM Events spec, or by all existing DOM
Events implementations. I do still think we can pull it off though.
This is something I've been thinking about raising for a while, but I
wanted to nail down the raised issues first.

Would you feel more comfortable with implicit commits if we did the above?

>>> And as you say, you still usually need error callbacks. In fact, we
>>> have found while writing examples using our implementation, that you
>>> almost always want to add a generic error handler. It's very easy to
>>> make a mistake, and if you don't add error handlers then these just go
>>> by silently, offering no help as to why your program isn't working.
>>> Though possibly a better implementation could put information in the
>>> developer console if it detected that an error event was fired but
>>> no-one was listening.
>
> Somewhat unrelated, but I wonder if we should consider a global (per database session) error handler or something like that. Database operations are just too granular, so maybe the usual deal where you setup an error handler per-operation is not the right thing to do.

This is a great idea. What we could do is to first fire the 'error'
event on the IDBRequest. If .preventDefault() is not called, we'll
fire an 'error' event on the IDBTransaction. If .preventDefault()
still hasn't been called, we fire an 'error' event on the IDBDatabase.
If .preventDefault() *still* hasn't been called, we roll back the
transaction.

This is very similar to error handling in Workers. It might be overly
complex implementation wise, but it does seem like a nice API for
authors.

>>> > Nested transactions:
>>> > Not sure why we're considering this an advanced scenario. To be clear about what the feature means to me: make it legal to start a transaction when one is already in progress, and the nested one is effectively a no-op, just refcounts the transaction, so you need equal amounts of commit()'s, implicit or explicit, and an abort() cancels all nested transactions. The purpose of this is to allow composition, where a piece of code that needs a transaction can start one locally, independently of whether the caller had already one going.
>
>>> Ah. I generally though of nested transactions as the ability to roll
>>> back just an "inner" transaction, while keeping the changes made by an
>>> outer one. Your version of nested transactions would be a lot easier
>>> to implement I suspect.
>
> No, that one is much fancier, I wasn't talking about that (at least in SQL Server we call those save points to distinguish from simple nesting).
>
>>> I take it the reason one would want to start a nested transaction,
>>> rather than simply create a new one, is to ensure that all locks are
>>> being kept held, and that no other changes from another transaction
>>> can slip in between?
>
> You can't create a new one because you'd get isolated from the work done so far. Usually when you write re-usable functions or libraries what you want is the work in the library to happen in a transacted way, either your own or the existing one if someone calls you with an active one.
>
>>> If so, wouldn't the API require that the outer transaction is somehow
>>> referenced (through an function argument or otherwise)? And if so,
>>> couldn't the inner could simply use the outer transaction? This seems
>>> to be the case in the current drafts.
>
> This might work, it depends on what's not reachable from the transaction object. Usually you want to pass 1 thing to your reusable function/library. Passing the database object ensures that you'll have everything you may need about the database accessible. If you pass a transaction, then you can't do things that depend on you having the database object (e.g. create range objects). As I write this I'm realizing that perhaps this is just a matter of including a pointer to the database in the transaction.

Is there anything you need that isn't on the IDBTransaction object?
Depending on what information it is that is needed, maybe we can
simply add more things to IDBTransaction.

In any case, we already have the IDBTransaction.db property, which is
the pointer to the database from the transaction that you are
suggesting. Would that be enough?

>>> > Schema versioning:
>>> > It's unfortunate that we need to have explicit elements in the page for the versioning protocol to work, but the fact that we can have a reliable mechanism for pages to coordinate a version bump is really nice. For folks that don't know about this the first time they build it, an explicit error message on the schema change timeout can explain where to start. I do think that there may be a need for non-breaking changes to the schema to happen without a "version dance". For example, query processors regularly create temporary tables during sorts and such. Those shouldn't require any coordination (maybe we allow non-versioned additions, or we just introduce temporary, unnamed tables that evaporate on commit() or database close()...).
>
>>> If we do need support for temporary objectStores, I think we should
>>> simply add an API like:
>>>
>>> interface IDBTransaction {
>>>   ...
>>>   IDBObjectStore createTemporaryObjectStore();
>>>   ...
>>> };
>>>
>>> I.e. do the unnamed evaporating objectStores that go away on commit.
>>>
>>> That way the application doesn't have to worry about name collisions
>>> if two transactions run at the same time, or worry about forgetting to
>>> call removeObjectStore at the end of the transaction. We could also
>>> allow temporary objectStores be written to, even if the transaction
>>> was opened in READ_ONLY mode.
>>>
>>> In fact, since a temporary objectStore generally should be removed at
>>> the end of a transaction, if we used the normal createObjectStore,
>>> wouldn't we have to require a "breaking" version-change transaction
>>> since removeObjectStore is called at the end of the transaction? I.e.
>>> temporary object stores can't be implemented using add-only
>>> non-versioned transactions.
>
> Sold!

Yay! :)

>>> However I'd love to hear about when temporary objectStores are used. I
>>> know I've seen them in evaluation strategies created by SQL databases,
>>> but I wasn't able to come up with an example off the top of my head
>>> for a use case that would require them.
>
> Yes, it tends to be SQL-ish scenarios, even if SQL is not involved. For example, if you need to sort a large set you may want to sort it in blocks and keep the blocks on this. Note that this is not just about running out of memory, but also about not getting even close to it and make the rest of the system slow down because of lack of resources.

Can you clarify what you mean by "sort it in blocks and keep the
blocks on this". Not sure it that sentence got cut off?

/ Jonas
Received on Thursday, 15 July 2010 00:43:48 UTC