RE: [IndexedDB] Current editor's draft

From: Jonas Sicking [mailto:jonas@sicking.cc] 
Sent: Wednesday, July 14, 2010 12:07 AM

>> > Dynamic transactions:
>> > I see that most folks would like to see these going away. While I like the predictability and simplifications that we're able to make by using static scopes for transactions, I worry that we'll close the door for two scenarios: background tasks and query processors. Background tasks such as synchronization and post-processing of content would seem to be almost impossible with the static scope approach, mostly due to the granularity of the scope specification (whole stores). Are we okay with saying that you can't for example sync something in the background (e.g. in a worker) while your app is still working? Am I missing something that would enable this class of scenarios? Query processors are also tricky because you usually take the query specification in some form after the transaction started (especially if you want to execute multiple queries with later queries depending on the outcome of the previous ones). The background tasks issue in particular looks pretty painful to me if we don't have a way to achieve it without freezing the application while it happens.

>> I don't understand enough of the details here to be able to make a
>> decision. The use cases you are bringing up I definitely agree are
>> important, but I would love to look at even a rough draft of what code
>> you are expecting people will need to write.

I'll try and hack up an example. In general, any scenario that has a worker and the UI thread working on the same database will be quite a challenge, because the worker will have to a) split the work into small pieces, even if it was naturally a bigger chunk, and b) consider interleaving implications with the UI thread; otherwise, even when the work is split into chunks, there's no guarantee that one of the two won't starve the other (a worker running in a tight loop will effectively always have an active transaction; it'll just be changing the actual transaction from time to time). This can certainly happen with dynamic transactions as well; the only difference is that since the locking granularity is different, what you're working on in the worker and in the UI thread may be independent enough that they don't interfere too much, allowing for some more concurrency.
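
Here's roughly the shape I'd expect, as a first cut (a sketch only: the transaction API and constant names approximate the current draft, and the store name and batch size are made up):

  // Hypothetical worker loop: process a large backlog in small batches,
  // one short-lived transaction per batch.
  function syncNextBatch(db, items, offset) {
    var BATCH = 50;                        // a guess; needs tuning
    if (offset >= items.length) return;    // all done

    var tx = db.transaction(["contacts"], IDBTransaction.READ_WRITE);
    var store = tx.objectStore("contacts");
    for (var i = offset; i < offset + BATCH && i < items.length; i++) {
      store.put(items[i]);
    }
    tx.oncomplete = function() {
      // Yield before starting the next batch; without this the worker
      // re-acquires the store immediately and can still starve the UI.
      setTimeout(function() { syncNextBatch(db, items, offset + BATCH); }, 0);
    };
  }

Note that both problems are visible here: the natural "one big write" has to be sliced up by hand, and fairness between the two threads hangs entirely on that setTimeout.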

>> What I suggest is that we keep dynamic transactions in the spec for
>> now, but separate the API from static transactions, start a separate
>> thread and try to hammer out the details and see what we arrive at. I
>> do want to clarify that I don't think dynamic transactions are
>> particularly hard to implement, I just suspect they are hard to use
>> correctly.

Sounds reasonable.

>> > Implicit commit:
>> > Does this really work? I need to play with sample app code more; it may just be that I'm old-fashioned. For example, if I'm downloading a bunch of data from somewhere and pushing rows into the store within a transaction, wouldn't it be reasonable to do the whole thing in a transaction? In that case I'm likely to have to unwind while I wait for the next callback from XmlHttpRequest with the next chunk of data.

>> You definitely want to do it in a transaction. In our proposal there
>> is no way to even call .get or .put if you aren't inside a
>> transaction. For the case you are describing, you'd download the data
>> using XMLHttpRequest first. Once the data has been downloaded you
>> start a transaction, parse the data, and make the desired
>> modifications. Once that is done the transaction is automatically
>> committed.
>>
>> The idea here is to avoid keeping transactions open for long periods
>> of time, while at the same time making the API easier to work with.
>> I'm very concerned that with any API that requires people to do:
>>
>> startOperation();
>>... do lots of stuff here ...
>> endOperation();
>>
>> people will forget to do the endOperation call. This is especially
>> true if the startOperation/endOperation calls are spread out over
>> multiple different asynchronously called functions, which seems to be
>> the use case you're concerned about above. One very easy way to
>> "forget" to call endOperation is if something in between the two
>> function calls throws an exception.

Fair enough; maybe I need to think of this scenario differently. If someone needs to download a bunch of data and then put it in the database atomically, the right way is to download into work tables first, over a long time and using independent transactions, and then use a transaction only to move the data into its final spot.
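
Something like this, say (again a sketch; store names are made up, and I'm assuming the draft's auto-commit behavior):

  // Phase 1: stream chunks into a scratch store as they arrive, one
  // small independent transaction per chunk, over however long it takes.
  function saveChunk(db, rows, done) {
    var tx = db.transaction(["staging"], IDBTransaction.READ_WRITE);
    var store = tx.objectStore("staging");
    for (var i = 0; i < rows.length; i++) store.put(rows[i]);
    tx.oncomplete = done;
  }

  // Phase 2: a single transaction over both stores moves the data into
  // its final spot atomically.
  function promote(db) {
    var tx = db.transaction(["staging", "items"], IDBTransaction.READ_WRITE);
    tx.objectStore("staging").openCursor().onsuccess = function(e) {
      var cursor = e.target.result;
      if (!cursor) return;              // done; transaction auto-commits
      tx.objectStore("items").put(cursor.value);
      cursor.delete();                  // clean out the staging row
      cursor.continue();
    };
  }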

>> This will likely be extra bad for transactions where no write
>> operations are done. In this case failure to call a 'commit()'
>> function won't result in any broken behavior. The transaction will
>> just sit open for a long time and eventually be "rolled back", though
>> since no changes were done, the rollback is transparent, and the only
>> noticeable effect is that the application halts for a while while the
>> transaction is waiting to time out.
>>
>> I should add that the WebSQLDatabase uses automatically committing
>> transactions very similar to what we're proposing, and it seems to
>> have worked fine there.

I find this a bit scary, although it could be that I'm permanently tainted with traditional database stuff. Typical databases follow a presumed abort protocol, where if your code is interrupted by an exception, a process crash or whatever, you can always assume transactions will be rolled back if you didn't reach an explicit call to commit. The implicit commit here takes that away, and I'm not sure how safe that is.

For example, if I don't have proper exception handling in place, an illegal call to some other non-IndexedDB-related API may throw an exception, causing the whole thing to unwind; at that point nothing will be pending in the database, and so the currently active transaction will be committed.
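
Concretely, the kind of thing that worries me (sketch; someUnrelatedApi is a stand-in for any code that can throw):

  var tx = db.transaction(["accounts"], IDBTransaction.READ_WRITE);
  var store = tx.objectStore("accounts");
  store.put({id: 1, balance: 90});    // the debit half of a transfer...

  someUnrelatedApi();                 // throws, unwinding the function
                                      // before the credit is ever queued

  store.put({id: 2, balance: 110});   // never reached

  // Once the first put completes there is nothing pending, so the
  // implicit commit persists the debit without the credit. Presumed
  // abort would have rolled the whole thing back instead.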

Using the same line of thought we used for READ_ONLY, forgetting to call commit() is easy to detect the first time you try out your code. Your changes will simply not stick. It's not as clear as the READ_ONLY example because there is no opportunity to throw an explicit exception with an explanation, but the data not being around will certainly prompt developers to look for the issue :)

>> And as you say, you still usually need error callbacks. In fact, we
>> have found while writing examples using our implementation, that you
>> almost always want to add a generic error handler. It's very easy to
>> make a mistake, and if you don't add error handlers then these just go
>> by silently, offering no help as to why your program isn't working.
>> Though possibly a better implementation could put information in the
>> developer console if it detected that an error event was fired but
>> no-one was listening.

Somewhat unrelated, but I wonder if we should consider a global (per database session) error handler or something like that. Database operations are just too granular, so maybe the usual deal where you set up an error handler per operation is not the right thing to do.
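
For example (purely hypothetical; nothing like this is in the draft today), if error events bubbled from individual requests up to the connection, a generic handler would be one line:

  db.onerror = function(event) {
    // Last-resort handler for any request that didn't install its own.
    console.log("IndexedDB error: " + event.target.errorCode);
  };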

>> > Nested transactions:
>> > Not sure why we're considering this an advanced scenario. To be clear about what the feature means to me: make it legal to start a transaction when one is already in progress, where the nested one is effectively a no-op that just refcounts the transaction, so you need an equal number of commit()s, implicit or explicit, and an abort() cancels all nested transactions. The purpose of this is to allow composition, where a piece of code that needs a transaction can start one locally, independently of whether the caller already had one going.

>> Ah. I generally thought of nested transactions as the ability to roll
>> back just an "inner" transaction, while keeping the changes made by an
>> outer one. Your version of nested transactions would be a lot easier
>> to implement I suspect.

No, that one is much fancier; I wasn't talking about that (at least in SQL Server we call those savepoints, to distinguish them from simple nesting).
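
To pin down the semantics I mean, here's a sketch of the bookkeeping (illustration only, not proposed API):

  // Refcount-style nesting: begin() on an active transaction just bumps
  // a counter; only the outermost commit() really commits, and abort()
  // at any depth dooms the whole thing.
  function NestedTx(realTx) {
    this.tx = realTx;        // the single real transaction underneath
    this.depth = 1;
    this.doomed = false;
  }
  NestedTx.prototype.begin = function() { this.depth++; };
  NestedTx.prototype.commit = function() {
    if (--this.depth === 0 && this.doomed) {
      this.tx.abort();       // one abort() anywhere cancels everything
    }
    // depth > 0: an inner commit is a no-op
  };
  NestedTx.prototype.abort = function() {
    this.doomed = true;
    this.commit();           // same depth bookkeeping as a commit
  };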

>> I take it the reason one would want to start a nested transaction,
>> rather than simply create a new one, is to ensure that all locks are
>> being kept held, and that no other changes from another transaction
>> can slip in between?

You can't create a new one because you'd be isolated from the work done so far. Usually when you write reusable functions or libraries, what you want is for the work in the library to happen in a transacted way: either in your own transaction, or in the existing one if someone calls you with an active one.
 
>> If so, wouldn't the API require that the outer transaction is somehow
>> referenced (through an function argument or otherwise)? And if so,
>> couldn't the inner simply use the outer transaction? This seems
>> to be the case in the current drafts.

This might work; it depends on what's not reachable from the transaction object. Usually you want to pass one thing to your reusable function/library. Passing the database object ensures that you'll have everything you may need about the database accessible. If you pass a transaction, then you can't do things that depend on having the database object (e.g. create range objects). As I write this I'm realizing that perhaps this is just a matter of including a pointer to the database in the transaction.
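
For instance, a reusable function could then look like this (sketch; the tx.db back-pointer is the addition I'm suggesting, not something in the current draft):

  // Works whether the caller passes the database or an already-active
  // transaction.
  function addToLog(txOrDb, entry) {
    var tx = (txOrDb instanceof IDBDatabase)
        ? txOrDb.transaction(["log"], IDBTransaction.READ_WRITE)
        : txOrDb;            // join the caller's transaction
    // A hypothetical tx.db property would hand us back the database
    // object here, for anything a bare transaction can't reach
    // (range objects and the like).
    tx.objectStore("log").put(entry);
  }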

>> > Schema versioning:
>> > It's unfortunate that we need to have explicit elements in the page for the versioning protocol to work, but the fact that we can have a reliable mechanism for pages to coordinate a version bump is really nice. For folks that don't know about this the first time they build it, an explicit error message on the schema change timeout can explain where to start. I do think that there may be a need for non-breaking changes to the schema to happen without a "version dance". For example, query processors regularly create temporary tables during sorts and such. Those shouldn't require any coordination (maybe we allow non-versioned additions, or we just introduce temporary, unnamed tables that evaporate on commit() or database close()...).

>> If we do need support for temporary objectStores, I think we should
>> simply add an API like:
>>
>> interface IDBTransaction {
>>   ...
>>   IDBObjectStore createTemporaryObjectStore();
>>   ...
>> };
>>
>> I.e. do the unnamed evaporating objectStores that go away on commit.
>>
>> That way the application doesn't have to worry about name collisions
>> if two transactions run at the same time, or worry about forgetting to
>> call removeObjectStore at the end of the transaction. We could also
>> allow temporary objectStores to be written to, even if the transaction
>> was opened in READ_ONLY mode.
>>
>> In fact, since a temporary objectStore generally should be removed at
>> the end of a transaction, if we used the normal createObjectStore,
>> wouldn't we have to require a "breaking" version-change transaction
>> since removeObjectStore is called at the end of the transaction? I.e.
>> temporary object stores can't be implemented using add-only
>> non-versioned transactions.

Sold!

>> However I'd love to hear about when temporary objectStores are used. I
>> know I've seen them in evaluation strategies created by SQL databases,
>> but I wasn't able to come up with an example off the top of my head
>> for a use case that would require them.

Yes, it tends to be SQL-ish scenarios, even if SQL is not involved. For example, if you need to sort a large set you may want to sort it in blocks and keep the blocks on disk. Note that this is not just about running out of memory, but also about not getting even close to it and making the rest of the system slow down because of lack of resources.
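
A sketch of what I mean, leaning on the createTemporaryObjectStore() proposed above and on letting temporary stores be written to in READ_ONLY transactions (store and field names are made up; the temp store's key order does the sorting instead of buffering everything in memory):

  // Spill rows into a temporary store keyed by the sort key, then read
  // them back in key order. The temp store evaporates on commit, so
  // there's no cleanup and no version dance.
  function sortLargeSet(db, onRow) {
    var tx = db.transaction(["items"], IDBTransaction.READ_ONLY);
    var temp = tx.createTemporaryObjectStore();   // proposed API
    tx.objectStore("items").openCursor().onsuccess = function(e) {
      var cursor = e.target.result;
      if (cursor) {
        temp.put(cursor.value, cursor.value.sortKey);  // spill, don't buffer
        cursor.continue();
      } else {
        // Source exhausted: stream the rows back out in sorted order.
        temp.openCursor().onsuccess = function(e2) {
          var c = e2.target.result;
          if (c) { onRow(c.value); c.continue(); }
        };
      }
    };
  }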

-pablo

Received on Thursday, 15 July 2010 00:04:24 UTC