RE: [IndexedDB] Detailed comments for the current draft

On Mon, Feb 1, 2010 at 1:30 AM, Jeremy Orlow <jorlow@google.com> wrote:

> > > 1. Keys and sorting

> > > a.       3.1.1:  it would seem that having also date/time values as keys would be important and it's a common sorting criteria (e.g. as part of a composite primary key or in general as an index key).

> > The Web IDL spec does not support a Date/Time data type. Could your use case be supported by storing the underlying time with millisecond precision using an IDL long long type? I am willing to change the spec so that it allows long long instead of long IDL type, which will provide adequate support for Date and time sorting.

> Can the spec not be augmented?  It seems like other specs like WebGL have created their own types.  If not, I suppose your suggested change would suffice as well.  This does seem like an important use case.
 
I agree; either we could augment Web IDL or we could describe this in terms of JavaScript object values. That is, we could say something specific about the treatment of JavaScript's Date object. Would that be possible? E.g. we could require implementations to provide a total order for dates whenever they find an instance of that type in a key path.
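In the meantime, here is a minimal sketch of the millisecond workaround; the store, the property names, and the put() call are illustrative rather than taken from the draft:

    // Store the underlying time in milliseconds so an index on
    // "created" sorts records chronologically:
    var entry = { title: "meeting notes", created: new Date().getTime() };
    objectStore.put(entry);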

> > > b.      3.1.1: similarly, sorting on number in general (not just integers/longs) would be important (e.g. price lists, scores, etc.)

> > I am once again hampered by Web IDL spec. Is it possible to leave this for future versions of the spec?

Actually Web IDL does define the "double" type and its JavaScript binding. Can we add double to the list of types an index can be applied to?

> > > c.       3.1.1: cross type sorting and sorting of long values are clear. Sorting of strings however needs more elaboration. In particular, which collation do we use? Does the user or developer get to choose a collation? If we pick up a collation from the environment (e.g. the OS), if the collation changes we'd have to re-index all the databases.

> > I propose to use Unicode collation algorithm, which was also suggested by Jonas during a conversation.

I don't think this is specific enough, in that it still doesn't say which collation tables to use or how to specify them. A single collation strategy won't do for all languages (it will range from slightly wrong to nonsense depending on the target language). This is a trickier area than I had initially thought. We'll chew on this a bit and get back to the group with ideas.
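To make the divergence concrete, here is a small illustration; it uses the modern Intl.Collator API purely for demonstration, not as a proposal for the spec:

    // The same two strings sort differently under German and
    // Swedish collation rules:
    ["z", "ä"].sort(new Intl.Collator("de").compare); // ["ä", "z"]
    ["z", "ä"].sort(new Intl.Collator("sv").compare); // ["z", "ä"]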

> > > d.      3.1.3: spec reads ".key path must be the name of an enumerated property."; how about composite keys (would make the related APIs take a DOMString or DOMStringList)

> > I prefer to leave composite keys to a future version.

I don't think we can get away with this. For indexes this is quite common (if nothing else, to have a stable ordering when the prefix of the index has repeats). Once we have it for indexes, the delta for having it for primary keys as well is pretty small (although I wouldn't oppose leaving out composite primary keys if that would help scope the feature).
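A hypothetical sketch of what the API shape could look like if composite keys were accepted (a DOMStringList-style key path, per d. above; this is not current draft syntax):

    // An index over two properties; rows with the same lastName
    // stay stably ordered by firstName:
    objectStore.createIndex("byName", ["lastName", "firstName"]);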


> > > b.      Query processing libraries will need temporary stores, which need temporary names. Should we introduce an API for the creation of temporary stores with transaction lifetime and no name?

> > Firstly, I think we can leave this safely to a future version. Secondly, my suggestion would be to provide a parameter to the create call to indicate that an object store being created is a transient one, i.e., not backed by durable storage. Transient stores could still be available across different transactions. If your intention is to make these object stores unavailable across connections, then we can also offer a connection-specific transient object store.

> > In general, it requires us to introduce the notion of create params, which would simplify the evolution of the API. This is also similar to how Berkeley DB handles various options, not just those related to creation of a Berkeley "database".

Let's see how we progress on this one, and maybe revisit it a bit later. I'm worried about code that wants to do things such as a block sort that needs to spill to disk, as it would have to either use some naming pattern or ask the user for temporary store names.
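Something like the following is what I have in mind; the option name "transient" and the create-params shape are purely illustrative:

    // A transient store with no durable backing; without such an
    // option, a block sort must invent and clean up names itself:
    var spill = db.createObjectStore("sort-spill-0001", { transient: true });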

> > > c.      It would be nice to have an estimated row count on each store. This comes at an implementation and runtime cost. Strong opinions? Absent anything else, this would be the only statistic a query processor could base decisions on.

> > I believe we need to have a general way of estimating the number of records in a cursor once a key range has been specified. Kris Zyp also brings this up in a separate email. I am willing to add an estimateCount attribute to IDBCursor for this.

estimateCount sounds good.
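A hypothetical sketch of how a query processor might consume it (estimateCount being the proposed attribute, not yet in the draft):

    // Use the estimate to pick a plan before scanning:
    var cursor = objectStore.openCursor(keyRange);
    if (cursor.estimateCount > 10000) {
      // fall back to a different index or a full-scan strategy
    }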

> > > d.      The draft does not touch on how applications would do optimistic concurrency. A common way of doing this is to use a timestamp value that's automatically updated by the system every time someone touches the row. While we don't feel it's a must have, it certainly supports common scenarios.

> > Do you strongly feel that the manner in which optimistic concurrency is performed needs to be described in this spec? I don't.

Fair enough. Maybe we don't need to go that far, but I do wonder if we're providing all the necessary tools. For example, should we have an auto-updated timestamp type/column?
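Absent such a column, this is roughly what applications have to do by hand; the sketch assumes the synchronous API and an application-maintained "version" property:

    // Manual optimistic concurrency: re-read and compare before writing
    // (run inside one transaction to make the compare-and-write atomic).
    var row = objectStore.get(key);
    var expected = row.version;
    // ... user edits the row ...
    var current = objectStore.get(key);
    if (current.version !== expected) {
      throw new Error("conflict: row changed since it was read");
    }
    row.version = expected + 1;
    objectStore.put(row);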


> > > 4. Indexes

> > > a.       3.1.4 mentions "auto-populated" indexes, but then there is no mention of other types. We suggest that we remove this and in the algorithms section describe side-effecting operations as always updating the indexes as well.

> > The idea is that an index is either auto-populated or not. If it is not auto-populated, it must be managed explicitly. This was a requirement we discussed to support complex cases such as composite keys.

> Can you elaborate?  I don't understand what you mean by this.  And I agree with Pablo that indexes that are not auto-populated do not seem like a priority for the first version of the spec.  It seems like they add complexity to the API and make maintaining database consistency more difficult without any major benefit to the user....but maybe I'm missing something?

I would trade non-auto-populated indexes for composite keys... I'm not sure what other cases manual indexes support, but if you need an arbitrary index that is manually maintained, you can just create another store for that.
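For instance (store names and the key encoding are illustrative):

    // A manually maintained "index" as a plain second store, kept
    // in sync by the application itself:
    people.put({ id: 7, lastName: "Orlow", firstName: "Jeremy" });
    peopleByName.put({ key: "Orlow\u0000Jeremy", ref: 7 });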
 
> > I am reluctant to remove this feature. I can certainly clean it up so things are clearer.

> > > b.      If during insert/update the value of the key is not present (i.e. undefined, as opposed to null or a value), is that a failure, does the row not get indexed, or is it indexed as null? Failure would probably cause a lot of trouble for users; the other two have correctness problems. An option is to index them as undefined, but then we have both undefined and null as indexable keys. We lean toward this last option.

> > I haven't seen enough application experience around this to suggest that treating undefined as null would be the right thing to do. Unfortunately, creating a little bit of trouble for programmers to handle their use of undefined keys seems like the only safe thing to do.

Can we get this flagged as a pending issue in the spec? I feel that we have to sort this out before we call it reasonably complete.
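To make the ambiguity concrete (assuming "author" is the index's key path; illustrative):

    objectStore.put({ title: "a" });               // author is undefined
    objectStore.put({ title: "b", author: null }); // author is null
    // Failure, skip the row, index as null, or index as undefined?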

> > > b.      Clarification on transactions: all database operations that affect the schema (create/remove store/index, setVersion, etc.) as well as data modification operations are assumed to be auto-commit by default, correct? Furthermore, all those operations (both schema and data) can happen within a transaction, including mixing schema and data changes. Does that line up with others' expectations? If so we should find a spot to articulate this explicitly.

> > The auto-commit mode, per my intention, is when an IDBDatabase object doesn't have a currentTransaction set. Is that what you meant?

Yes.

> > Moreover, in 3.2.9 I intended to allow the database itself to be identified as an object to be reserved for isolation from other transactions (in addition to the object stores and indexes). I can improve the spec text around this. This allows transactions in any of the three isolation modes to be used for schema operations in conjunction with data modification operations.

I hadn't realized that. Yes, clarification would be good.

This still doesn't address the second part of the question though: is it fair to assume that in a single transaction I can mix DDL and DML?
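That is, is the following legal in one transaction? (A sketch; the method names only loosely follow the current draft.)

    var txn = db.transaction();                    // explicit transaction
    var logs = db.createObjectStore("logs", "id"); // schema change (DDL)
    logs.put({ id: 1, msg: "created and populated atomically" }); // DML
    txn.commit();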
 
> > > 6.       Transactions
> > > a.       While we understand the goal of simplifying developers' lives with an error-free transactional model, we're not sure if we're doing more harm by introducing more concepts into this space. Wouldn't it be better to use regular transactions with a well-known failure mode (e.g. either deadlocks or an optimistic concurrency failure on commit)?

> > There has been prior discussion about this in the WG. I would suggest reading the thread on this [1]. I would be interested to see new implementation experience that either refutes or further supports a particular argument in that thread.

Thanks, I'll read the pointers and get back to this list if needed.

> > > b.    In auto-commit mode, if two cursors are opened at the same time (e.g. to scan them in an interleaved way), are they in independent transactions simultaneously active in the same connection?

> > In the case of auto-commit, there will not be simultaneous transactions, because each modification commits before any subsequent modification can occur.

For single-step operations such as get/put/delete that makes sense. Cursors can interleave, though. I mean, once you've opened a cursor and/or fetched a row, you can do other things to the database object (including opening another cursor or doing a get/put/delete) while the cursor is open. If I don't have an explicit transaction, does each of those operations happen in an autonomous transaction? I expect that to be the case. If so, it would need to be called out explicitly in the spec.
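Concretely, something like this (sync-style calls for brevity; names are illustrative):

    // No explicit transaction is active on the connection:
    var a = people.openCursor();
    var b = orders.openCursor();      // a second cursor while a is open
    people.put({ id: 9, name: "x" }); // a one-shot write in between
    a.continue();
    b.continue();
    // Does each of these operations get its own autonomous transaction?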


> > > 7. Algorithms

> > > d.      3.2.4.2: in our experiments writing application code, the fact that this method throws an exception when an item is not found is quite inconvenient. It would be much more natural to just return undefined, as not finding something can be a primary code path rather than an exceptional situation. Same for 3.2.5, step 2, and 3.2.6, step 2.

> > I am not comfortable specifying the API to be dependent on the separation between undefined and null. Since null is a valid return value, it doesn't make sense to return that either. The only safe alternative appears to be to throw an error.

What do other folks think about this? I understand your concern, but it makes writing regular code really noisy as you need try/catch blocks to handle non-exceptional situations. 
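For reference, this is the shape every ordinary lookup takes under the throw-on-missing design (a sketch using the synchronous API):

    var value;
    try {
      value = objectStore.get(key);
    } catch (e) {
      value = undefined; // "not found" is the normal case here
    }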

> > As a means of improving usability, I propose adding another method "exists" which takes the same arguments as "get" and returns true or false. If a program doesn't know for sure whether a key exists in the database, it can use the exists method to avoid an exception.

Well, exists() would do a full lookup, so wouldn't it be a waste to do an exists() followed by a get()?
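That is, the natural usage implies two probes for the same key (exists() being the proposed method, not yet in the draft):

    if (objectStore.exists(key)) {      // first lookup
      var value = objectStore.get(key); // second lookup, same key
    }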

> > > 9. API

> > > f.        ObjectStoreSync: what happens to the reference if the underlying store is deleted through another connection? We propose that it's OK to alter underlying objects in general, and that "visible" objects should be prepared to start failing when the objects they surface go away or are altered.

> > The spec does not manage integrity constraints. It does what you expect and fails if the read operation on an index cannot find the referenced object.

Sorry, I didn't mean to refer to integrity constraints. I meant: what happens to the actual handle that's open (e.g. the store handle) when the underlying object gets deleted (assuming the system doesn't block that)?

> > > g.       CursorSync.openCursor: does the cursor start on the first record or before the first record? Should probably be before the first record so the first call to continue() can return false for empty stores, moving straight from BOF to EOF.

> > The cursor starts on the first record. The call to continue is not required until after you are done with the first value; it should not be required at all if you are only going to read the first value in a cursor.

How do you know you're at EOF in the case of an empty store if you don't have to call continue() to move to the first record? Maybe I'm missing a separate way of checking for EOF...
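With a before-first start, the empty-store case falls out of the loop naturally; a sketch of the semantics I'm proposing (process is an illustrative callback):

    var cursor = objectStore.openCursor();
    while (cursor.continue()) { // returns false immediately for empty stores
      process(cursor.value);
    }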

> > > 10.       API (async specifics)

> > > a.       Currently the async API is only available on the window object and not to workers. Libraries are likely to target only one mode, in particular async, to work across all scenarios. So it would be important to have async also in workers.

> > I would be willing to edit this portion of the requirements, only once we have a stable API for the rest of the spec.

> I strongly agree with Pablo on this one.

Great, could we note this somewhere in the spec to make sure we come back to it before calling it done?


-pablo
