Re: [IndexedDB] Detailed comments for the current draft from Nikunj Mehta on 2010-02-01 (public-webapps@w3.org from January to March 2010)

From: Nikunj Mehta <nikunj@o-micron.com>
Date: Sun, 31 Jan 2010 23:33:37 -0800
To: Pablo Castro <Pablo.Castro@microsoft.com>
Cc: "public-webapps@w3.org" <public-webapps@w3.org>
Message-Id: <409835CB-5CF7-4967-98D4-5DFC7264C72C@o-micron.com>

On Jan 26, 2010, at 12:47 PM, Pablo Castro wrote:

> These are notes that we collected both from reviewing the spec  
> (editor's draft up to Jan 24th) and from a prototype implementation  
> that we are working on. I didn't realize we had this many notes,  
> otherwise I would have been sending intermediate notes early. Will  
> do so next round.
>
>
> 1. Keys and sorting
>
> a.       3.1.1:  it would seem that having also date/time values as  
> keys would be important and it's a common sorting criterion (e.g. as  
> part of a composite primary key or in general as an index key).

The Web IDL spec does not support a Date/Time data type. Could your  
use case be supported by storing the underlying time with millisecond  
precision using an IDL long long type? I am willing to change the spec  
so that it allows long long instead of long IDL type, which will  
provide adequate support for Date and time sorting.
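
A minimal sketch of this workaround, assuming the application does the
conversion itself. The helper names dateToKey and keyToDate are made up
for illustration and are not part of the draft. Millisecond timestamps
for 2010 are roughly 1.26e12, which is why a 32-bit long is not enough:

```typescript
// Hypothetical helpers (not part of the draft): encode a Date as an integer
// key holding milliseconds since the Unix epoch, and decode it again.
function dateToKey(d: Date): number {
  return d.getTime();
}

function keyToDate(key: number): Date {
  return new Date(key);
}

// Numeric ordering of the keys matches chronological ordering of the dates.
const dates = [
  new Date(Date.UTC(2010, 0, 31)),
  new Date(Date.UTC(2009, 6, 1)),
  new Date(Date.UTC(2010, 0, 26)),
];
const sortedKeys = dates.map(dateToKey).sort((a, b) => a - b);
const sortedDates = sortedKeys.map(keyToDate);
```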

> b.      3.1.1: similarly, sorting on number in general (not just  
> integers/longs) would be important (e.g. price lists, scores, etc.)

I am once again hampered by Web IDL spec. Is it possible to leave this  
for future versions of the spec?

> c.       3.1.1: cross type sorting and sorting of long values are  
> clear. Sorting of strings however needs more elaboration. In  
> particular, which collation do we use? Does the user or developer  
> get to choose a collation? If we pick up a collation from the  
> environment (e.g. the OS), if the collation changes we'd have to re- 
> index all the databases.

I propose to use the Unicode Collation Algorithm, which Jonas also  
suggested during a conversation.
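
To illustrate why the choice matters, here is a rough sketch comparing
plain code-unit ordering with a collation-based ordering. Intl.Collator
is used only as a stand-in: it postdates this draft and applies locale
tailoring on top of the Unicode Collation Algorithm, so it is an
assumption for illustration, not what the spec would mandate.

```typescript
// Code-unit comparison: every uppercase ASCII letter sorts before every
// lowercase ASCII letter.
function compareCodeUnits(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Collation-based comparison (stand-in for a UCA implementation).
const collator = new Intl.Collator("en");

const words = ["banana", "Cherry", "apple"];
const byCodeUnit = words.slice().sort(compareCodeUnits);  // Cherry, apple, banana
const byCollation = words.slice().sort(collator.compare); // apple, banana, Cherry
```

An index built under one ordering is not valid under the other, which
is the re-indexing concern raised in the question.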

> d.      3.1.3: spec reads "…key path must be the name of an  
> enumerated property…"; how about composite keys (would make the  
> related APIs take a DOMString or DOMStringList)

I prefer to leave composite keys to a future version.

>
>
> 2. Values
>
> a.       3.1.2: isn't the requirement for "structured clones" too  
> much? It would mean implementations would have to be able to store  
> and retrieve File objects and such. Would it be more appropriate to  
> say it's just graphs of Javascript primitive objects/values (object,  
> string, number, date, arrays, null)?

Your list leaves out File, Blob, FileList, ImageData, and RegExp  
types. While I don't feel so strongly about all these types, I believe  
that support for Blob/File and ImageData will be beneficial to those  
who work with browsers. Instead of profiling this algorithm, I think  
it is best to just require the same algorithm.

>
>
> 3. Object store
>
> a.       3.1.3: do we really need in-line + out-of-line keys?  
> Besides the concept-count increase, we wonder whether out-of-line  
> keys would cause trouble to generic libraries, as the values for the  
> keys wouldn't be part of the values iterated when doing a "foreach"  
> over the table.

Certainly it is a matter of prioritizing among various requirements.  
Out-of-line keys enable people to store simple persistent hash maps. I  
think it would be wrong to require that data be always stored as  
objects. A library can always elide the availability of out-of-line  
keys if that poses a problem to its users.
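
A toy in-memory model of the two styles, as a sketch only. The class
names OutOfLineStore and InLineStore are hypothetical and do not appear
in the draft. It shows that an out-of-line store can hold plain values
such as numbers and strings, while an in-line store needs an object to
carry the key:

```typescript
// Out-of-line keys: the key is supplied next to the value, so the store
// behaves like a persistent hash map and can hold primitives.
class OutOfLineStore<V> {
  private records = new Map<string, V>();

  put(value: V, key: string): string {
    this.records.set(key, value);
    return key;
  }

  get(key: string): V {
    if (!this.records.has(key)) throw new Error(`no record for key "${key}"`);
    return this.records.get(key) as V;
  }
}

// In-line keys: the key is read from a property of the stored object.
class InLineStore<V extends { [prop: string]: unknown }> {
  private records = new Map<string, V>();

  constructor(private readonly keyPath: string) {}

  put(value: V): string {
    const key = value[this.keyPath];
    if (typeof key !== "string") throw new Error(`no string key at "${this.keyPath}"`);
    this.records.set(key, value);
    return key;
  }
}

const settings = new OutOfLineStore<number | string>();
settings.put(14, "fontSize");
settings.put("dark", "theme");

const people = new InLineStore<{ [prop: string]: unknown }>("email");
const personKey = people.put({ email: "a@example.org", name: "A" });
```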

> b.      Query processing libraries will need temporary stores, which  
> need temporary names. Should we introduce an API for the creation of  
> temporary stores with transaction lifetime and no name?

Firstly, I think we can leave this safely to a future version.  
Secondly, my suggestion would be to provide a parameter to the create  
call to indicate that an object store being created is a transient  
one, i.e., not backed by durable storage. They could be available  
across different transactions. If your intention is to not make these  
object stores unavailable across connections, then we can also offer a  
connection-specific transient object store.

In general, it requires us to introduce the notion of create params,  
which would simplify the evolution of the API. This is also similar to  
how Berkeley DB handles various options, not just those related to  
creation of a Berkeley "database".

> c.      It would be nice to have an estimate row count on each  
> store. This comes at an implementation and runtime cost. Strong  
> opinions? Lacking everything else, this would be the only statistic  
> to base decisions on for a query processor.

I believe we need to have a general way of estimating the number of  
records in a cursor once a key range has been specified. Kris Zyp also  
brings this up in a separate email. I am willing to add an  
estimateCount attribute to IDBCursor for this.

> d.      The draft does not touch on how applications would do  
> optimistic concurrency. A common way of doing this is to use a  
> timestamp value that's automatically updated by the system every  
> time someone touches the row. While we don't feel it's a must have,  
> it certainly supports common scenarios.

Do you strongly feel that the manner in which optimistic concurrency  
is performed needs to be described in this spec? I don't.

>
>
> 4. Indexes
>
> a.       3.1.4 mentions "auto-populated" indexes, but then there is  
> no mention of other types. We suggest that we remove this and in the  
> algorithms section describe side-effecting operations as always  
> updating the indexes as well.

The idea is that an index is either auto-populated or not. If it is  
not auto-populated, it must be managed explicitly. This was a  
requirement we discussed to support complex cases such as composite  
keys. I am reluctant to remove this feature. I can certainly clean it  
up so things are clearer.

> b.      If during insert/update the value of the key is not present  
> (i.e. undefined as opposed to null or a value), is that a failure,  
> does the row not get indexed, or is it indexed as null? Failure  
> would probably cause a lot of trouble to users; the other two have  
> correctness problems. An option is to index them as undefined, but  
> now we have undefined and null as indexable keys. We lean toward  
> this last option.

I haven't seen enough application experience around this to suggest  
that treating undefined as null would be the right thing to do.  
Unfortunately, creating a little bit of trouble for programmers to  
handle their use of undefined keys seems like the only safe thing to do.
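
One reading of the above is that a missing key is reported as a
failure while null remains an ordinary value. A minimal sketch of that
reading, where extractKey is a hypothetical helper and not an API from
the draft:

```typescript
// Hypothetical helper: read a key from a value through a single-property
// key path. A missing (undefined) key is an error; null is passed through.
function extractKey(value: { [prop: string]: unknown }, keyPath: string): unknown {
  const key = value[keyPath];
  if (key === undefined) {
    throw new Error(`no value at key path "${keyPath}"`);
  }
  return key;
}

const nullKey = extractKey({ id: null }, "id"); // null, not an error
const realKey = extractKey({ id: 7 }, "id");    // 7
```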

> 5.       Databases
> a.       Not being able to enumerate database gets in the way of  
> creating good tools and frameworks such as database explorers. What  
> was the motivation for this? Is it security related?

Database explorers are best designed in the browser as an add-on or  
development tool. This would require additional interfaces not  
available to applications. This approach is consistent with usage  
experience around SQL databases and the database explorer built into  
Safari.

> b.      Clarification on transactions: all database operations that  
> affect the schema (create/remove store/index, setVersion, etc.) as  
> well as data modification operations are assumed to be auto-commit  
> by default, correct? Furthermore, all those operations (both schema  
> and data) can happen within a transaction, including mixing schema  
> and data changes. Does that line up with others' expectations? If so  
> we should find a spot to articulate this explicitly.

The auto-commit mode, per my intention, is when an IDBDatabase object  
doesn't have a currentTransaction set. Is that what you meant?

Moreover, in 3.2.9 I intended to allow the database itself to be  
identified as an object to be reserved for isolation from other  
transactions (in addition to the object stores and indexes). I can  
improve the spec text around this. This allows transactions in any of  
the three isolation modes to be used for schema operations in  
conjunction with data modification operations.

> c.       No way to delete a database? It would be reasonable for  
> applications to want to do that and let go of the user data (e.g. a  
> "forget me" feature in a web site)

There is currently no way to delete a database through an API. I can  
clarify this further, if needed. Of course, user interfaces can be  
developed to remove a database just like a cookie can be removed.  
Also, this style is similar to the approach taken in SQL database. Are  
there particular use cases that require programmatic ability to remove  
databases?

> 6.       Transactions
> a.       While we understand the goal of simplifying developers'  
> life with an error-free transactional model, we're not sure if we're  
> doing more harm by introducing more concepts into this space.  
> Wouldn't it be better to use regular transactions with a well-known  
> failure mode (e.g. either deadlocks or optimistic concurrency  
> failure on commit)?

There has been prior discussion about this in the WG. I would suggest  
reading the thread on this [1]. I would be interested to see new  
implementation experience that either refutes or further supports a  
particular argument in that thread.

> b.    If in auto-commit mode, if two cursors are opened at the same  
> time (e.g. to scan them in an interleaved way), are they in  
> independent transactions simultaneously active in the same connection?

In the case of auto-commit, there will not be simultaneous  
transactions, because each modification commits before any subsequent  
modification can occur.

>
>
> 7. Algorithms
>
> a.       3.2.2: steps 4 and 5 are inverted in order.

Agreed.

> b.      3.2.2: when there is a key generator and the store uses in- 
> line keys, should the generated key value be propagated to the  
> original object (in addition to the clone), such that both are in  
> sync after the put operation?

This appears to be a thing that can easily be done by using the return  
value from the algorithm. I would prefer not to modify the received  
object, to the extent possible.
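
A sketch of what that looks like from the caller's side. AutoKeyStore
and the Note type are made up for illustration, and a shallow copy
stands in for the structured clone:

```typescript
interface Note {
  id?: number;
  text: string;
}

// Toy store with a key generator and an in-line key at "id".
class AutoKeyStore {
  private nextKey = 1;
  private records = new Map<number, Note>();

  // Stores a copy carrying the generated key and returns that key.
  // The object passed in is left untouched.
  put(value: Note): number {
    const key = this.nextKey++;
    this.records.set(key, { ...value, id: key });
    return key;
  }
}

const noteStore = new AutoKeyStore();
const note: Note = { text: "hello" };
const generatedKey = noteStore.put(note); // note.id is still undefined here

// The caller decides whether to bring its own copy in sync.
const syncedNote: Note = { ...note, id: generatedKey };
```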

> c.       3.2.3: step 2, probably editorial mistake? Wouldn't all  
> indexes have a key path?

Nope. An index that is not auto-populated will not have a key path.  
See 3.1.4.

> d.      3.2.4.2: in our experiments writing application code, the  
> fact that this method throws an exception when an item is not found  
> is quite inconvenient. It would be much more natural to just return  
> undefined, as this can be a primary code path (to not find  
> something) and not an exceptional situation. Same for 3.2.5, step 2  
> and 3.2.6 step 2.

I am not comfortable specifying the API to be dependent on the  
separation between undefined and null. Since null is a valid return  
value, it doesn't make sense to return that either. The only safe  
alternative appears to be to throw an error.

As a means of improving usability, I propose adding another method  
"exists" which takes the same arguments as "get" and returns true or  
false. If a program doesn't know for sure whether a key exists in the  
database, it can use the exists method to avoid an exception.
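
A rough sketch of how the proposed pair of methods would be used. The
ToyStore class is an in-memory stand-in written for this example; only
the method names "get" and "exists" come from the discussion above:

```typescript
class ToyStore<V> {
  private records = new Map<string, V>();

  put(value: V, key: string): void {
    this.records.set(key, value);
  }

  // Throws when the key is absent, so null stays usable as a stored value.
  get(key: string): V {
    if (!this.records.has(key)) throw new Error(`not found: ${key}`);
    return this.records.get(key) as V;
  }

  // Same argument as get, but answers with true or false instead of throwing.
  exists(key: string): boolean {
    return this.records.has(key);
  }
}

const toyStore = new ToyStore<string | null>();
toyStore.put(null, "nickname");

// "Not found" becomes an ordinary code path rather than an exception.
const nickname = toyStore.exists("nickname") ? toyStore.get("nickname") : "n/a";
const missing = toyStore.exists("email") ? toyStore.get("email") : "n/a";
```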

> e.      The algorithm to put a new object into a store currently  
> indicates that the key of the object should be returned. How about  
> other values that may be generated by the store? For example, if the  
> store generates timestamps (not currently in the draft, but may be  
> needed for optimistic concurrency control), how would we return  
> them? Should we update the actual object that was passed as a  
> parameter with keys and other server-generated values?

It will only be possible to return one value from a call. Given that  
the domain we are in is key-value databases, it makes sense to return  
a generated key from a call to store a value.

>
>
> 8. Performance and API style
>
> a.       The async nature of the API makes regular scans very heavy  
> on callbacks (one per row plus completion/error callbacks). This  
> slows down scans a lot, so when doing multiple scans (e.g. a  
> reasonably complicated query that has joins, sorts and filters)  
> performance will be bound by this even if everything else happens  
> really fast. It would be interesting to support a block-fetch mode  
> where the callback gets called for a number of buffered rows  
> (indicated when the scan is initiated) instead of being called for a  
> single row. This would be either a configuration option on  
> openCursor or a new method on the cursor for

This is an interesting direction in my opinion. I would like to  
explore this further, although it also appears suitable for an  
evolution of the API. I think, though, that it would require a  
different interface than IDBCursor, since that produces a key and a  
value at a time.
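
To make the callback arithmetic concrete, here is a small sketch of
block delivery. scanInBlocks is a hypothetical function, it runs
synchronously over an array for simplicity, and it says nothing about
what the eventual interface would look like:

```typescript
// Deliver rows to the callback in blocks of up to blockSize rows instead of
// one row per call. Returns the number of callback invocations.
function scanInBlocks<T>(
  rows: readonly T[],
  blockSize: number,
  onBlock: (block: T[]) => void
): number {
  if (!Number.isInteger(blockSize) || blockSize < 1) {
    throw new RangeError("blockSize must be a positive integer");
  }
  let calls = 0;
  for (let i = 0; i < rows.length; i += blockSize) {
    onBlock(rows.slice(i, i + blockSize));
    calls++;
  }
  return calls;
}

// Ten rows with a block size of four need three callbacks instead of ten.
const blockSizes: number[] = [];
const callbackCount = scanInBlocks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4, (block) => {
  blockSizes.push(block.length);
});
```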

>
>
> 9. API
>
> a.       DatabaseSync.createIndex: what's the default for the unique  
> argument?

It should be added. This value is false.

> b.      DatabaseSync.createObjectStore: what's the default for  
> autoIncrement?

It should be added. This value is false.

> c.       DatabaseSync.openObjectStore: what's the default for mode?

It should be added. This value is IDBObjectStore.READ_WRITE.

> d.      DatabaseSync.transaction: what's the units for the timeout  
> value? Seconds? Is there a value that means "infinite"?

Milliseconds. The lack of a timeout value in this call indicates a  
timeout limited only by the system's maximum timeout, which is  
implementation dependent.

> e.      ObjectStoreSync.get: see 7.d (return undefined instead of  
> throwing an exception)

Please see my comments on this above.

> f.        ObjectStoreSync: what happens to the reference if the  
> underlying store is deleted through another connection? We propose  
> it's ok to alter underlying objects in general and "visible" objects  
> should be ready and start failing when the objects they surface go  
> away or are altered.

The spec does not manage integrity constraints. It does what you  
expect and fails if the read operation on an index cannot find the  
referenced object.

> g.       CursorSync.openCursor: does the cursor start on the first  
> record or before the first record? Should probably be before the  
> first record so the first call to continue() can return false for  
> empty stores, moving straight from BOF to EOF.

Cursor starts on the first record. The call to continue is not  
required until after you are done with the first value. The call to  
continue should not be required, if you are going to only read the  
first value in a cursor.
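
A sketch of the resulting iteration pattern over a toy cursor. ToyCursor
and openCursor are stand-ins written for this example. How an empty
store is reported is not covered above, so returning null for it here
is purely an assumption of the sketch:

```typescript
class ToyCursor<V> {
  private pos = 0;

  constructor(private readonly rows: readonly V[]) {}

  // The cursor is positioned on the first record as soon as it is opened.
  get value(): V {
    if (this.pos >= this.rows.length) throw new Error("cursor is at EOF");
    return this.rows[this.pos] as V;
  }

  // Advances by one record. Returns false when the move lands on EOF.
  continue(): boolean {
    if (this.pos < this.rows.length) this.pos++;
    return this.pos < this.rows.length;
  }
}

// Assumption: an empty store yields no cursor at all.
function openCursor<V>(rows: readonly V[]): ToyCursor<V> | null {
  return rows.length === 0 ? null : new ToyCursor(rows);
}

// Read first, then continue: a do/while loop rather than a while loop.
function readAll<V>(rows: readonly V[]): V[] {
  const out: V[] = [];
  const cursor = openCursor(rows);
  if (cursor === null) return out;
  do {
    out.push(cursor.value);
  } while (cursor.continue());
  return out;
}
```

Reading only the first value needs no call to continue at all.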

> h.      CursorSync.count: what scenario does this enable? Also, name  
> is misleading; should be sameKeyCount or something that indicates  
> it's the count only of the rows that share the current key.

The key count is easier to implement than maintaining or calculating  
the count of records in an object store or across a key range.  
However, it is not as interesting as the approximate number of records  
in a key range in a given database object. Given that, I am willing to  
consider treating count as what it alludes to - the approximate number  
of records in the cursor.

> i.         CursorSync.value: when the cursor is over an index,  
> shouldn't the value be read-only as changing it would make it  
> inconsistent with the object store this index is for?

Changing the index, when the index is auto-populated, would make it  
inconsistent with the object store. However, integrity constraints are  
not enforced, so this will not be a problem. In case of auto-populated  
indexes, changing an index record is not allowed. I will update the  
text so this is clear.

> j.        CursorSync.continue(): does it return false when it  
> reaches the last record or when it's called *on* the last record and  
> moves to EOF (effectively moved past the last record)? If it's  
> sitting in EOF, does it "see" new inserts? (we assume not)

It returns false when it is called on the last record and moves to  
EOF. Inserts are not possible on a cursor.

> k.       CursorSync.delete(): "delete" causes trouble, should be  
> "remove"

Gotcha.

> l.         CursorSync.delete(): what happens to the cursor position  
> after this function returns? One option would be to leave the cursor  
> on the deleted row, and fail all access attempts so only continue()  
> can be called.

Exactly. That is the intended behavior. The text explaining this was  
lost in the most recent WD.
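
Since that text is missing, here is a sketch of the behavior as
described in this exchange: after removal the cursor stays on the
removed row, reads fail, and only continue() moves it on. The class
RemovableCursor is made up for illustration and uses "remove" per
item k above:

```typescript
class RemovableCursor<V> {
  private pos = 0;
  private removed = false;

  constructor(private readonly rows: V[]) {}

  get value(): V {
    if (this.removed) throw new Error("current record was removed; call continue()");
    if (this.pos >= this.rows.length) throw new Error("cursor is at EOF");
    return this.rows[this.pos] as V;
  }

  // Deletes the current record and leaves the cursor on the removed row.
  remove(): void {
    if (this.removed) throw new Error("current record was already removed");
    if (this.pos >= this.rows.length) throw new Error("cursor is at EOF");
    this.rows.splice(this.pos, 1);
    this.removed = true;
  }

  continue(): boolean {
    if (this.removed) {
      this.removed = false; // the following record has shifted into this.pos
    } else if (this.pos < this.rows.length) {
      this.pos++;
    }
    return this.pos < this.rows.length;
  }
}
```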

> m.    IndexSync: the put/delete methods seem to enable users to  
> modify the index independently of the store, making them  
> inconsistent. Given that the only kind of index described is auto- 
> populated, it doesn't seem appropriate to have these.

An index may not be auto-populated. See earlier responses.

> n.    Should we consider introducing an API that given an object and  
> a store returns the key to that object? that would avoid the need  
> for knowing the exact algorithm used to obtain the key from an  
> object + path.

I would like to put that in the parking lot for now.

>
>
> 10.       API (async specifics)
>
> a.       Currently the async API is only available on the window  
> object and not to workers. Libraries are likely to target only one  
> mode, in particular async, to work across all scenarios. So it would  
> be important to have async also in workers.

I would be willing to edit this portion of the requirements only once  
we have a stable API for the rest of the spec.

> b.      DBRequest.abort(): it may not be possible to guarantee abort  
> in all phases of execution, so this should be described as a "best  
> effort" method; onsuccess would be called if the system decided to  
> proceed and complete the operation, and onerror if abort succeeded  
> at stopping the operation (with proper code indicating the error is  
> due to an explicit abort request). In any case ready state should go  
> to done.

Will clarify. The ready state should go to DONE if the request  
completes and to INITIAL if it successfully aborted.

> c.       The pattern where there is a single request object (e.g.  
> indexedDB.request) prevents user code from having multiple  
> outstanding requests against the same object (e.g. multiple 'open'  
> or multiple 'openCursor' requests). An alternate pattern that does  
> not have this problem would be to return the request object from the  
> method (e.g. from 'open').

I will address this in a separate email.

> d.      CursorRequest.continue(): this seems to break the pattern  
> where request.result has the result of the operation; for continue  
> the operation (in the sync version) is true/false depending on  
> whether the cursor reached EOF. So in async request.result should be  
> the true/false value, the value itself would be available in the  
> cursor's "value" property,  and the success callback would be called  
> instead of the error one.

CursorRequest does carry the result of performing an operation on the  
cursor, i.e., continue. I think we are both agreeing on what the value  
of request.result ought to be.

>
>
> 11. API Names
>
> a.       "transaction" is really non-intuitive (particularly given  
> the existence of currentTransaction in the same class).  
> "beginTransaction" would capture semantics more accurately.

Propose openTransaction() to be consistent.

> b.      ObjectStoreSync.delete: delete is a Javascript keyword, can  
> we use "remove" instead?

Yes

>
>
> 12. Object names in general
>
> a.       For database, store, index and other names in general, the  
> current description in various places says "case sensitive". It  
> would be good to be more specific and indicate "exact match" of all  
> constructs (e.g. accents, kana width). Binary match would be very  
> restrictive but a safe target. Alternatively we could just leave  
> this up to each implementation, and indicate non-normatively what  
> would be a safe pattern of strings to use.

Prefer to perform UTF-8 comparison.
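
A sketch of what byte-wise matching implies for names. The functions
utf8Bytes and sameName are hypothetical helpers; the encoder is
hand-written to keep the example self-contained and does not treat
lone surrogates specially:

```typescript
// Encode a string as UTF-8 bytes.
function utf8Bytes(s: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const cp = s.codePointAt(i) as number;
    if (cp > 0xffff) i++; // skip the low surrogate of a pair
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}

// Two names match only if their UTF-8 encodings are identical.
function sameName(a: string, b: string): boolean {
  const x = utf8Bytes(a);
  const y = utf8Bytes(b);
  return x.length === y.length && x.every((byte, i) => byte === y[i]);
}

const caseDiffers = sameName("Users", "users");          // false
const accentForms = sameName("caf\u00e9", "cafe\u0301"); // false: precomposed vs combining
const kanaWidth = sameName("\uff76", "\u30ab");          // false: halfwidth vs fullwidth KA
```

No normalization or case folding is applied, so visually identical
names can still refer to different objects.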

>
>
> 13. Editorial notes
>
> a.      Ranges: left-right versus start-end. "bound" versus "closed"  
> for intervals.

The terms are well defined in mathematics and unlikely to cause  
confusion. See [2]

> b.      Ranges: bound, "Create a new right-bound key range." ->  
> right & left bound

Correct.

> c.       3.2.7 obejct -> object

Gotcha

> d.      The current draft fails to format in IE, the script that  
> comes with the page fails with an error

I am aware of this and am working with the maintainer of ReSpec.js  
tool to publish an editor's draft that displays in IE.  Would it be OK  
if this editor's draft that works in IE is made available at an  
alternate W3C URL?

[1] http://lists.w3.org/Archives/Public/public-webapps/2009JulSep/0240.html
[2] http://en.wikipedia.org/wiki/Interval_%28mathematics%29#Terminology
Received on Monday, 1 February 2010 07:34:59 UTC