Re: [IndexedDB] Detailed comments for the current draft from Jeremy Orlow on 2010-02-01 (public-webapps@w3.org from January to March 2010)

From: Jeremy Orlow <jorlow@chromium.org>
Date: Mon, 1 Feb 2010 01:29:48 -0800
To: Nikunj Mehta <nikunj@o-micron.com>
Cc: Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <5dd9e5c51002010129w309d11d6n376d5faba57b9109@mail.gmail.com>
On Sun, Jan 31, 2010 at 11:33 PM, Nikunj Mehta <nikunj@o-micron.com> wrote:

>
> On Jan 26, 2010, at 12:47 PM, Pablo Castro wrote:
>
>  These are notes that we collected both from reviewing the spec (editor's
>> draft up to Jan 24th) and from a prototype implementation that we are
>> working on. I didn't realize we had this many notes, otherwise I would have
>> been sending intermediate notes early. Will do so next round.
>>
>>
>> 1. Keys and sorting
>>
>> a.       3.1.1:  it would seem that having also date/time values as keys
>> would be important and it's a common sorting criterion (e.g. as part of a
>> composite primary key or in general as an index key).
>>
>
> The Web IDL spec does not support a Date/Time data type. Could your use
> case be supported by storing the underlying time with millisecond precision
> using an IDL long long type? I am willing to change the spec so that it
> allows long long instead of long IDL type, which will provide adequate
> support for Date and time sorting.


Can the spec not be augmented?  It seems like other specs like WebGL have
created their own types.  If not, I suppose your suggested change would
suffice as well.  This does seem like an important use case.
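
To make the long long fallback concrete, here is a minimal TypeScript sketch
of what an application would have to do by hand. The helper names
(dateToKey, keyToDate) are hypothetical and not from the draft; the only
assumption is that the store compares numeric keys numerically.

```typescript
// Hypothetical helpers (not part of the draft): encode a Date as
// milliseconds since the Unix epoch, which fits in an IDL long long
// and sorts in chronological order.
function dateToKey(d: Date): number {
  return d.getTime();
}

function keyToDate(key: number): Date {
  return new Date(key);
}

const demoKeys: number[] = [
  dateToKey(new Date(Date.UTC(2010, 1, 1))),  // 1 Feb 2010
  dateToKey(new Date(Date.UTC(2010, 0, 26))), // 26 Jan 2010
  dateToKey(new Date(Date.UTC(2010, 0, 31))), // 31 Jan 2010
];

// Numeric ordering of the keys is chronological ordering of the dates.
const sortedKeys = demoKeys.slice().sort((a, b) => a - b);
const sortedIso = sortedKeys.map((k) => keyToDate(k).toISOString());
```

This works, but every application has to repeat the conversion on the way in
and on the way out, which is why a real date type would be nicer.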


>  b.      3.1.1: similarly, sorting on number in general (not just
>> integers/longs) would be important (e.g. price lists, scores, etc.)
>>
>
> I am once again hampered by Web IDL spec. Is it possible to leave this for
> future versions of the spec?
>
>
>  c.       3.1.1: cross type sorting and sorting of long values are clear.
>> Sorting of strings however needs more elaboration. In particular, which
>> collation do we use? Does the user or developer get to choose a collation?
>> If we pick up a collation from the environment (e.g. the OS), if the
>> collation changes we'd have to re-index all the databases.
>>
>
> I propose using the Unicode Collation Algorithm, which was also suggested by
> Jonas during a conversation.
>
>
>  d.      3.1.3: spec reads "…key path must be the name of an enumerated
>> property…"; how about composite keys (would make the related APIs take a
>> DOMString or DOMStringList)
>>
>
> I prefer to leave composite keys to a future version.
>
>
>
>>
>> 2. Values
>>
>> a.       3.1.2: isn't the requirement for "structured clones" too much? It
>> would mean implementations would have to be able to store and retrieve File
>> objects and such. Would it be more appropriate to say it's just graphs of
>> JavaScript primitive objects/values (object, string, number, date, arrays,
>> null)?
>>
>
> Your list leaves out File, Blob, FileList, ImageData, and RegExp types.
> While I don't feel so strongly about all these types, I believe that support
> for Blob/File and ImageData will be beneficial to those who work with
> browsers. Instead of profiling this algorithm, I think it is best to just
> require the same algorithm.
>
>
>
>>
>> 3. Object store
>>
>> a.       3.1.3: do we really need in-line + out-of-line keys? Besides the
>> concept-count increase, we wonder whether out-of-line keys would cause
>> trouble to generic libraries, as the values for the keys wouldn't be part of
>> the values iterated when doing a "foreach" over the table.
>>
>
> Certainly it is a matter of prioritizing among various requirements.
> Out-of-line keys enable people to store simple persistent hash maps. I think
> it would be wrong to require that data be always stored as objects. A
> library can always elide the availability of out-of-line keys if that poses
> a problem to its users.
>
>
>  b.      Query processing libraries will need temporary stores, which need
>> temporary names. Should we introduce an API for the creation of temporary
>> stores with transaction lifetime and no name?
>>
>
> Firstly, I think we can leave this safely to a future version. Secondly, my
> suggestion would be to provide a parameter to the create call to indicate
> that an object store being created is a transient one, i.e., not backed by
> durable storage. They could be available across different transactions. If
> your intention is to make these object stores unavailable across
> connections, then we can also offer a connection-specific transient object
> store.
>
> In general, it requires us to introduce the notion of create params, which
> would simplify the evolution of the API. This is also similar to how
> Berkeley DB handles various options, not just those related to creation of a
> Berkeley "database".
>
>
>  c.      It would be nice to have an estimate row count on each store. This
>> comes at an implementation and runtime cost. Strong opinions? Lacking
>> everything else, this would be the only statistic to base decisions on for a
>> query processor.
>>
>
> I believe we need to have a general way of estimating the number of records
> in a cursor once a key range has been specified. Kris Zyp also brings this
> up in a separate email. I am willing to add an estimateCount attribute to
> IDBCursor for this.
>
>
>  d.      The draft does not touch on how applications would do optimistic
>> concurrency. A common way of doing this is to use a timestamp value that's
>> automatically updated by the system every time someone touches the row.
>> While we don't feel it's a must have, it certainly supports common
>> scenarios.
>>
>
> Do you strongly feel that the manner in which optimistic concurrency is
> performed needs to be described in this spec? I don't.
>
>
>
>>
>> 4. Indexes
>>
>> a.       3.1.4 mentions "auto-populated" indexes, but then there is no
>> mention of other types. We suggest that we remove this and in the algorithms
>> section describe side-effecting operations as always updating the indexes as
>> well.
>>
>
> The idea is that an index is either auto-populated or not. If it is not
> auto-populated, it must be managed explicitly. This was a requirement we
> discussed to support complex cases such as composite keys.


Can you elaborate?  I don't understand what you mean by this.  And I agree
with Pablo that indexes that are not auto-populated do not seem like a
priority for the first version of the spec.  It seems like they add
complexity to the API and make maintaining database consistency more
difficult without any major benefit to the user... but maybe I'm missing
something?
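
For reference, this is roughly how the "auto-populated" case reads: every
put or remove on the store also updates each index derived from a key path,
so the two cannot drift apart. The sketch below is a hypothetical in-memory
model in TypeScript (SketchStore and its methods are made up for
illustration and are not the draft's API), and it assumes indexes are
created before any records are stored.

```typescript
// Hypothetical in-memory model of an auto-populated index; not the draft's API.
type SketchRecord = { [prop: string]: unknown };

class SketchStore {
  // primary key -> stored value
  private records = new Map<number, SketchRecord>();
  // key path -> (index key -> primary keys of matching records)
  private indexes = new Map<string, Map<string, number[]>>();

  // Assumption: called before any put(); existing records are not backfilled.
  createIndex(keyPath: string): void {
    this.indexes.set(keyPath, new Map<string, number[]>());
  }

  put(primaryKey: number, value: SketchRecord): void {
    this.remove(primaryKey); // drop stale index entries for an overwritten record
    this.records.set(primaryKey, value);
    this.indexes.forEach((index, keyPath) => {
      const indexKey = value[keyPath];
      if (typeof indexKey !== "string") return; // property absent: left unindexed here
      const bucket = index.get(indexKey) || [];
      bucket.push(primaryKey);
      index.set(indexKey, bucket);
    });
  }

  remove(primaryKey: number): void {
    const old = this.records.get(primaryKey);
    if (old === undefined) return;
    this.records.delete(primaryKey);
    this.indexes.forEach((index, keyPath) => {
      const indexKey = old[keyPath];
      if (typeof indexKey !== "string") return;
      const bucket = (index.get(indexKey) || []).filter((k) => k !== primaryKey);
      if (bucket.length > 0) index.set(indexKey, bucket);
      else index.delete(indexKey);
    });
  }

  lookup(keyPath: string, indexKey: string): number[] {
    const index = this.indexes.get(keyPath);
    if (index === undefined) return [];
    return index.get(indexKey) || [];
  }
}

const demoStore = new SketchStore();
demoStore.createIndex("email");
demoStore.put(1, { name: "a", email: "a@example.com" });
demoStore.put(1, { name: "a", email: "b@example.com" }); // overwrite moves the index entry
demoStore.put(2, { name: "c" });                         // no email: not indexed in this sketch
```

With an index that is not auto-populated, all of the bookkeeping inside put
and remove would fall to the application instead.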


> I am reluctant to remove this feature. I can certainly clean it up so
> things are clearer.
>
>
>  b.      If during insert/update the value of the key is not present (i.e.
>> undefined as opposed to null or a value), is that a failure, does the row
>> not get indexed, or is it indexed as null? Failure would probably cause a
>> lot of trouble to users; the other two have correctness problems. An option
>> is to index them as undefined, but now we have undefined and null as
>> indexable keys. We lean toward this last option.
>>
>
> I haven't seen enough application experience around this to suggest that
> treating undefined as null would be the right thing to do. Unfortunately,
> creating a little bit of trouble for programmers to handle their use of
> undefined keys seems like the only safe thing to do.
>
>
>  5.       Databases
>> a.       Not being able to enumerate database gets in the way of creating
>> good tools and frameworks such as database explorers. What was the
>> motivation for this? Is it security related?
>>
>
> Database explorers are best designed in the browser as an add-on or
> development tool. This would require additional interfaces not available to
> applications. This approach is consistent with usage experience around SQL
> databases and the database explorer built into Safari.


Is there any reason not to though?  I agree with Pablo that it'd be nice to
expose if there's no specific reason not to.


>  b.      Clarification on transactions: all database operations that
>> affect the schema (create/remove store/index, setVersion, etc.) as well as
>> data modification operations are assumed to be auto-commit by default,
>> correct? Furthermore, all those operations (both schema and data) can happen
>> within a transaction, including mixing schema and data changes. Does that
>> line up with others' expectations? If so we should find a spot to articulate
>> this explicitly.
>>
>
> The auto-commit mode, per my intention, is when an IDBDatabase object
> doesn't have a currentTransaction set. Is that what you meant?
>
> Moreover, in 3.2.9 I intended to allow the database itself to be identified
> as an object to be reserved for isolation from other transactions (in
> addition to the object stores and indexes). I can improve the spec text
> around this. This allows transactions in any of the three isolation modes to
> be used for schema operations in conjunction with data modification
> operations.
>
>
>  c.       No way to delete a database? It would be reasonable for
>> applications to want to do that and let go of the user data (e.g. a "forget
>> me" feature in a web site)
>>
>
> There is currently no way to delete a database through an API. I can
> clarify this further, if needed. Of course, user interfaces can be developed
> to remove a database just like a cookie can be removed. Also, this style is
> similar to the approach taken in SQL database. Are there particular use
> cases that require programmatic ability to remove databases?


When a website transitions from one schema to another, it seems as though
they might want to delete old cruft in a way that's simple but resilient.
As with enumerating databases, I don't understand what reason there is for
making this so opaque to the user.


>  6.       Transactions
>> a.       While we understand the goal of simplifying developers' life with
>> an error-free transactional model, we're not sure if we're doing more harm
>> by introducing more concepts into this space. Wouldn't it be better to use
>> regular transactions with a well-known failure mode (e.g. either deadlocks
>> or optimistic concurrency failure on commit)?
>>
>
> There has been prior discussion about this in the WG. I would suggest
> reading the thread on this [1]. I would be interested to see new
> implementation experience that either refutes or further supports a
> particular argument in that thread.
>
>
>  b.    If in auto-commit mode, if two cursors are opened at the same time
>> (e.g. to scan them in an interleaved way), are they in independent
>> transactions simultaneously active in the same connection?
>>
>
> In the case of auto-commit, there will not be simultaneous transactions,
> because each modification commits before any subsequent modification can
> occur.
>
>
>
>>
>> 7. Algorithms
>>
>> a.       3.2.2: steps 4 and 5 are inverted in order.
>>
>
> Agreed.
>
>
>  b.      3.2.2: when there is a key generator and the store uses in-line
>> keys, should the generated key value be propagated to the original object
>> (in addition to the clone), such that both are in sync after the put
>> operation?
>>
>
> This appears to be a thing that can easily be done by using the return
> value from the algorithm. I would like to not modify the object received to
> the extent possible.
>
>
>  c.       3.2.3: step 2, probably editorial mistake? Wouldn't all indexes
>> have a key path?
>>
>
> Nope. An index that is not auto-managed will not have a key path. See 3.1.4
>
>
>  d.      3.2.4.2: in our experiments writing application code, the fact
>> that this method throws an exception when an item is not found is quite
>> inconvenient. It would be much more natural to just return undefined, as this can
>> be a primary code path (to not find something) and not an exceptional
>> situation. Same for 3.2.5, step 2 and 3.2.6 step 2.
>>
>
> I am not comfortable specifying the API to be dependent on the separation
> between undefined and null. Since null is a valid return value, it doesn't
> make sense to return that either. The only safe alternative appears to be to
> throw an error.
>
> As a means of improving usability, I propose adding another method "exists"
> which takes the same arguments as "get" and returns true or false. If a
> program doesn't know for sure whether a key exists in the database, it can
> use the exists method to avoid an exception.
>
>
>  e.      The algorithm to put a new object into a store currently indicates
>> that the key of the object should be returned. How about other values that
>> may be generated by the store? For example, if the store generates
>> timestamps (not currently in the draft, but may be needed for optimistic
>> concurrency control), how would we return them? Should we update the actual
>> object that was passed as a parameter with keys and other server-generated
>> values?
>>
>
> It will only be possible to return one value from a call. Given that the
> domain we are in is key-value databases, it makes sense to return a
> generated key from a call to store a value.
>
>
>
>>
>> 8. Performance and API style
>>
>> a.       The async nature of the API makes regular scans very heavy on
>> callbacks (one per row plus completion/error callbacks). This slows down
>> scans a lot, so when doing multiple scans (e.g. a reasonably complicated
>> query that has joins, sorts and filters) performance will be bound by this
>> even if everything else happens really fast. It would be interesting to
>> support a block-fetch mode where the callback gets called for a number of
>> buffered rows (indicated when the scan is initiated) instead of being called
>> for a single row. This would be either a configuration option on openCursor
>> or a new method on the cursor for
>>
>
> This is an interesting direction in my opinion. I would like to explore
> this further, although it also appears suitable for an evolution of the API.
> I think, though, that it would require a different interface than IDBCursor,
> since that produces a key and a value at a time.


I'm glad you're looking into this as I agree it's a pretty major problem in
the current API.
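
To put a number on the callback overhead, here is a small TypeScript sketch
of what a block-fetch mode might look like from the caller's side. It is a
synchronous stand-in over an array, not the async API, and scanInBlocks, Row
and blockSize are hypothetical names; nothing like this exists in the draft.

```typescript
// Hypothetical row shape: one key and one value, as a cursor yields today.
interface Row {
  key: number;
  value: string;
}

// Deliver rows to the callback in blocks of up to blockSize rows and
// return how many callbacks were made.
function scanInBlocks(
  rows: Row[],
  blockSize: number,
  onBlock: (block: Row[]) => void
): number {
  if (blockSize < 1) throw new RangeError("blockSize must be at least 1");
  let callbacks = 0;
  for (let i = 0; i < rows.length; i += blockSize) {
    onBlock(rows.slice(i, i + blockSize)); // last block may be shorter
    callbacks++;
  }
  return callbacks;
}

const demoRows: Row[] = [];
for (let i = 0; i < 10; i++) demoRows.push({ key: i, value: "v" + i });

const seenKeys: number[] = [];
const perRowCallbacks = scanInBlocks(demoRows, 1, () => {});
const blockCallbacks = scanInBlocks(demoRows, 4, (block) => {
  for (const r of block) seenKeys.push(r.key);
});
```

Ten rows cost ten callbacks one row at a time and three callbacks in blocks
of four, and the saving grows with the block size.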


>  9. API
>>
>> a.       DatabaseSync.createIndex: what's the default for the unique
>> argument?
>>
>
> It should be added. This value is false.
>
>
>  b.      DatabaseSync.createObjectStore: what's the default for
>> autoIncrement?
>>
>
> It should be added. This value is false.
>
>
>  c.       DatabaseSync.openObjectStore: what's the default for mode?
>>
>
> It should be added. This value is IDBObjectStore.READ_WRITE.
>
>
>  d.      DatabaseSync.transaction: what are the units for the timeout value?
>> Seconds? Is there a value that means "infinite"?
>>
>
> Milliseconds. The lack of a timeout value in this call indicates a timeout
> limited only by the system's maximum timeout, which is implementation
> dependent.
>
>
>  e.      ObjectStoreSync.get: see 7.d (return undefined instead of throwing
>> an exception)
>>
>
> Please see my comments on this above.
>
>
>  f.        ObjectStoreSync: what happens to the reference if the underlying
>> store is deleted through another connection? We propose it's ok to alter
>> underlying objects in general and "visible" objects should be ready and
>> start failing when the objects they surface go away or are altered.
>>
>
> The spec does not manage integrity constraints. It does what you expect and
> fails if the read operation on an index cannot find the referenced object.
>
>
>  g.       CursorSync.openCursor: does the cursor start on the first record
>> or before the first record? Should probably be before the first record so
>> the first call to continue() can return false for empty stores, moving
>> straight from BOF to EOF.
>>
>
> Cursor starts on the first record. The call to continue is not required
> until after you are done with the first value. The call to continue should
> not be required, if you are going to only read the first value in a cursor.
>
>
>  h.      CursorSync.count: what scenario does this enable? Also, name is
>> misleading; should be sameKeyCount or something that indicates it's the
>> count only of the rows that share the current key.
>>
>
> The key count is easier to implement than maintaining or calculating the
> count of records in an object store or across a key range. However, it is
> not as interesting as the approximate number of records in a key range in a
> given database object. Given that, I am willing to consider treating count
> as what it alludes to - the approximate number of records in the cursor.


The key count does seem interesting... but I agree that it should be named
something other than "count".  If there's a total number of records in the
cursor, it should probably be an estimate and labeled as mentioned above.
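
The two numbers being discussed are easy to confuse, so here is a short
TypeScript sketch of the difference over a key-sorted list. The names
(Entry, sameKeyCount) are hypothetical, borrowing Pablo's suggested name,
and the array stands in for whatever the cursor iterates over.

```typescript
interface Entry {
  key: string;
  value: number;
}

// Number of records that share the key at the given position.
// Assumes entries are sorted by key, so equal keys are adjacent.
function sameKeyCount(entries: Entry[], position: number): number {
  if (position < 0 || position >= entries.length) {
    throw new RangeError("position is outside the entries");
  }
  const key = entries[position].key;
  let lo = position;
  while (lo > 0 && entries[lo - 1].key === key) lo--;
  let hi = position;
  while (hi < entries.length - 1 && entries[hi + 1].key === key) hi++;
  return hi - lo + 1;
}

const demoEntries: Entry[] = [
  { key: "a", value: 1 },
  { key: "b", value: 2 },
  { key: "b", value: 3 },
  { key: "b", value: 4 },
  { key: "c", value: 5 },
];

const countAtB = sameKeyCount(demoEntries, 2); // records sharing key "b"
const totalInRange = demoEntries.length;       // all records in the range
```

Here the same-key count at "b" is 3 while the range holds 5 records; the
latter is the one that would be an estimate in a real implementation.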


>  i.         CursorSync.value: when the cursor is over an index, shouldn't
>> the value be read-only as changing it would make it inconsistent with the
>> object store this index is for?
>>
>
> Changing the index, when the index is auto-populated, would make it
> inconsistent with the object store. However, integrity constraints are not
> enforced, so this will not be a problem. In case of auto-populated indexes,
> changing an index record is not allowed. I will update the text so this is
> clear.


I don't see why indexes without integrity constraints would be important
enough to include for the first version.  They seem to add unnecessary
complexity.

>  j.        CursorSync.continue(): does it return false when it reaches the
>> last record or when it's called *on* the last record and moves to EOF
>> (effectively moved past the last record)? If it's sitting in EOF, does it
>> "see" new inserts? (we assume not)
>>
>
> It returns false when it is called on the last record and moves to EOF.
> Inserts are not possible on a cursor.
>
>
>  k.       CursorSync.delete(): "delete" causes trouble, should be "remove"
>>
>
> Gotcha.
>
>
>  l.         CursorSync.delete(): what happens to the cursor position after
>> this function returns? One option would be to leave the cursor on the
>> deleted row, and fail all access attempts so only continue() can be called.
>>
>
> Exactly. That is the intended behavior. The text explaining this was lost
> in the most recent WD.
>
>
>  m.    IndexSync: the put/delete methods seem to enable users to modify the
>> index independently of the store, making them inconsistent. Given that the
>> only kind of index described is auto-populated, it doesn't seem appropriate
>> to have these.
>>
>
> An index may not be auto-populated. See earlier responses.
>
>
>  n.    Should we consider introducing an API that given an object and a
>> store returns the key to that object? that would avoid the need for knowing
>> the exact algorithm used to obtain the key from an object + path.
>>
>
> I would like to put that in the parking lot for now.
>
>
>
>>
>> 10.       API (async specifics)
>>
>> a.       Currently the async API is only available on the window object
>> and not to workers. Libraries are likely to target only one mode, in
>> particular async, to work across all scenarios. So it would be important to
>> have async also in workers.
>>
>
> I would be willing to edit this portion of the requirements, only once we
> have a stable API for the rest of the spec.


I strongly agree with Pablo on this one.


>  b.      DBRequest.abort(): it may not be possible to guarantee abort in
>> all phases of execution, so this should be described as a "best effort"
>> method; onsuccess would be called if the system decided to proceed and
>> complete the operation, and onerror if abort succeeded at stopping the
>> operation (with proper code indicating the error is due to an explicit abort
>> request). In any case ready state should go to done.
>>
>
> Will clarify. The ready state should go to DONE if the request completes
> and to INITIAL if it successfully aborted.
>
>
>  c.       The pattern where there is a single request object (e.g.
>> indexedDB.request) prevents user code from having multiple outstanding
>> requests against the same object (e.g. multiple 'open' or multiple
>> 'openCursor' requests). An alternate pattern that does not have this problem
>> would be to return the request object from the method (e.g. from 'open').
>>
>
> I will address this in a separate email.
>
>
>  d.      CursorRequest.continue(): this seems to break the pattern where
>> request.result has the result of the operation; for continue the operation
>> (in the sync version) is true/false depending on whether the cursor reached
>> EOF. So in async request.result should be the true/false value, the value
>> itself would be available in the cursor's "value" property,  and the success
>> callback would be called instead of the error one.
>>
>
> CursorRequest does carry the result of performing an operation on the
> cursor, i.e., continue. I think we are both agreeing on what the value of
> request.result ought to be.
>
>
>
>>
>> 11. API Names
>>
>> a.       "transaction" is really non-intuitive (particularly given the
>> existence of currentTransaction in the same class). "beginTransaction" would
>> capture semantics more accurately.
>>
>
> Propose openTransaction() to be consistent.
>
>
>  b.      ObjectStoreSync.delete: delete is a JavaScript keyword, can we use
>> "remove" instead?
>>
>
> Yes
>
>
>
>>
>> 12. Object names in general
>>
>> a.       For database, store, index and other names in general, the
>> current description in various places says "case sensitive". It would be
>> good to be more specific and indicate "exact match" of all constructs (e.g.
>> accents, kana width). Binary match would be very restrictive but a safe
>> target. Alternatively we could just leave this up to each implementation,
>> and indicate non-normatively what would be safe pattern of strings to use.
>>
>
> Prefer to perform UTF-8 comparison.
>
>
>
>>
>> 13. Editorial notes
>>
>> a.      Ranges: left-right versus start-end. "bound" versus "closed" for
>> intervals.
>>
>
> The terms are well defined in mathematics and unlikely to cause confusion.
> See [2]
>
>
>  b.      Ranges: bound, "Create a new right-bound key range." -> right &
>> left bound
>>
>
> Correct.
>
>
>  c.       3.2.7 obejct -> object
>>
>
> Gotcha
>
>
>  d.      The current draft fails to format in IE, the script that comes
>> with the page fails with an error
>>
>
> I am aware of this and am working with the maintainer of ReSpec.js tool to
> publish an editor's draft that displays in IE.  Would it be OK if this
> editor's draft that works in IE is made available at an alternate W3C URL?
>
> [1]
> http://lists.w3.org/Archives/Public/public-webapps/2009JulSep/0240.html
> [2] http://en.wikipedia.org/wiki/Interval_%28mathematics%29#Terminology
>
Received on Monday, 1 February 2010 09:30:41 UTC