Re: [IndexedDB] Computed indexes from Jonas Sicking on 2010-06-19 (public-webapps@w3.org from April to June 2010)

From: Jonas Sicking <jonas@sicking.cc>
Date: Sat, 19 Jun 2010 01:05:44 -0700
To: Jeremy Orlow <jorlow@chromium.org>
Cc: Webapps WG <public-webapps@w3.org>
Message-ID: <AANLkTik9oGaISKI-GX3FrJU5aIyzZ2wDz6dUDVFfjpaL@mail.gmail.com>
On Fri, Jun 18, 2010 at 7:42 PM, Jeremy Orlow <jorlow@chromium.org> wrote:
> On Thu, Jun 17, 2010 at 3:25 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>>
>> Hi All,
>>
>> We've debated a bit use cases like storing objects like:
>>
>> { name: "Elvis", born: "January 8, 1935", died: "August 16, 1977" }
>> { name: "Gustav III", born: "24 January 1746", died: "29 March 1792" }
>>
>> And create an index based on the age at time of death. Similarly,
>> store HTML documents and index on the URLs of the outgoing links in
>> the documents.
>>
>> The way to solve this use case in the current draft is by creating an
>> index without a keyPath and then manually insert things into that
>> index as needed. However this has some significant downsides:
>>
>> * First of all it's more manual labor for the author to everywhere
>> where an entry is added also add things to the index.
>> * It's a bit unclear who is responsible for removing things from the
>> index. The spec currently says that it's forbidden to add things to
>> the index which doesn't have a corresponding entry in the objectStore,
>> however it doesn't forbid removing an entry from the objectStore which
>> has entries in the index pointing to it. But it also doesn't say that
>> entries in the index are automatically removed when an entry in the
>> objectStore is removed.
>> * If the author is responsible for removing things from the index,
>> then this could mean having to compute all the index values both on
>> insertion and on removal into the objectStore.
>> * Unless the author is prepared to add a lot of logic to his code, it
>> means that all indexes have to be immediately updated whenever an
>> insertion/removal takes place. This can be very suboptimal if the
>> objectStore is modified several times between accesses to a given
>> index. Compare this to indexes that use a keyPath where the
>> implementation could lazily only insert things into an index when the
>> index is accessed.
>> * It requires all code that inserts and modifies an objectStore to be
>> aware of all the keyPath-less indexes attached to the objectStore and
>> how to update them.
>
> You missed one: if your index is non-unique, then the only way to update it
> is by using a cursor...which is even more async calls for every update
> and/or much more complicated logic to batch them up.

Indeed.

>> We have talked about two solutions to this problem so far:
>>
>> Upon calls to .add/.put/.update can require that the author passes in
>> lists of values which are to be used as keys in the various indexes
>> that are attached to the objectStore. This solves some of the problems
>> mentioned above, but most of them are not addressed.
>
> Which ones are not addressed?  I'm not sure I agree.

The following ones would seem to still apply:

* First of all it's more manual labor for the author to everywhere
where an entry is added also add things to the index.
(Though the labor is significantly simpler and should result in much less code)
* Unless the author is prepared to add a lot of logic to his code, it
means that all indexes have to be immediately updated whenever an
insertion/removal takes place. This can be very suboptimal if the
objectStore is modified several times between accesses to a given
index. Compare this to indexes that use a keyPath where the
implementation could lazily only insert things into an index when the
index is accessed.
(actually, this one applies even more as it is impossible to comput
the index values lazily as the have to be supplied on
insertion/update)
* It requires all code that inserts and modifies an objectStore to be
aware of all the keyPath-less indexes attached to the objectStore and
how to update them.

>> We could allow a more complex expression to be used which would be
>> evaluated for each entry in the objectStore and which produce the set
>> of keys to be used for that index. This solves all the problems
>> mentioned above, but has some problems of its own. In particular:
>>
>> 1. We can't allow a JS function to be passed in as that would pull in
>> all the scope of that function which generally has a lifetime much
>> shorter than the index. Thus we would require that the expression is
>> passed in the form of a string which is somewhat ugly.
>> 2. We have to define edge cases around what happens if the expression
>> modifies the passed in value.
>> 3. There are performance concerns if we want to allow lazy population
>> of indexes since objects will have to be deserialized from the
>> objectStore in order to be used by the expression
>
> Really?  Why would people be doing this lazily?  It seems like the next time
> you access the array, you have to do this work.  I really don't see what
> you'd save by doing this lazily.  Am I missing something?

I assume that by "array" you mean "index", right?

If you have an index that is only rarely used, much more rarely than
you made modifications to the table, then it makes a lot of sense to
only update the index once data is being read from it.

>> 4. Maciej expressed concern that this might make it impossible to
>> expose IndexedDB to non-JS languages such as ObjectiveC
>>
>> Let me address these in order (for the purposes of this discussion
>> I'll use a separate function to create one of these indexes called
>> createExpressionIndex. I'd prefer to do the bike-shedding afterward if
>> possible):
>> While I think 1 is unfortunate, i don't think it's a big deal.
>> Fortunately Javascript has the nice feature that it allows functions
>> to be serialized into strings, which means that while you can't do
>> myObjectStore.createExpressionIndex("myIndex", function(val) { return ....
>> });
>> you can do
>> myObjectStore.createExpressionIndex("myIndex", (function(val) { return
>> .... }).toString());
>
> Do most JS engines allow _any_ function to be stored away for later?

If you mean if _any_ function can be serialized using toString(), then
yes, I believe so. Though of course serializing loses all scope
references so it's not guaranteed to work. And non-builtin functions.
Functions like Array.sort or Document.getElementById can of course not
be serialized.

> If so, why doesn't the structured clone algorithm allow them?

You'd have to ask the editor, but I can think of a couple of reasons:

1. This way the result of a structured clone is pure data which has
security benefits. This way you can't XSS a page (as easily) by
sending it an evilly formed function using postMessage.
2. See the loss of scoping above.

> Maybe we could make
> the global scope be empty when compiling (so variables can't be bound) and
> executing the function?

Yes, we absolutely should. Forgot to mention this.

> What is the difference between an "expressionIndex" and a keyPath?  It seems
> like they're doing about the same thing.  My proposal to allow keyPath to be
> artibrary JavaScript and then have the value being inserted be the global
> scope actually sounds almost identical to what you're doing....except more
> in the relm of what JS engines already do (since it's much like an eval).

It is about the same thing yes. Though presumably an implementation
could optimize "evaluating" the keyPath heavily. Much more so than a
generic expression.

Allowing keyPath to be an arbitrary expression turns out to result in
weird edge cases once you define all details. Say that you're storing
the value:

{ foo: 5, bar: 10 }

You want to evaluate the expression in a context where "foo" and "bar"
are in scope. But you also don't want to evaluate with the value as
the global object as then "a = 5" would modify the value as that sets
a global variable. I.e. something like "t = foo*foo; bar + t;" would
add a 't' property to value. And you also don't want to evaluate the
expression in a context where the value is the current scope as then
"var a = 5" would modify the value.

Additionally, you have to define what value is returned. We could do
like 'eval' is doing and say that the last evaluated expression is
what is returned. However eval doesn't let you return early using the
'return' statement which might be practical.

These problems could possibly be overcome though. I'll have to check
with people that know JS better than me. Ideas welcome.

>> 2 might not be a big issue if the implementation is using lazy index
>> population anyway. In this case the implementation will have to
>> deserialize the value from the objectStore anyway which means that any
>> modifications to the value won't affect any other indexes or what is
>> stored in the database.
>
> But it forces you to create a clone before hand...or somehow implement copy
> on write semantics.

Deserializing effectively creates a clone, yes.

>> 3 is a very valid concern. We'll have to do measurements on how big of
>> an issue this is, but my gut instinct is that deserializing can be
>> made pretty fast.
>>
>> There are several solutions to 4. The simplest one is to simply expose
>> the exact same API to ObjectiveC as is exposed to javascript. I.e. if
>> ObjectiveC calls the createExpressionIndex it too will have to pass in
>> a string which contains a javascript expression which will be used to
>> populate indexes. When ObjectiveC simply reads data through an index
>> it is not affected by javascript executing behind the scenes.
>> Alternatively, if you want to keep things as pure-ObjectiveC, then we
>> can define a WebIDL type which maps to a string in the javascript
>> bindings, and to a callback in ObjectiveC.
>
> I don't quite understand you here, but forcing all IndexedDB implementations
> to depend on JavaScript seems to be a non-starter.  Having any language take
> in its own native function is an interesting idea, but then each IndexedDB
> database is tied to whatever language it was created in which seems a little
> odd--but maybe OK in practice.

Indeed.

>> All in all I think this is a pretty cool solution. My main concern is
>> 3 listed above, but I'd like to measure to see if this is a really big
>> problem before tossing this idea out.
>>
>> Please let me know what you think.
>
> I only see superficial differences between this and my keyPath proposals
> that I made a while ago.  Am I missing something?

Not at all, that's why I said "We have talked about two solutions so far" :)

We didn't really come to a conclusion in the previous thread. That's
why I wanted to collect all the information into this thread in an
attempt to make a decision.

/ Jonas
Received on Saturday, 19 June 2010 08:13:32 UTC