Re: [IndexedDB] Computed indexes from Jeremy Orlow on 2010-06-24 (public-webapps@w3.org from April to June 2010)

From: Jeremy Orlow <jorlow@chromium.org>
Date: Thu, 24 Jun 2010 15:01:57 +0100
To: Jonas Sicking <jonas@sicking.cc>
Cc: Webapps WG <public-webapps@w3.org>
Message-ID: <AANLkTilYNKjYMfQIBKjZ5WFGMzs0802IIpZMkjpseB-4@mail.gmail.com>
On Sat, Jun 19, 2010 at 9:05 AM, Jonas Sicking <jonas@sicking.cc> wrote:

> On Fri, Jun 18, 2010 at 7:42 PM, Jeremy Orlow <jorlow@chromium.org> wrote:
> > On Thu, Jun 17, 2010 at 3:25 PM, Jonas Sicking <jonas@sicking.cc> wrote:
> >>
> >> Hi All,
> >>
> >> We've debated a bit use cases like storing objects like:
> >>
> >> { name: "Elvis", born: "January 8, 1935", died: "August 16, 1977" }
> >> { name: "Gustav III", born: "24 January 1746", died: "29 March 1792" }
> >>
> >> And create an index based on the age at time of death. Similarly,
> >> store HTML documents and index on the URLs of the outgoing links in
> >> the documents.
> >>
> >> The way to solve this use case in the current draft is by creating an
> >> index without a keyPath and then manually insert things into that
> >> index as needed. However this has some significant downsides:
> >>
> >> * First of all it's more manual labor for the author to everywhere
> >> where an entry is added also add things to the index.
> >> * It's a bit unclear who is responsible for removing things from the
> >> index. The spec currently says that it's forbidden to add things to
> >> the index which doesn't have a corresponding entry in the objectStore,
> >> however it doesn't forbid removing an entry from the objectStore which
> >> has entries in the index pointing to it. But it also doesn't say that
> >> entries in the index are automatically removed when an entry in the
> >> objectStore is removed.
> >> * If the author is responsible for removing things from the index,
> >> then this could mean having to compute all the index values both on
> >> insertion and on removal into the objectStore.
> >> * Unless the author is prepared to add a lot of logic to his code, it
> >> means that all indexes have to be immediately updated whenever an
> >> insertion/removal takes place. This can be very suboptimal if the
> >> objectStore is modified several times between accesses to a given
> >> index. Compare this to indexes that use a keyPath where the
> >> implementation could lazily only insert things into an index when the
> >> index is accessed.
> >> * It requires all code that inserts and modifies an objectStore to be
> >> aware of all the keyPath-less indexes attached to the objectStore and
> >> how to update them.
> >
> > You missed one: if your index is non-unique, then the only way to update
> it
> > is by using a cursor...which is even more async calls for every update
> > and/or much more complicated logic to batch them up.
>
> Indeed.
>
> >> We have talked about two solutions to this problem so far:
> >>
> >> Upon calls to .add/.put/.update can require that the author passes in
> >> lists of values which are to be used as keys in the various indexes
> >> that are attached to the objectStore. This solves some of the problems
> >> mentioned above, but most of them are not addressed.
> >
> > Which ones are not addressed?  I'm not sure I agree.
>
> The following ones would seem to still apply:
>
> * First of all it's more manual labor for the author to everywhere
> where an entry is added also add things to the index.
> (Though the labor is significantly simpler and should result in much less
> code)
>

Agreed, assuming you can do everything with keyPath that you could do
manually with this implementation.


> * Unless the author is prepared to add a lot of logic to his code, it
> means that all indexes have to be immediately updated whenever an
> insertion/removal takes place. This can be very suboptimal if the
> objectStore is modified several times between accesses to a given
> index. Compare this to indexes that use a keyPath where the
> implementation could lazily only insert things into an index when the
> index is accessed.
> (actually, this one applies even more as it is impossible to comput
> the index values lazily as the have to be supplied on
> insertion/update)
>

I agree, though I think we have _much_ bigger fish to fry for v1.  Enough
so, that I'd actually like to see some hard examples of where updating the
indexes really becomes a performance concern.  My guess is that by the time
we add new features and tune what we have in IndexedDB to the point that
this becomes a concern, things will have changed enough that we'll have
different ideas on how best to optimize this case.  I feel pretty strongly
that we should not worry about this for now.


> * It requires all code that inserts and modifies an objectStore to be
> aware of all the keyPath-less indexes attached to the objectStore and
> how to update them.
>

This is definitely unfortunate.


>> We could allow a more complex expression to be used which would be
> >> evaluated for each entry in the objectStore and which produce the set
> >> of keys to be used for that index. This solves all the problems
> >> mentioned above, but has some problems of its own. In particular:
> >>
> >> 1. We can't allow a JS function to be passed in as that would pull in
> >> all the scope of that function which generally has a lifetime much
> >> shorter than the index. Thus we would require that the expression is
> >> passed in the form of a string which is somewhat ugly.
> >> 2. We have to define edge cases around what happens if the expression
> >> modifies the passed in value.
> >> 3. There are performance concerns if we want to allow lazy population
> >> of indexes since objects will have to be deserialized from the
> >> objectStore in order to be used by the expression
> >
> > Really?  Why would people be doing this lazily?  It seems like the next
> time
> > you access the array, you have to do this work.  I really don't see what
> > you'd save by doing this lazily.  Am I missing something?
>
> I assume that by "array" you mean "index", right?
>
> If you have an index that is only rarely used, much more rarely than
> you made modifications to the table, then it makes a lot of sense to
> only update the index once data is being read from it.
>
> >> 4. Maciej expressed concern that this might make it impossible to
> >> expose IndexedDB to non-JS languages such as ObjectiveC
> >>
> >> Let me address these in order (for the purposes of this discussion
> >> I'll use a separate function to create one of these indexes called
> >> createExpressionIndex. I'd prefer to do the bike-shedding afterward if
> >> possible):
> >> While I think 1 is unfortunate, i don't think it's a big deal.
> >> Fortunately Javascript has the nice feature that it allows functions
> >> to be serialized into strings, which means that while you can't do
> >> myObjectStore.createExpressionIndex("myIndex", function(val) { return
> ....
> >> });
> >> you can do
> >> myObjectStore.createExpressionIndex("myIndex", (function(val) { return
> >> .... }).toString());
> >
> > Do most JS engines allow _any_ function to be stored away for later?
>
> If you mean if _any_ function can be serialized using toString(), then
> yes, I believe so. Though of course serializing loses all scope
> references so it's not guaranteed to work. And non-builtin functions.
> Functions like Array.sort or Document.getElementById can of course not
> be serialized.
>

This may be true of JavaScript and even many other interpreted languages,
but it's not true of many (most?) compiled languages.  How would we handle
this in the spec?  Would we say it's unsupported in v1 or say that such
implementations would define their own language (or use an existing
interpreted language) as they see fit?  I guess this isn't a deal breaker
since JavaScript really is the main target for this, but it's somewhat
unfortunate.

> If so, why doesn't the structured clone algorithm allow them?
>
> You'd have to ask the editor, but I can think of a couple of reasons:
>
> 1. This way the result of a structured clone is pure data which has
> security benefits. This way you can't XSS a page (as easily) by
> sending it an evilly formed function using postMessage.
> 2. See the loss of scoping above.
>
> > Maybe we could make
> > the global scope be empty when compiling (so variables can't be bound)
> and
> > executing the function?
>
> Yes, we absolutely should. Forgot to mention this.
>
> > What is the difference between an "expressionIndex" and a keyPath?  It
> seems
> > like they're doing about the same thing.  My proposal to allow keyPath to
> be
> > artibrary JavaScript and then have the value being inserted be the global
> > scope actually sounds almost identical to what you're doing....except
> more
> > in the relm of what JS engines already do (since it's much like an eval).
>
> It is about the same thing yes. Though presumably an implementation
> could optimize "evaluating" the keyPath heavily. Much more so than a
> generic expression.
>

Howso?  Any implementation should be able to cache the compiled/JITed form
of the string passed into it's internal implementation of eval.

Allowing keyPath to be an arbitrary expression turns out to result in
> weird edge cases once you define all details. Say that you're storing
> the value:
>
> { foo: 5, bar: 10 }
>
> You want to evaluate the expression in a context where "foo" and "bar"
> are in scope. But you also don't want to evaluate with the value as
> the global object as then "a = 5" would modify the value as that sets
> a global variable. I.e. something like "t = foo*foo; bar + t;" would
> add a 't' property to value. And you also don't want to evaluate the
> expression in a context where the value is the current scope as then
> "var a = 5" would modify the value.
>
> Additionally, you have to define what value is returned. We could do
> like 'eval' is doing and say that the last evaluated expression is
> what is returned. However eval doesn't let you return early using the
> 'return' statement which might be practical.
>
> These problems could possibly be overcome though. I'll have to check
> with people that know JS better than me. Ideas welcome.
>

You're right that simply making it the global scope won't work unless we
make a copy of it the global scope or do copy-on-write.  The former being
heavy-weight at run time and the latter being a headache implementation
wise.  That said, for 98% of the cases (where the expression is simple),
having the key path just be an expression would be simpler.  One idea that
just came to mind is what if we supported both (based on whether a string or
function were passed in)...?


> >> 2 might not be a big issue if the implementation is using lazy index
> >> population anyway. In this case the implementation will have to
> >> deserialize the value from the objectStore anyway which means that any
> >> modifications to the value won't affect any other indexes or what is
> >> stored in the database.
> >
> > But it forces you to create a clone before hand...or somehow implement
> copy
> > on write semantics.
>
> Deserializing effectively creates a clone, yes.
>

The point is that creating a copy will largely negate any
performance benefits you would get unless the algorithm for calculating the
index is very large and the amount of data being used for the calculation is
very small.

>> 3 is a very valid concern. We'll have to do measurements on how big of
> >> an issue this is, but my gut instinct is that deserializing can be
> >> made pretty fast.
> >>
> >> There are several solutions to 4. The simplest one is to simply expose
> >> the exact same API to ObjectiveC as is exposed to javascript. I.e. if
> >> ObjectiveC calls the createExpressionIndex it too will have to pass in
> >> a string which contains a javascript expression which will be used to
> >> populate indexes. When ObjectiveC simply reads data through an index
> >> it is not affected by javascript executing behind the scenes.
> >> Alternatively, if you want to keep things as pure-ObjectiveC, then we
> >> can define a WebIDL type which maps to a string in the javascript
> >> bindings, and to a callback in ObjectiveC.
> >
> > I don't quite understand you here, but forcing all IndexedDB
> implementations
> > to depend on JavaScript seems to be a non-starter.  Having any language
> take
> > in its own native function is an interesting idea, but then each
> IndexedDB
> > database is tied to whatever language it was created in which seems a
> little
> > odd--but maybe OK in practice.
>
> Indeed.
>
> >> All in all I think this is a pretty cool solution. My main concern is
> >> 3 listed above, but I'd like to measure to see if this is a really big
> >> problem before tossing this idea out.
> >>
> >> Please let me know what you think.
> >
> > I only see superficial differences between this and my keyPath proposals
> > that I made a while ago.  Am I missing something?
>
> Not at all, that's why I said "We have talked about two solutions so far"
> :)
>
> We didn't really come to a conclusion in the previous thread. That's
> why I wanted to collect all the information into this thread in an
> attempt to make a decision.
>

In that case, I feel like you should have put more effort into explaining
previous ideas and comparing/contrasting your idea with them.  Especially
since many of the potential problems you brought up here were ones you
brought up in the previous thread, and many of the reasons why you're not
worried about them also apply to the suggestion in the previous thread.



On Thu, Jun 24, 2010 at 12:11 AM, Jonas Sicking <jonas@sicking.cc> wrote:

> On Sat, Jun 19, 2010 at 1:05 AM, Jonas Sicking <jonas@sicking.cc> wrote:
> >> Maybe we could make
> >> the global scope be empty when compiling (so variables can't be bound)
> and
> >> executing the function?
> >
> > Yes, we absolutely should. Forgot to mention this.
> >
> >> What is the difference between an "expressionIndex" and a keyPath?  It
> seems
> >> like they're doing about the same thing.  My proposal to allow keyPath
> to be
> >> artibrary JavaScript and then have the value being inserted be the
> global
> >> scope actually sounds almost identical to what you're doing....except
> more
> >> in the relm of what JS engines already do (since it's much like an
> eval).
> >
> > It is about the same thing yes. Though presumably an implementation
> > could optimize "evaluating" the keyPath heavily. Much more so than a
> > generic expression.
> >
> > Allowing keyPath to be an arbitrary expression turns out to result in
> > weird edge cases once you define all details. Say that you're storing
> > the value:
> >
> > { foo: 5, bar: 10 }
> >
> > You want to evaluate the expression in a context where "foo" and "bar"
> > are in scope. But you also don't want to evaluate with the value as
> > the global object as then "a = 5" would modify the value as that sets
> > a global variable. I.e. something like "t = foo*foo; bar + t;" would
> > add a 't' property to value. And you also don't want to evaluate the
> > expression in a context where the value is the current scope as then
> > "var a = 5" would modify the value.
> >
> > Additionally, you have to define what value is returned. We could do
> > like 'eval' is doing and say that the last evaluated expression is
> > what is returned. However eval doesn't let you return early using the
> > 'return' statement which might be practical.
> >
> > These problems could possibly be overcome though. I'll have to check
> > with people that know JS better than me. Ideas welcome.
>
> I talked to some javascript Gurus here and came up with this proposal:
>
> The string passed in is passed to the javascript Function constructor
> in the following manner:
>
> f = new Function("value", passedInExpressionString);
> Object.freeze(f);
>
> The resulting function is then executed for each value in the object
> store, and the returned value is used as key in the index. If the
> returned value is undefined, or if an exception is thrown, no entry is
> added to the index. The function is executed in a global scope which
> also is frozenand has no available properties other than the ones
> defined by the ECMAScript standard. I.e. things like document,
> XMLHttpRequest and setTimeout is not available.
>
> Here are some example expressions:
> Index on the result of adding two properties together:
> "return value.a + value.b;"
>
> Index on the number of days between two dates:
> "return (value.to - value.from) / 1000*60*60*24;"
>
> Index on a persons last name:
> "var names = value.name.split(/\\s*/); return names[names.length - 1];"
>
> The reason to freeze the global object and the function is to prevent
> the expression from retaining state from one execution to the next. We
> could also say that a fresh function object and a fresh global scope
> object is created, however that would result in slower execution.
>

It seems like this is something that could be optimized by implementations
if it does become an issue though.  Right?

So what your proposing is that the keyPath would essentially be a string of
the body of a function which runs for every index (on that objectStore) for
every value inserted into that object store?  This seems like half way
between the eval-like idea I mentioned earlier.  It certainly seems to have
advantages for complex keyPaths, but I'm still not so hot on having
boilerplate/assumptions (like needing "return" and assuming "value" is
present) present in every single keyPath.  Especially when the use cases
(while important) don't seem to be the common case.  (In fact, can you even
do this in SQL?  If not, I think it's pretty strong evidence against needing
to do arbitrary calculations in a keyPath.)

I'm not a JavaScript (or JS engine) expert, but I find it difficult to
believe that what you're proposing is much if any more optimizable than the
(for lack of better name) eval-keyPath-proposal.  And as long as there are
fundamental reasons why one can't be optimized, implementational challenges
really shouldn't impact how we design the API.  At all.

Anyway, I'll continue thinking about this, but I'm far from convinced that
allowing javascript (whether it's like the body of a function or like it's
evaled) is necessary for v1.  It adds a lot of complexity and I honestly
doubt we understand the problems well enough at the moment to come up with
the best solution.  I'd much rather punt for now.  But if you feel strongly
we need this for v1, I think we need to find some examples in SQL or other
database engines to base our implementation on.  Otherwise we're building an
API on top of a tower of assumptions.

J
Received on Thursday, 24 June 2010 14:02:55 UTC