From: Kis, Zoltan <zoltan.kis@intel.com>
Date: Fri, 18 Jan 2013 16:32:33 +0200
To: Jonas Sicking <jonas@sicking.cc>
Cc: Wayne Carr <wayne.carr@intel.com>, Christophe Dumez <christophe.dumez@intel.com>, Sakari Poussa <sakari.poussa@intel.com>, "public-sysapps@w3.org" <public-sysapps@w3.org>, EDUARDO FULLEA CARRERA <efc@tid.es>, JOSE MANUEL CANTERA FONSECA <jmcf@tid.es>, Tantek Çelik <tantek@mozilla.com>
On Fri, Jan 18, 2013 at 11:21 AM, Jonas Sicking <jonas@sicking.cc> wrote:

> Please note that simply translating the filter call into SQL gives no
> guarantees that the query will execute with acceptable performance.

I agree. Of course it does not give *guarantees*. Is there any generic or specific solution which can give guarantees? If so, we have invented the universal problem-solving machine... You also need a good query optimization engine, one which takes into account factors that individual apps usually don't know or care about (e.g. DB statistics). Even that cannot work wonders, of course, but it is usually quite good, and there is a lot of experience behind it.

If an implementation doesn't count on using one, then I agree it's better to use an IndexedDB-like design, manage the indexes separately, and do the query optimizations in the apps. We have actually done that too, and there are arguments supporting it, especially in a web runtime. However, there are disadvantages, as you also pointed out.

BTW, are you sure query execution time is the real issue here? The main problem I have been seeing on mobile devices was not so much (or not only) the query performance itself, but how to handle the result set (e.g. bulk transfer, or streaming plus updates), i.e. the data model level. Basically, a UI which needs a data model wants to see a filtered and updated/synchronized view of its data.

For instance, if I write a call history app, I know which queries I have to implement based on the user requirements: I want to see the call history sorted by decreasing time, by status (missed, missed-new, dialed, received), and possibly by contact or number (all my calls with a given number). My main query is to select all call history records in decreasing time order, limited to e.g. 3x my visible list size, resolving phone numbers to contacts on the fly; and to keep that query live-updated.
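To make the "main query" above concrete, here is a minimal sketch using SQLite from Python. The `calls` and `contacts` tables, their columns, and the visible-list-size constant are all illustrative assumptions, not taken from any SysApps proposal; the point is the shape of the query: decreasing-time order, a limit of 3x the visible rows, and number-to-contact resolution in the same statement.

```python
# Sketch of the call-history "main query" described above.
# Table and column names are hypothetical, chosen for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, number TEXT);
CREATE TABLE calls (id INTEGER PRIMARY KEY, number TEXT,
                    start_time INTEGER, status TEXT);
-- Index supporting the decreasing-time sort.
CREATE INDEX idx_calls_time ON calls(start_time DESC);
""")
conn.execute("INSERT INTO contacts (name, number) VALUES ('Alice', '555-1')")
conn.executemany(
    "INSERT INTO calls (number, start_time, status) VALUES (?, ?, ?)",
    [("555-1", 100, "received"), ("555-2", 200, "missed")])

VISIBLE_LIST_SIZE = 20  # fetch 3x the visible rows, as described above
rows = conn.execute("""
    SELECT c.start_time, c.status, COALESCE(ct.name, c.number) AS who
    FROM calls AS c
    LEFT JOIN contacts AS ct ON ct.number = c.number
    ORDER BY c.start_time DESC
    LIMIT ?
""", (3 * VISIBLE_LIST_SIZE,)).fetchall()
print(rows)  # → [(200, 'missed', '555-2'), (100, 'received', 'Alice')]
```

The LEFT JOIN keeps calls from unknown numbers in the result; the live-update part (re-running or incrementally maintaining this view) is exactly the open problem the mail goes on to discuss.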
If my call history is not in the same DB as the contacts, then whatever indexes I create and whatever query optimizations I have in the two DBs, performance is screwed. The only thing I can do is build and maintain my data model in memory, caching the data I need from both databases and building the necessary indexes. Or build a covering index. This adds latency at startup and needs more memory, but it works for call history, because there are not many calls per second :).

Now if I want a messaging app with a conversation view of peer-to-peer IM messages, then I need to access the messages and the contacts, and sort the messages on resolved contacts and time. There are likely many more messages than calls, and there are many more possible queries, depending on the conversation view requirements. What does the platform give me to ease that task? A set of hard-coded queries it supports, in the form of an API? A query language? Or mechanisms to manage the indexes and the query optimizations myself? What is good from the platform or product point of view (memory usage, latency, processing load, etc.)?

As a rule, in these cases we would build mechanisms rather than policies, and we would leave narrowing down the scope to the implementations rather than hard-coding it into the APIs. Performance tends to work against generality, consistency, and security. Choose a compromise, but not in the API: push it to the implementation *if possible*.

> All SQL engines that I've ever worked with will happily fall back to
> doing a full table scan if it can't find the proper indexes to use.

Of course, and then they scale linearly. But why shouldn't they create indexes? Insert performance? IMO the performance we are interested in is query-bound, not update-bound.

> So let's get into specifics. How would you write an implementation
> for, for example, a contacts API which stores 5 different properties
> and supports a find function which allows filtering on a single
> arbitrary property. I.e.
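The covering-index option mentioned above can be sketched concretely with SQLite: when every column a query touches is part of one index, the engine answers the query from the index alone and never reads the table. Table and index names here are assumptions for illustration; the `EXPLAIN QUERY PLAN` output confirms the behavior.

```python
# Sketch: a covering index lets SQLite answer the decreasing-time query
# without touching the base table. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (number TEXT, start_time INTEGER, status TEXT)")
# Every column the query below needs is in this one index,
# in the order the query wants it sorted.
conn.execute(
    "CREATE INDEX idx_cover ON calls(start_time DESC, status, number)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT start_time, status, number
    FROM calls ORDER BY start_time DESC LIMIT 60
""").fetchall()
# The plan's detail column reports a covering index scan,
# e.g. "SCAN calls USING COVERING INDEX idx_cover".
print(plan)
```

Of course, as the mail notes, this only helps when everything lives in one database; it does nothing for the cross-DB case.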
> no composited filters allowed, only filters which say that property X
> should have value Y?
>
> Which indexes would you create, and what types of indexes?

Sounds like an exam, which was way too long ago for me :). A hash index for each property in question, if the row count is not huge. BTW, you forgot to mention that all the indexes used in a search should fit in memory ;). So it may have to be a B-tree.

> How about if the API supports a find function which supports
> composited filters on two arbitrary properties. I.e. filters that say
> that property X should have value Y and property A should have value B.
>
> Again, which indexes of what type would you create?

Again, an index per property, with a specificity (selectivity) indicator. First get all records which match the more specific column, and while doing that, check the condition on the next most specific column for each record. You didn't mention result set ordering.

> Keep in mind that we have agreed that performance of the find function
> should be roughly proportional to the number of entries returned.

There is no good general solution for this. A practical trick is to use query limits, since at any given moment you need only a limited subset of the result set. Then you end up with what I have been preaching about: we need an efficient mechanism to provide the result set to the apps and keep it updated. We could even implement sharing the most used live data models in one place, rather than in every app separately. I would choose that battle, and keep the filtering API generic.

> The more detail about the implementation algorithms you'd use, the better.
>
> I can't think of a way to do this with less than 25 b-tree indexes. 30
> if we require that the result is sorted by first name. Twice that if
> we allow sorting by first or last name.

We don't even know our requirements... Are we targeting customers with 100K contacts, or ones with 1000 on average?
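The per-property-index strategy sketched above (pick the most selective index, then post-filter the remaining conditions) can be shown in a few lines. This is a toy in-memory model, not an implementation of any proposed API; the record fields and the `find` helper are invented for illustration.

```python
# Sketch of "use the more specific index, verify the rest per record".
# One hash index per property; all names here are hypothetical.
from collections import defaultdict

records = [
    {"id": 1, "first": "Ann", "city": "Oslo"},
    {"id": 2, "first": "Ann", "city": "Turku"},
    {"id": 3, "first": "Bea", "city": "Oslo"},
]

# One hash index per property: value -> list of record ids.
indexes = {prop: defaultdict(list) for prop in ("first", "city")}
by_id = {}
for r in records:
    by_id[r["id"]] = r
    for prop in indexes:
        indexes[prop][r[prop]].append(r["id"])

def find(**conds):
    # Pick the most selective condition (fewest candidate ids)...
    prop, value = min(conds.items(),
                      key=lambda kv: len(indexes[kv[0]][kv[1]]))
    candidates = indexes[prop][value]
    # ...then verify every remaining condition on each candidate.
    return [by_id[i] for i in candidates
            if all(by_id[i][p] == v for p, v in conds.items())]

print(find(first="Ann", city="Oslo"))
# → [{'id': 1, 'first': 'Ann', 'city': 'Oslo'}]
```

This keeps the index count linear in the number of properties (rather than one composite index per filter combination), at the cost of a post-filtering pass whose size depends on the selectivity of the chosen index, which is exactly the trade-off under debate.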
The most I've heard of was someone having around 6000 contacts, but let's take 10000 as a reasonable maximum, which comfortably fits into any phone's memory today. Compare your (more refined) solution with my admittedly very rough method on 1000 (extreme: 10000) typical contacts, and tell me the difference in query performance. (We ignore for now that performance is not the only requirement; data integrity, consistency, and security matter too, and other apps still need to work, e.g. we should still be able to receive incoming calls.)

> I feel like we are either talking past each other, or there is some
> type of optimization technique that I don't know of.

I hope it's neither ;).

> I actually feel like we are quite far away from each other in our
> proposals on this topic.

I don't think so. We are trying to solve the same problems, and that doesn't leave much room. Because of that, the proposals differ mainly in their formulation. Of the two, I prefer the more generic one, because it leaves more options open while permitting the same (or nearly the same) optimizations.

> Probably at least in part because I don't
> understand how to implement your proposal with the performance
> constraints that I would like to have.

I think this is because we went too far discussing the alleged problems before agreeing on the use cases. What are the performance constraints?

> My proposed approach definitely puts a lot more responsibility on the
> application, but I think this is a necessity due to limitations in
> what we *can* implement with reasonable performance.

What exactly do you intend to provide for applications in the API?

> And I also think
> that the amount of use cases you end up being able to support with
> more complex filtering mechanisms is actually relatively small.

Let's enumerate those use cases, based on the existing proposals. I have the feeling you are imposing that it be small, confining the platforms within that constraint.
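The scale comparison above (1000 to 10000 contacts) is easy to put a rough number on. This is a toy measurement, not a benchmark of either proposal: at this dataset size even a naive full scan over in-memory records completes in well under a millisecond on current hardware, which is the underlying reason the index-count question matters less on a phone than at server scale. All field names and sizes are made up for illustration.

```python
# Rough illustration of the scale argument: a full scan over 10,000
# in-memory contacts. Field names and values are invented.
import time

contacts = [{"first": f"name{i}", "tel": f"555-{i:05d}"}
            for i in range(10_000)]

t0 = time.perf_counter()
hits = [c for c in contacts if c["tel"] == "555-04242"]  # full scan
elapsed = time.perf_counter() - t0

print(len(hits), f"{elapsed * 1000:.2f} ms")
```

The timing will vary by machine, which is why the mail asks for a comparison rather than asserting absolute numbers.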
There are apps which will need complex filtering anyway; see the examples above. Still, since we don't know the use cases of future apps, it would be better to specify generic filters and narrow the scope in the documentation. IMO that would allow pretty good performance too, provided we solve result set iteration properly.

Best regards,
Zoltan
Received on Friday, 18 January 2013 14:33:03 UTC