Re: Data models and filtering (was: Re: Contacts API draft) from Jonas Sicking on 2013-01-18 (public-sysapps@w3.org from January 2013)

From: Jonas Sicking <jonas@sicking.cc>
Date: Fri, 18 Jan 2013 01:21:48 -0800
To: Zoltan Kis <zoltan.kis@intel.com>
Cc: Wayne Carr <wayne.carr@intel.com>, Christophe Dumez <christophe.dumez@intel.com>, Sakari Poussa <sakari.poussa@intel.com>, "public-sysapps@w3.org" <public-sysapps@w3.org>, EDUARDO FULLEA CARRERA <efc@tid.es>, JOSE MANUEL CANTERA FONSECA <jmcf@tid.es>, Tantek Çelik <tantek@mozilla.com>
Message-ID: <CA+c2ei_eB0HA1VpHS82gjdU-dzcpWP=gQauoWbSm3U5J7gM1NA@mail.gmail.com>

On Jan 7, 2013 12:54 AM, "Kis, Zoltan" <zoltan.kis@intel.com> wrote:
> On Fri, Jan 4, 2013 at 12:31 PM, Jonas Sicking <jonas@sicking.cc> wrote:

I think there are two main issues that we are discussing in this thread:

* What type of filtering to allow when querying the database.
* How to build a synchronization mechanism to allow an application to
store the data locally such that it can perform queries which aren't
supported by the API.

I think the second point we'll have an easier time to get to agreement
on since I think we agree on the capability set there. So let's attack
the first point separately.

> > So for example I don't think that moving all filtering functionality
> > into the API with the argument "it can be done faster in the
> > implementation because the implementation can be written in a faster
> > language" is always a good argument. That argument leads to enormous
> > APIs and often less flexible ones.
>
> Specifying filters in the API makes possible an end-to-end data query
> optimization, since you can carry everything "what you want" down to
> the DB level and slice and translate it there to the specifics of your
> DB implementation. If I understood right, in your case you'd bring up
> the DB specifics to the app level and make optimizations in the app,
> that's why you don't see advantage in it.
>
> BTW I don't think the filtering API as specified in Messaging and
> Contacts is complex. We have implemented them a while back and they
> match nicely to SQL, SPARQL and proprietary DB searches.

> > Indeed. While implementations can add indexes to support some
> > filtering in the API, it quickly gets to the point when too many
> > indexes are needed to ensure that the API is quick in all cases.
>
> This issue is solved in most database implementations.

This is not the case with any of the databases that I have worked
with. But I admit I'm not a database expert.

Please note that simply translating the filter call into SQL gives no
guarantees that the query will execute with acceptable performance.

All SQL engines that I've ever worked with will happily fall back to
doing a full table scan if it can't find the proper indexes to use.

So let's get into specifics. How would you write an implementation
for, for example, a contacts API which stores 5 different properties
and supports a find function which allows filtering on a single
arbitrary property. I.e. no composited filters allowed, only filters
which say that property X should have value Y?

Which indexes would you create and what types of indexes?

How about if the API supports a find function which supports
composited filters on two arbitrary properties. I.e. filters that say
that property X should have value Y and property A should have value
B.

Again, which indexes of what type would you create?

Keep in mind that we have agreed that performance of the find function
should be roughly proportional to the number of entries returned.

Then more detail about the implementation algorithms you'd use, the better.

I can't think of a way to do this with less than 25 b-tree indexes. 30
if we require that the result is sorted by first name. Twice that if
we allow sorting by first or last name.

I feel like we are either talking past each other, or there is some
type of optimization technique that I don't know of.

I actually feel like we are quite far away from each other in our
proposals on this topic. Probably at least in part because I don't
understand how to implement your proposal with the performance
constraints that I would like to have.

My proposed approach definitely puts a lot more responsibility on the
application, but I think this is a necessity due to limitations in
what we *can* implement with reasonable performance. And I also think
that the amount of use cases you end up being able to support with
more complex filtering mechanisms is actually relatively small. Which
is why we'll need the sync mechanism.

/ Jonas

Received on Friday, 18 January 2013 09:22:45 UTC