- From: Jonas Sicking <jonas@sicking.cc>
- Date: Fri, 4 Jan 2013 02:31:03 -0800
- To: "Kis, Zoltan" <zoltan.kis@intel.com>
- Cc: "Dumez, Christophe" <christophe.dumez@intel.com>, Tantek Çelik <tantek@mozilla.com>, EDUARDO FULLEA CARRERA <efc@tid.es>, "public-sysapps@w3.org" <public-sysapps@w3.org>, Wayne Carr <wayne.carr@intel.com>, Sakari Poussa <sakari.poussa@intel.com>, JOSE MANUEL CANTERA FONSECA <jmcf@tid.es>
On Sat, Nov 10, 2012 at 6:34 AM, Kis, Zoltan <zoltan.kis@intel.com> wrote:
> A few key points from Jonas were:
> - performance is key issue, and JS-only implementations are valid use cases

I agree on the "performance is key" part. But I think I might have been unclear when talking about the JS-only implementations part.

I don't think the fact that our implementation is implemented entirely in JS is an argument against moving logic into the API. If we can make an API significantly better by requiring that it's implemented in a low-level language like C++, then I think that's an ok requirement to make. We should optimize for APIs that are good for authors, not APIs that are easy to implement.

This of course only applies up to a point. If implementations can't implement an API because we have too tough requirements, then the API isn't useful. But I don't think requiring a C++ implementation is too tough of a requirement.

My point with the argument about JS implementations is that JS is quite fast these days. We shouldn't move all logic into the API just because we assume that putting some logic in the app itself will make the API too slow. We have to draw a line at some point and say that some things will have to be solved with application-level logic, in some cases using JS libraries.

So for example, I don't think that moving all filtering functionality into the API with the argument "it can be done faster in the implementation because the implementation can be written in a faster language" is always a good argument. That argument leads to enormous APIs, and often less flexible ones. If the argument is that something can be done faster in the API, that needs to be backed up with data unless it's obviously the case, for example when a reasonable implementation has algorithmic advantages over what the page would need to do.
> - there is doubt that a general purpose query API is performant enough
> (in general) for the above use case, in order to build data models for
> apps at a rate required by the app

Indeed. I think it only makes sense to move querying features into the implementation if we expect that the implementation will be able to do the querying faster. For example, if the implementation can use an index which allows it to go directly to the 10 rows queried, rather than having to filter through all of the 10000 records in the database that backs the implementation.

Sure, the implementation might be able to filter on a property value faster than the application if both of them have to go through all 10000 records in the backing database. But the difference is likely a few percent rather than orders of magnitude. If either the implementation or the application has to go through 10000 records to find the 10 that are queried, then I think our solution is simply too slow and we haven't actually solved the problem.

A basic design constraint that I think we should have is that we should only support features in the query API where we can guarantee that the implementation can return results in time somewhat proportional to the number of rows returned. It's not acceptable that the execution time is proportional to the number of records in the database.

> - therefore there must be a method to cache or sync data (or a full
> database) at JS side in order to control performance-critical
> operations from there

Indeed. While implementations can add indexes to support some filtering in the API, it quickly gets to the point where too many indexes are needed to ensure that the API is quick in all cases.

For example, for the SMS API in Firefox OS we currently allow filtering on 4 different properties: date, number, read, direction (sent vs. received). The result is always returned sorted by date.
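To make that design constraint concrete, here is a small JavaScript sketch (purely illustrative; not any of the APIs under discussion) contrasting an indexed lookup with a full scan. The index answers the query while touching only the matching rows; the scan touches every record.

```javascript
// Build a toy call table of 10000 records. Every number occurs 10 times.
const records = [];
for (let i = 0; i < 10000; i++) {
  records.push({ id: i, number: "555-" + (i % 1000) });
}

// Build an index once: number -> list of matching record ids.
const byNumber = new Map();
for (const r of records) {
  if (!byNumber.has(r.number)) byNumber.set(r.number, []);
  byNumber.get(r.number).push(r.id);
}

// Index lookup: goes directly to the 10 matching rows.
const viaIndex = byNumber.get("555-42") ?? [];

// Full scan: touches all 10000 rows to find the same 10.
const viaScan = records.filter(r => r.number === "555-42").map(r => r.id);
```

Both return the same 10 ids, but only the index lookup does work proportional to the number of rows returned rather than the size of the database.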
In order to ensure that we can implement all possible combinations of filters, we would have to create 8 indexes:

  number, read, direction, date
  number, read, date
  number, date
  number, direction, date
  read, direction, date
  read, date
  direction, date
  date

Even if we pack these pretty tightly, they easily add up to more data than is in the actual messages. I.e. the indexes have doubled the amount of data needed to store the message database. And that's just the size. Each modification to the database also requires updating all indexes, so now inserting a message requires updating 9 btrees instead of 1.

All of this is needed even if only one or two of the indexes are actually ever used. This is because the API can't know what filters an application might want to use, but needs to be able to be responsive once they are used. And we'll likely soon have to add support for filtering on sending status ("sending" vs. "sent" vs. "error") as well as delivery status, which would quickly make the number of indexes grow out of hand.

In theory we can be a bit smarter than this and reduce the number of indexes by taking advantage of the fact that two of the fields here are boolean. So we can search through all possible values and do union operations over multiple calls of the same index. But that wouldn't be possible if more fields were string fields rather than booleans. With something like a Contacts API, the number of indexes needed to support fast queries of all possible filters ends up being absolutely absurd.

And even with all this, we still can't support even the most common UIs that SMS applications have today. Most SMS applications today display a "thread overview". I.e. you can see a list of people that you have communicated with, and possibly see if you have any unread messages from each person and/or what the latest sent or received message was.
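Where the 8 comes from: one compound index per subset of the three filterable fields, with date appended as the trailing sort key, giving 2^3 = 8 indexes. A quick JavaScript sketch (illustrative only) that enumerates them:

```javascript
// Enumerate the compound indexes needed so that any combination of
// filters on {number, read, direction} can be answered with results
// already sorted by date: one index per subset, with "date" appended.
const filterFields = ["number", "read", "direction"];

const indexes = [];
const n = filterFields.length;
for (let mask = 0; mask < (1 << n); mask++) {
  const fields = filterFields.filter((_, i) => mask & (1 << i));
  fields.push("date"); // date is always the trailing sort key
  indexes.push(fields.join(", "));
}

// 2^3 = 8 indexes, from the bare "date" index up to
// "number, read, direction, date".
```

Adding a fourth filterable field (e.g. sending status) would double this to 16, which is why the index count grows out of hand so quickly.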
Even if we had all of the above indexes, the only way to build such a UI is to do a full scan of the whole database as soon as the SMS app is opened, which is simply not acceptable.

Another case that is extremely hard to support is full-text search. Complexities like internationalized word breaking and treating word variations (like plural forms) as equivalent mean that we would be forced to do something very generic, with the application having next to no control.

And note that none of this is helped even if the API is implemented in C++. The limitation here is in how much data needs to be scanned, which means that the operation will be IO bound rather than CPU bound. So it doesn't matter if you use multiple cores in the implementation; the limitation is in how quickly the full SMS database can be read from disk.

> - there was a proposal for this sync is done by polling delta's from
> data sources.

Yup. I believe that this is the only way to create an API which is generic enough to be application agnostic, while still being performant.

> I agree with these, with the following notes/additions:
> - some data, e.g. anything to do with close-to-real-time constraints
> must be managed from the native side - or the same side where the
> protocols are handled -, (e.g. call history, otherwise we risk losing
> incoming calls), and only _exposed_ to JS side.

I agree. Another reason that the implementation needs to manage the data is so that if you install a new app, that app can pull down all the already-existing data.

> One could say that in these cases the middleware could maintain a
> local sync cache, which is erased after syncing to JS side, so this
> can be solved. However, there may be multiple clients for the data,
> which can be native or JS or both, so the requirement is that if an
> implementation choses to replicate/sync data to the JS side, must keep
> it in 2-way sync with the origin of the data (i.e.
> changes done on JS side are synced back to native side).

I wouldn't think of it as a 2-way sync, but rather as the application keeping a write-through cache. I.e. the application can cache whatever data it wants, but any time it wants to modify something, it needs to make an explicit call to the API and describe exactly what data it wants modified.

So in a contacts manager app, if the application wants to change the phone number for a given contact, the application would be responsible for doing the appropriate write for the appropriate contact-id. If the application does some sort of lazy caching, or if it's only caching part of the data, it would be the responsibility of the application to ensure that it doesn't overwrite the parts of the contact that it does not intend to change.

We could let the application choose if it wants to be notified about the changes that it itself makes, or if it wants to write to the backend and then simply make the same modification in its local cache.

> - the performance bottleneck you mentioned is usually not in the
> native database itself (key-value store, SQL DB, semantic DB,
> distributed heterogenous DB, etc). In most cases, the bottleneck is in
> the way query results or data models are exposed to the JS side. A DB
> implementation in JS will certainly not be faster per se than a native
> implementation of similar design, but having it on JS side may be more
> efficient because the eventual IPC bottleneck (which is not mandatory
> to have, either). So I claim this performance issue is an
> implementation issue, and not an API issue.

I'm not sure I fully understand what you are saying here. But yes, I agree that the limitations are generally in how quickly you can query data from a database. I hope the description I gave above clarifies why I think it's simply impossible to create a generic query mechanism which is always fast, no matter what languages are used in the implementation.
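The write-through cache pattern described above could be sketched roughly like this in JavaScript. The ContactsBackend class and all names here are hypothetical stand-ins for whatever storage API sits underneath, not part of any proposal:

```javascript
// Stand-in for the platform-side store that owns the data.
class ContactsBackend {
  constructor() { this.store = new Map(); }
  save(contact) { this.store.set(contact.id, { ...contact }); }
  get(id) { return this.store.get(id); }
}

// The app-side cache: reads may be served locally, but every
// modification is an explicit, targeted write to the backend.
class WriteThroughCache {
  constructor(backend) {
    this.backend = backend;
    this.cache = new Map();
  }
  // Lazily fill the cache on read.
  get(id) {
    if (!this.cache.has(id)) {
      this.cache.set(id, this.backend.get(id));
    }
    return this.cache.get(id);
  }
  // Change exactly one field: write through to the backend, then make
  // the same modification locally. Unrelated fields are preserved.
  updatePhoneNumber(id, phone) {
    const updated = { ...this.backend.get(id), phone };
    this.backend.save(updated);
    this.cache.set(id, updated);
  }
}
```

The key property is that the app never pushes its whole cache back; it describes each modification explicitly, so a partial or lazy cache can't clobber fields it never loaded.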
> - at least when apps are in foreground, in addition to polling, data
> model updates should also be possible by events, with optional
> low-pass filters controlled by the app. This is a classical
> synchronization use case, and I would support both polling and
> (filtered-)event-firing mechanisms in the API.

Agreed.

> - what applications really need are data models for feeding their
> views, e.g. roster/contact list with presence information, or call
> history, or messaging conversation view, etc. Sources may be
> heterogenous. So, apps need to maintain 'live' objects describing
> their data model, and in case of big data, a high-fps viewport over
> that data. This is non-trivial to solve for the generic case. On the
> other hand, just providing data source API's and defer the solution to
> the JS side may not be enough either, since certain optimizations can
> only be done across a full vertical. What is important here is the
> freedom of choice for developers.

Not sure I understand everything here.

> - the W3C API's will also be used in products which primarily support
> native apps for call, messaging and contacts, have a database on
> native side, with enough optimizations that they could expose their
> native data models on JS side efficiently so that JS apps could access
> the same data, for any generic purpose. Under these conditions,
> generic filters from JS would work well and we should not prevent
> developers from doing so. This is a valid use case, too.

As long as you can ensure that all calls into the API can be implemented such that they reliably return results fast, then I agree. But I actually think that in implementations which sit upon an existing database which is serving "native apps", it'll be even harder to ensure that all calls from the API are served in a performant manner.

> Now the question is: could we support both use cases?
> (please :)
>
> One of my drafts for CallHistory looks like this (I replace
> CallHistory with DataSync for this example). It is an asynchronous
> API.

Let me present what I think a rough outline of a CallHistory API would look like. It's on the high end feature-wise of what I think is needed for something as simple as CallHistory, but it'll more clearly demonstrate how the pattern could be applied to other database-backed APIs.

dictionary CallHistoryEnumOptions {
  Date? start = null;
  Date? end = null;
  DOMString? number = null;
  long? limit = null;
  boolean reverse = false;
};

enum CallType {
  "received",
  "missed",
  "noanswer",
  "placed"
};

[Constructor]
interface CallHistoryEntry {
  readonly attribute long id;
  attribute Date start;
  attribute float length; // seconds
  attribute DOMString number;
  attribute CallType type;
};

enum CallHistoryChangeType {
  "added",
  "removed",
  "changed",
  "reset"
};

interface CallHistoryChangeEntry {
  readonly attribute long id;
  readonly attribute CallHistoryChangeType type;
};

interface CallHistory : EventTarget {
  DOMRequest enumerate(CallHistoryEnumOptions options);
  DOMRequest clear();
  DOMRequest save(CallHistoryEntry entry); // Maybe not needed
  DOMRequest remove(long id);

  DOMRequest startTrackingChanges();
  DOMRequest stopTrackingChanges();
  EventHandler onchangesavailable;
  DOMRequest getChanges();
};

When something is changed in the call history database, the implementation fires the "changesavailable" event, unless that event has already been fired since the last time the application called getChanges(). This event doesn't contain any information other than the name of the event. After that, a change is added to the change log for that app.

The implementation conceptually keeps separate change logs for separate apps, though this can of course be optimized internally. Only applications which have called startTrackingChanges() will have a change log. However, the fact that an application has called startTrackingChanges() is remembered across invocations of the app. I.e.
if an app calls startTrackingChanges() and is then closed, the implementation will still keep a change log for that app.

Once an application calls getChanges(), the contents of the change log are delivered to the app and the change log for that app is cleared. That doesn't affect any pending change logs for any other apps though.

There are a couple of ways that the change log is simplified as it is recorded. For example, if an entry is deleted, all previous "changed" entries for that id can be removed. And if there was an "added" entry for that id, then both the "added" and the "removed" entry can be removed as well. And if a "changed" entry exists for an id in the change log for an app and the entry is changed again, no additional "changed" entry needs to be added.

For APIs which store more complex data it might make sense to keep a more advanced change log. Rather than simply recording that an entry changed, we could also record which field was changed and what the new value of the field is. That obviously affects how entries in the change log can be collapsed when multiple changes are made to an entry. But in general we should start simple and see if simply tracking that the entry changed is sufficient.

The "reset" value is a special value which means that the API implementation ended up getting a change log which was larger than it was able to track. In that case the API can drop the whole change log and only add a single entry whose .type is set to "reset". When an application receives such a value, it needs to drop its current cache and re-query all data. This is expected to be a rare event, and we can make it even more rare by introducing system messages which allow the platform to wake up the application and sync cached data before the change log grows out of hand.

However, note that the API also supports doing extremely simple queries. It only allows filtering on phone number and date range, as well as supporting a limit on the number of records returned.
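The collapsing rules above could be sketched like this in JavaScript (illustrative only; recordChange and the log shape are hypothetical, not part of the proposed API). Each log record is { id, type } with type one of "added", "changed", or "removed":

```javascript
// Returns the per-app change log after recording one change,
// applying the collapsing rules as the change is recorded.
function recordChange(log, id, type) {
  if (type === "removed") {
    const wasAdded = log.some(e => e.id === id && e.type === "added");
    // A removal makes any earlier "changed" entries for this id
    // irrelevant, so drop them...
    const rest = log.filter(e => e.id !== id);
    // ...and if the entry was added within this same log, the "added"
    // and the "removed" cancel out entirely.
    if (!wasAdded) rest.push({ id, type: "removed" });
    return rest;
  }
  if (type === "changed") {
    // An existing entry for this id (whether "added" or "changed")
    // already tells the app to re-fetch the record, so no additional
    // "changed" entry is needed.
    if (log.some(e => e.id === id)) return log;
  }
  return [...log, { id, type }];
}
```

For example, an entry that is added, edited twice, and then deleted before the app ever calls getChanges() leaves no trace in the log at all.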
This can be implemented using only two indexes in the implementation's backing database, yet it allows for implementing most simple UIs. So some applications might not need to use the change-log feature at all, in which case they won't call startTrackingChanges() and the implementation won't need to store a change log for them.

And if an application wants to be able to display an overview of who I've been calling with and when the last call was, it can use the change-log feature to cache only that information. It could at the same time use the querying feature to display full call logs for individual numbers. That way only the minimum amount of information needs to be duplicated.

I don't know that I see any advantages of having a generic sync base class which is then inherited by APIs that want to support caching. I don't see any real advantages to developers, since it's unlikely to lead to more code reuse given how differently, for example, SMS data vs. Contacts data will be cached. Reusing the same patterns makes a lot of sense since that makes it easier for developers to reuse experience and design patterns, but I'm not sure that having common base classes will get us a whole lot.

The above is definitely somewhat rough. For example, we shouldn't use a constructor for CallHistoryEntry but rather a callback interface or dictionary. And we might not need the ability to filter on date ranges when querying the database; numerical limits might be enough. But this hopefully shows the basic idea.

/ Jonas
Received on Friday, 4 January 2013 10:32:00 UTC