Re: Data models and filtering (was: Re: Contacts API draft)

On Fri, Jan 4, 2013 at 12:31 PM, Jonas Sicking <jonas@sicking.cc> wrote:

This is a long conversation. To summarize up front: we agree on most
issues; the question is what to do with the API. I still ask that we be
permissive towards hybrid native-web platforms.

> We should optimize for APIs that are good for authors, not APIs
> that are easy to implement.

Agreed.

>
> This of course only applies up to a point. At some point, if
> implementations can't implement an API because we have too tough
> requirements, then the API isn't useful.
>
> But I don't think requiring a C++ implementation is too tough of a requirement.

It is not at all a requirement, and should not be.

>
> My point with the argument about JS implementations is that JS is
> quite fast these days. We shouldn't move all logic into the API just
> because we assume that putting some logic in the app itself will make
> the API too slow. We have to draw a line at some point and say that
> some things will have to be solved with application level logic. In
> some cases using JS libraries.

If this logic affects other apps or the system, then it should be done
in one place. Whether that place is a library (native or JS) can be
left to the implementation. However, a specification can help by
describing the problem and the requirements for implementations. I
agree we should not bind our hands more than necessary (maybe defining
an API is overshooting) - but we do need to do it to a certain extent;
otherwise we might as well just say "please implement this or that
standard", period. APIs are meant to make complex things easier for
app developers to use, so we should try to ease their lives where
possible, instead of deferring the hard work to some to-be-defined
entity. Drawing that line is difficult.

>
> So for example I don't think that moving all filtering functionality
> into the API with the argument "it can be done faster in the
> implementation because the implementation can be written in a faster
> language" is always a good argument. That argument leads to enormous
> APIs and often less flexible ones.

Specifying filters in the API makes end-to-end query optimization
possible, since you can carry the full description of "what you want"
down to the DB level and slice and translate it there according to the
specifics of your DB implementation. If I understood correctly, in
your case you'd bring the DB specifics up to the app level and make
the optimizations in the app; that's why you don't see an advantage in
it.

BTW, I don't think the filtering API as specified in Messaging and
Contacts is complex. We implemented it a while back, and it maps
nicely to SQL, SPARQL and proprietary DB searches.
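To illustrate that mapping, here is a rough sketch of how a generic filter dictionary (in the spirit of the Contacts filter) could be translated into a parameterized SQL query. The field names, the `filterOp` values, and the `contacts` table are assumptions for illustration, not taken from any spec:

```javascript
// Sketch only: translate a generic filter object into parameterized SQL.
// Supported operators and the schema are hypothetical.
function filterToSql(filter) {
  const clauses = [];
  const params = [];
  if (filter.filterBy && filter.filterValue !== undefined) {
    for (const field of filter.filterBy) {
      switch (filter.filterOp) {
        case "equals":
          clauses.push(`${field} = ?`);
          params.push(filter.filterValue);
          break;
        case "startsWith":
          clauses.push(`${field} LIKE ?`);
          params.push(filter.filterValue + "%");
          break;
        default:
          throw new Error("unsupported filterOp: " + filter.filterOp);
      }
    }
  }
  const where = clauses.length ? " WHERE " + clauses.join(" OR ") : "";
  const limit = filter.filterLimit ? ` LIMIT ${filter.filterLimit}` : "";
  return { sql: "SELECT * FROM contacts" + where + limit, params };
}
```

An implementation sitting on SPARQL or a proprietary store would do the analogous translation in its own terms - which is exactly the end-to-end optimization point above.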

The problem with reimplementing databases is testing: millions of
tests are run for every minor release. But a man has to do what a
man's got to do :).

>> - there is doubt that a general purpose query API is performant enough
>> (in general) for the above use case, in order to build data models for
>> apps at a rate required by the app
>
> Indeed. I think it only makes sense to move querying features into the
> implementation if we expect that the implementation will be able to do
> the querying faster. For example if the implementation can use an
> index which allows it to go directly to the 10 rows queried, rather
> than having to filter through all of the 10000 records in the database
> that backs the implementation.
>
> Sure, the implementation might be able to filter on a property value
> faster than the application if both of them have to go through all
> 10000 records in the backing database. But the difference is likely a
> few percent rather than orders of magnitude.
>
> If either the implementation or the application has to go through
> 10000 records to find the 10 that are queried then I think our
> solution is simply too slow and we haven't actually solved the
> problem.

The former is the case in ours: the implementation can use an index to
go directly to the queried rows.

Now, having the filters in the API does not prevent any app from using
even a proprietary solution. Having them in the API would allow
fulfilling more use cases and supporting both native-web and web-only
architectures. If in some cases one can get better performance by not
using the filters in the API (they are optional), then feel free to do
otherwise, document it in the API, and that's it. The generic filters
can always be translated to simple filters based on the documentation.
So the same apps would still work on other platforms, though they
would work better on yours. Isn't that desirable? :)
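As a sketch of that down-translation (all names hypothetical): split the generic filter into the part the platform documents as supported, push that part to the DB, and apply the remainder in the app:

```javascript
// Sketch: partition a generic filter into the fields a simple back-end
// supports (pushed to the DB) and a residual applied app-side. The
// "supported" set would come from the platform's API documentation.
function splitFilter(filter, supportedFields) {
  const native = {};    // goes into the DB query
  const residual = {};  // applied in the app after fetching
  for (const [field, value] of Object.entries(filter)) {
    (supportedFields.has(field) ? native : residual)[field] = value;
  }
  return { native, residual };
}

// Apply the residual part as a simple equality post-filter.
function applyResidual(records, residual) {
  return records.filter(r =>
    Object.entries(residual).every(([f, v]) => r[f] === v));
}
```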

>
> A basic design constraint that I think we should have is that we
> should only support features in the query API where we can guarantee
> that the implementation can return results in time which is somewhat
> proportional to the number of rows returned. It's not acceptable that
> the execution time is proportional to the number of records in the
> database.

Agreed.

>
>> - therefore there must be a method to cache or sync data (or a full
>> database) at JS side in order to control performance-critical
>> operations from there
>
> Indeed. While implementations can add indexes to support some
> filtering in the API, it quickly gets to the point when too many
> indexes are needed to ensure that the API is quick in all cases.

This issue is solved in most database implementations. But device
makers should have the option to export the whole DB to the JS side
and manage the data entirely from there.

>
> For example for the SMS API in Firefox OS we currently allow filtering
> on 4 different properties: date, number, read, direction (sent vs.
> received). The result is always returned sorted by date.
>
> In order to ensure that we can implement possible combinations of
> filters we would have to create 8 indexes:
>
> number, read, direction, date
> number, read, date
> number, date
> number, direction, date
> read, direction, date
> read, date
> direction, date
> date
>
> Even if we pack this pretty tightly, it easily adds up to more data
> than is in the actual message. I.e. the indexes have doubled the
> amount of data needed to store the message database. And that's just
> the size. Each modification to the database also requires updating all
> indexes, so now inserting a message requires updating 9 btrees instead
> of 1.
>
> All of this is needed even if only one or two of the indexes are
> actually ever used. This because the API can't know what filters an
> application might want to use but needs to be able to be responsive
> once it's used.
>
> And we'll likely soon have to add support for filtering on sending
> status ("sending" vs. "sent" vs. "error") as well as delivery status.
> Which would quickly make the number of indexes grow out of hand.
>
> In theory we can be a bit smarter than this and reduce the number of
> indexes by taking advantage of the fact that two of the fields here
> are boolean. So we can search through all possible values and do union
> operations over multiple calls of the same index. But that wouldn't be
> possible if more fields were string fields rather than booleans.

I have the feeling you've been describing a subset of the query
optimization techniques that are standard in most databases.
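Just to make the combinatorics above concrete: the eight indexes quoted are the power set of the three filterable fields, each with the sort key appended. A small illustrative sketch (not implementation code):

```javascript
// Illustrative only: enumerate the compound indexes needed to serve every
// combination of equality filters, sorted by a fixed sort key.
function requiredIndexes(filterFields, sortKey) {
  const subsets = [[]];
  for (const field of filterFields) {
    // Each new field doubles the number of subsets.
    for (const s of subsets.slice()) subsets.push([...s, field]);
  }
  return subsets.map(s => [...s, sortKey]);
}

// requiredIndexes(["number", "read", "direction"], "date") yields 8 indexes;
// adding sending status and delivery status would push that to 32.
```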

>
> With something like a Contacts API, the number of indexes needed to
> support fast queries of all possible filters ends up being absolutely
> absurd.
>
> And even with all this, we still can't support even the most common
> UIs that SMS applications do today. Most SMS applications today
> display a "thread overview". I.e. you can see a list of people that
> you have communicated with, and possibly see if you have any unread
> messages from that person and/or what the latest sent or received
> message was. Even if we had all of the above indexes, the only way to
> build such a UI is to do a full scan of the database as soon as the
> SMS app is opened, which is simply not acceptable.

I know this pain, but there is a solution :). The messaging
conversation view was one of the main reasons I was pushing for a
system-wide database for connected data domains: Contacts, Messaging,
Call History, Calendar. We have done this in different ways over time
and across products, each of course with its own compromises, but with
pretty good results. The art is in how to hide these behind usable
APIs, so that device makers can choose their compromises on the
implementation side and we don't have any hard-coded compromises
imposed by the API.

>
> Another case that is extremely hard to support is full-text search.
> Complexities like internationalized word breaking and treating word
> variations (like plural forms) equivalent means that we would be
> forced to do something very generic with the application having next
> to no control.

Absolutely right. This is among the harder things to get right.

>
> And note that none of this is helped even if the API is implemented in
> C++. The limitations here is in how much data needs to be scanned
> which means that the operation will be IO bound rather than CPU bound.
> So it doesn't matter if you used multiple cores in the implementation,
> the limitation is in how quickly the full SMS database can be read
> from disk.
>
>> - there was a proposal that this sync be done by polling deltas from
>> data sources.
>
> Yup. I believe that this is the only way to create an API which is
> generic enough to be application agnostic, while still being
> performant.
>
>> I agree with these, with the following notes/additions:
>> - some data, e.g. anything to do with close-to-real-time constraints
>> must be managed from the native side - or the same side where the
>> protocols are handled -, (e.g. call history, otherwise we risk losing
>> incoming calls), and only _exposed_ to JS side.
>
> I agree. Another reason that the implementation needs to manage the
> data is so that if you install a new app, that app can pull down all
> the already-existing data.
>
>> One could say that in these cases the middleware could maintain a
>> local sync cache, which is erased after syncing to JS side, so this
>> can be solved. However, there may be multiple clients for the data,
>> which can be native or JS or both, so the requirement is that if an
>> implementation chooses to replicate/sync data to the JS side, it must
>> keep it in 2-way sync with the origin of the data (i.e. changes done
>> on the JS side are synced back to the native side).
>
> I wouldn't think of it as a 2-way sync, but rather as the application
> keeping a write-through cache.

I commented on this in the other mail. Indeed, we are talking about
two different use cases.

>
> I.e. the application can cache whatever data it wants, but any time it
> wants to modify something, it needs to make an explicit call to the
> API and describe exactly what data it wants modified. So in a Contacts
> manager app, if the application wants to change the phone number for a
> given contact the application would be responsible for doing the
> appropriate write for the appropriate contact-id.

That is easy. The harder part is to add a new contact card to the DB
(who generates the contact-id?).
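One plausible answer - purely a sketch, not something either spec says - is that the implementation owns id allocation and hands the new id back as the result of the save call. A toy store standing in for the middleware side:

```javascript
// Hypothetical sketch answering "who generates the contact-id?": the
// implementation assigns it atomically at save time and returns it to the
// app (e.g. as a DOMRequest result).
class ContactStore {
  constructor() {
    this.nextId = 1;          // monotonic counter, owned by the implementation
    this.records = new Map();
  }
  save(contact) {
    // New card: allocate an id; existing card: keep the caller's id.
    const id = contact.id !== undefined ? contact.id : this.nextId++;
    this.records.set(id, { ...contact, id });
    return id; // the app's handle for later updates and removals
  }
}
```

The key point is that the app never invents ids itself, so two apps adding cards concurrently cannot collide.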

>
> If the application does some sort of lazy caching, or if it's only
> caching part of the data, it would be the responsibility of the
> application to ensure that it didn't overwrite the parts of the
> contact that it does not intend to change.

Do you expect each app to devise its own scheme for that?

>
> We could let the application choose if it wants to be notified about
> the changes that it itself makes, or if it wants to write to the
> backend and then simply make the same modification in its local cache.

Write to local cache and sync the cache.

> I hope the description I gave above clarifies why I think it's simply
> impossible to create a generic query mechanism which is always fast.
> No matter what languages are used in the implementation.

As I said, there is a long history of database development, and
nowadays you can choose the compromise you want. I say it is possible
to do this with pretty good results :).

How is expressing this with a generic filter, and specifying in the
API documentation which fields the implementation supports, different
from explicitly enumerating those properties in the API?
The former is a more generic way _to describe_ it, versus hard-coding
it in the API. In both cases, the implementation needs to do the same
things.
The exception is when an app first asks for all contacts with the
surname John, and then for the ones who called yesterday (or vice
versa), and does the intersection itself. All I want to say is that it
is better to do the optimizations and joins in the DB, rather than in
the apps.

>> - what applications really need are data models for feeding their
>> views, e.g. roster/contact list with presence information, or call
>> history, or messaging conversation view, etc. Sources may be
>> heterogenous. So, apps need to maintain 'live' objects describing
>> their data model, and in case of big data, a high-fps viewport over
>> that data. This is non-trivial to solve for the generic case. On the
>> other hand, just providing data source API's and defer the solution to
>> the JS side may not be enough either, since certain optimizations can
>> only be done across a full vertical. What is important here is the
>> freedom of choice for developers.
>
> Not sure I understand everything here.

Above I made a few points regarding this - if it's still not clear,
please ask. I am sometimes not very good at expressing myself :).

>
>> - the W3C API's will also be used in products which primarily support
>> native apps for call, messaging and contacts, have a database on
>> native side, with enough optimizations that they could expose their
>> native data models on JS side efficiently so that JS apps could access
>> the same data, for any generic purpose. Under these conditions,
>> generic filters from JS would work well and we should not prevent
>> developers from doing so. This is a valid use case, too.
>
> As long as you can ensure that all calls into the API can be
> implemented such that they reliably return results fast, then I agree.
>
> But I actually think that in implementations which sit upon existing
> database which are serving "native apps" it'll be even harder to
> ensure that all calls from the API are served in a performant manner.

Is that doubt alone enough reason to ban generic filters from the API? :)
The question is: do they hinder your use cases? If not, let the native
side live, too :).

>
>> Now the question is: could we support both use cases? (please :)
>>
>> One of my drafts for CallHistory looks like this (I replace
>> CallHistory with DataSync for this example). It is an asynchronous
>> API.
>
> Let me present what I think a rough outline of a CallHistory API would
> look like. It's on the high end feature-wise of what I think is needed
> for something as simple as CallHistory, but it'll more clearly
> demonstrate how the pattern could be applied to other database-backed
> APIs.
>
> dictionary CallHistoryEnumOptions
> {
>   Date? start = null;
>   Date? end = null;
>   DOMString? number = null;
>   long? limit = null;
>   boolean reverse = false;
> };
>
> enum CallType
> {
>   "received",
>   "missed",
>   "noanswer",
>   "placed"
> };
>
> [Constructor]
> interface CallHistoryEntry
> {
>   readonly attribute long id;

Generating this id is tricky - not unsolvable, but it needs to be
specified correctly.

>   attribute Date start;
>   attribute float length; // seconds;
>   attribute DOMString number;
>   attribute CallType type;
> };

This is close to my idea of call history; we could merge the two.

>
> enum CallHistoryChangeType
> {
>   "added",
>   "removed",
>   "changed",
>   "reset"
> }
>
> interface CallHistoryChangeEntry
> {
>   readonly attribute long id;
>   readonly attribute CallHistoryChangeType type;
> }
>
> interface CallHistory : EventTarget
> {
>   DOMRequest enumerate(CallHistoryEnumOptions);

This is equivalent to "find()" with a filter. A hard-coded filter, but
still a filter.

>   DOMRequest clear();

Clear the local copy, or the system side? I guess the local one.

>   DOMRequest save(CallHistoryEntry); // Maybe not needed

There is one owner of the call history DB: the telephony middleware.

>   DOMRequest remove(long id);
>
>   DOMRequest startTrackingChanges();
>   DOMRequest stopTrackingChanges();
>
>   EventHandler onchangesavailable;
>   DOMRequest getChanges();
> };

Well, this is basically the same as, or very similar to, what I
specified. We pretty much agree here. I got internal feedback to
separate search from sync, and I did that.

>
> When something is changed in the backing database, the implementation
> fires the "changesavailable" event. Unless that event has already been
> fired since the last time the application called getChanges(). This
> event doesn't contain any information other than the name of the
> event.
>
> After that a change is added to the change-log for that app. The
> implementation conceptually keeps separate change logs for separate
> apps, though this can of course be optimized internally. Only
> applications which have called startTrackingChanges() will have a
> changelog. However the fact that an application has called
> startTrackingChanges is remembered across invocations of the app. I.e.
> if an app calls startTrackingChanges and is then closed, the
> implementation will still keep a changelog for that app.
>
> Once an application calls getChanges() the contents of the change log
> is delivered to the app and the change log for that app is cleared.
> That doesn't affect any pending change logs for any other apps though.
>
> There are a couple of ways that the change-log is simplified as it is
> recorded. For example if an entry is deleted, all previous "changed"
> entries for that id can be removed. And if there was a "added" entry
> for that id, then both the "added" and the "removed" entry can be
> removed as well.
>
> And if a "changed" entry exists for an id in the change log for an app
> and the entry is changed again, no additional "changed" entry needs to
> be added.
>
> For APIs which store more complex data it might make sense to keep a
> more advanced changed log. Rather than simply recording that an entry
> is changed, we could also record which field was changed and what the
> new value of the field is. That obviously affects how entries in the
> changelog can be collapsed when multiple changes are made to an entry.
> But in general we should start simple and see if simply tracking that
> the entry changed is sufficient.
>
> The "reset" value is a special value which means that the API
> implementation ended up getting a change log which was larger than it
> was able to track. In that case the API can drop the whole change log
> and only add a single entry whose .type is set to "reset". When an
> application receives such a value it needs to drop its current cache
> and re-query all data. This is expected to be a rare event and we can
> make it even more rare by introducing system messages which allow the
> platform to wake up the application and sync cached data before the
> change-log grows out of hand.
>
> However note that the API also supports doing extremely simple
> queries. It only allows filtering on phone number and date range. As
> well as supports a limit on the number of records returned.
>
> This can be implemented using only two indexes in the implementation's
> backing database. Yet it allows for implementing most simple UIs.
>
> So some applications might not need to use the change-log feature at
> all, in which case it won't call startTrackingChanges() and the
> implementation won't need to store a change log for that app.
>
> And if an application wants to be able to display an overview of who
> I've been calling with and when the last call was, it can use the
> change-log feature to only cache that information. It could at the
> same time use the querying feature to display full call logs for
> individual numbers. That way only the minimum amount of information
> needs to be duplicated.
>
> I don't know that I see any advantages of having a generic sync
> baseclass which is then inherited by APIs that want to support
> caching. I don't see any real advantages to developers since it's
> unlikely to lead to more code reuse given how differently for example
> SMS data vs. Contacts data will be cached. Reusing the same patterns
> makes a lot of sense since that makes it easier for developers to
> reuse experience and design patterns, but I'm not sure that having
> common base classes will get us a whole lot.
>
> The above is definitely somewhat rough. For example we shouldn't use a
> constructor for CallHistoryEntry but rather a callback interface or
> dictionary. And we might not need the ability to filter on date ranges
> when querying the database, numerical limits might be enough.

We are of the same mind on this - I think I ended up with a simpler
API for the same thing (once I introduced client-specific versions in
the sync API), but the thinking is the same.

However, in the case of call history, the only thing you will likely
need is a simple API to fetch the new data, which is forgotten on the
other side as soon as you have read and ack'd it.

If we have the other scenario - a native DB that does the above, with
the web runtime keeping a synchronized local copy of that DB - then
this or a similar API makes sense. But I wonder how to specify the API
in a way that keeps this hidden in the implementation and up to the
choice of the device maker.
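For what it's worth, the change-log collapsing rules described above can be sketched in a few lines (assumed semantics, one compact entry per id per app):

```javascript
// Sketch of the change-log collapsing rules: "added"+"removed" cancel out,
// "removed" supersedes earlier "changed" entries, and repeated "changed"
// entries for the same id are deduplicated.
function recordChange(log, id, type) {
  const prev = log.get(id);
  if (type === "removed") {
    if (prev === "added") log.delete(id);  // never existed, as far as the app knows
    else log.set(id, "removed");           // drops any earlier "changed"
  } else if (type === "changed") {
    if (prev === undefined) log.set(id, "changed");
    // an existing "added" or "changed" entry already covers this change
  } else if (type === "added") {
    log.set(id, "added");
  }
}
```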

Best regards,
Zoltan

Received on Monday, 7 January 2013 08:55:01 UTC