From: Kis, Zoltan <zoltan.kis@intel.com>
Date: Fri, 18 Jan 2013 16:32:33 +0200
To: Jonas Sicking <jonas@sicking.cc>
Cc: Wayne Carr <wayne.carr@intel.com>, Christophe Dumez <christophe.dumez@intel.com>, Sakari Poussa <sakari.poussa@intel.com>, "public-sysapps@w3.org" <public-sysapps@w3.org>, EDUARDO FULLEA CARRERA <efc@tid.es>, JOSE MANUEL CANTERA FONSECA <jmcf@tid.es>, Tantek Çelik <tantek@mozilla.com>
On Fri, Jan 18, 2013 at 11:21 AM, Jonas Sicking <jonas@sicking.cc> wrote:

> Please note that simply translating the filter call into SQL gives no
> guarantees that the query will execute with acceptable performance.

I agree. Of course it does not give *guarantees*. Is there any generic or specific solution which can give guarantees? If so, we have invented the universal problem-solving machine... You also need a good query optimization engine, one which takes into account factors that individual apps usually don't know or care about (e.g. DB statistics). Even that cannot work wonders, of course, but it is usually quite good, and there is a lot of experience behind it.

If an implementation doesn't count on using one, then I agree it's better to use an IndexedDB-like design, manage the indexes separately, and do the query optimizations in the apps. We have actually done that too, and there are arguments supporting it, especially in a web runtime. However, there are disadvantages, as you also pointed out.

BTW, are you sure query execution time is the real issue here? The main problem I have been seeing on mobile devices was not so much (or not only) the query performance itself, but how to handle the result set (e.g. bulk transfer, or streaming plus updates), i.e. the data model level. Basically, a UI which needs a data model wants to see a filtered and updated/synchronized view of its data.

For instance, if I write a call history app, I know which queries I have to implement based on the user requirements: I want to see the call history sorted by decreasing time, by status (missed, missed-new, dialed, received), and possibly by contact or number (all my calls with a given number). My main query is to select all call history records in decreasing time order, limited to e.g. 3x my visible list size, resolving phone numbers to contacts on the fly; and to keep that query live-updated.
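To make the "main query" above concrete, here is a minimal sketch using SQLite from Python. The `calls` and `contacts` tables, their columns, and the visible-list-size constant are all illustrative assumptions, not taken from any SysApps proposal; the point is the shape of the query: decreasing-time order, a limit of 3x the visible rows, and number-to-contact resolution in the same statement.

```python
# Sketch of the call-history "main query" described above.
# Table and column names are hypothetical, chosen for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, number TEXT);
CREATE TABLE calls (id INTEGER PRIMARY KEY, number TEXT,
                    start_time INTEGER, status TEXT);
-- Index supporting the decreasing-time sort.
CREATE INDEX idx_calls_time ON calls(start_time DESC);
""")
conn.execute("INSERT INTO contacts (name, number) VALUES ('Alice', '555-1')")
conn.executemany(
    "INSERT INTO calls (number, start_time, status) VALUES (?, ?, ?)",
    [("555-1", 100, "received"), ("555-2", 200, "missed")])

VISIBLE_LIST_SIZE = 20  # fetch 3x the visible rows, as described above
rows = conn.execute("""
    SELECT c.start_time, c.status, COALESCE(ct.name, c.number) AS who
    FROM calls AS c
    LEFT JOIN contacts AS ct ON ct.number = c.number
    ORDER BY c.start_time DESC
    LIMIT ?
""", (3 * VISIBLE_LIST_SIZE,)).fetchall()
print(rows)  # → [(200, 'missed', '555-2'), (100, 'received', 'Alice')]
```

The LEFT JOIN keeps calls from unknown numbers in the result; the live-update part (re-running or incrementally maintaining this view) is exactly the open problem the mail goes on to discuss.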
If my call history is not in the same DB as the contacts, then whatever indexes I create and whatever query optimizations I have in the two DBs, performance is screwed. The only thing I can do is build and maintain my data model in memory, caching the data I need from both databases and building the necessary indexes. Or build a covering index. This adds latency at startup and needs more memory, but it works for call history, because there are not many calls per second :).

Now if I want a messaging app with a conversation view of peer-to-peer IM messages, then I need to access the messages and the contacts, and sort the messages on resolved contacts and time. There are likely many more messages than calls, and there are many more possible queries, depending on the conversation view requirements. What does the platform give me to ease that task? A set of hard-coded queries it supports, in the form of an API? A query language? Or mechanisms to manage the indexes and the query optimizations myself? What is good from the platform or product point of view (memory usage, latency, processing load, etc.)?

As a rule, in these cases we would build mechanisms rather than policies, and we would leave narrowing down the scope to the implementations rather than hard-coding it into the APIs. Performance tends to work against generality, consistency, and security. Choose a compromise, but not in the API: push it to the implementation *if possible*.

> All SQL engines that I've ever worked with will happily fall back to
> doing a full table scan if it can't find the proper indexes to use.

Of course, and then they scale linearly. But why shouldn't they create indexes? Insert performance? IMO the performance we are interested in is query-bound, not update-bound.

> So let's get into specifics. How would you write an implementation
> for, for example, a contacts API which stores 5 different properties
> and supports a find function which allows filtering on a single
> arbitrary property. I.e.
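The covering-index option mentioned above can be sketched concretely with SQLite: when every column a query touches is part of one index, the engine answers the query from the index alone and never reads the table. Table and index names here are assumptions for illustration; the `EXPLAIN QUERY PLAN` output confirms the behavior.

```python
# Sketch: a covering index lets SQLite answer the decreasing-time query
# without touching the base table. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (number TEXT, start_time INTEGER, status TEXT)")
# Every column the query below needs is in this one index,
# in the order the query wants it sorted.
conn.execute(
    "CREATE INDEX idx_cover ON calls(start_time DESC, status, number)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT start_time, status, number
    FROM calls ORDER BY start_time DESC LIMIT 60
""").fetchall()
# The plan's detail column reports a covering index scan,
# e.g. "SCAN calls USING COVERING INDEX idx_cover".
print(plan)
```

Of course, as the mail notes, this only helps when everything lives in one database; it does nothing for the cross-DB case.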
> no composited filters allowed, only filters which say that property X
> should have value Y?
>
> Which indexes would you create, and what types of indexes?

Sounds like an exam, which was way too long ago for me :). A hash index for each property in question, if the row count is not huge. BTW, you forgot to mention that all the indexes used in a search should fit in memory ;). So it may have to be a B-tree.

> How about if the API supports a find function which supports
> composited filters on two arbitrary properties. I.e. filters that say
> that property X should have value Y and property A should have value B.
>
> Again, which indexes of what type would you create?

Again, an index per property, with a specificity (selectivity) indicator. First get all records which match the more specific column, and while doing that, check the condition on the next most specific column for each record. You didn't mention result set ordering.

> Keep in mind that we have agreed that performance of the find function
> should be roughly proportional to the number of entries returned.

There is no good general solution for this. A practical trick is to use query limits, since at any given moment you need only a limited subset of the result set. Then you end up with what I have been preaching about: we need an efficient mechanism to provide the result set to the apps and keep it updated. We could even implement sharing the most used live data models in one place, rather than in every app separately. I would choose that battle, and keep the filtering API generic.

> The more detail about the implementation algorithms you'd use, the better.
>
> I can't think of a way to do this with less than 25 b-tree indexes. 30
> if we require that the result is sorted by first name. Twice that if
> we allow sorting by first or last name.

We don't even know our requirements... Are we targeting customers with 100K contacts, or ones with 1000 on average?
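The per-property-index strategy sketched above (pick the most selective index, then post-filter the remaining conditions) can be shown in a few lines. This is a toy in-memory model, not an implementation of any proposed API; the record fields and the `find` helper are invented for illustration.

```python
# Sketch of "use the more specific index, verify the rest per record".
# One hash index per property; all names here are hypothetical.
from collections import defaultdict

records = [
    {"id": 1, "first": "Ann", "city": "Oslo"},
    {"id": 2, "first": "Ann", "city": "Turku"},
    {"id": 3, "first": "Bea", "city": "Oslo"},
]

# One hash index per property: value -> list of record ids.
indexes = {prop: defaultdict(list) for prop in ("first", "city")}
by_id = {}
for r in records:
    by_id[r["id"]] = r
    for prop in indexes:
        indexes[prop][r[prop]].append(r["id"])

def find(**conds):
    # Pick the most selective condition (fewest candidate ids)...
    prop, value = min(conds.items(),
                      key=lambda kv: len(indexes[kv[0]][kv[1]]))
    candidates = indexes[prop][value]
    # ...then verify every remaining condition on each candidate.
    return [by_id[i] for i in candidates
            if all(by_id[i][p] == v for p, v in conds.items())]

print(find(first="Ann", city="Oslo"))
# → [{'id': 1, 'first': 'Ann', 'city': 'Oslo'}]
```

This keeps the index count linear in the number of properties (rather than one composite index per filter combination), at the cost of a post-filtering pass whose size depends on the selectivity of the chosen index, which is exactly the trade-off under debate.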
The most I've heard of was someone having around 6000 contacts, but let's take 10000 as a reasonable maximum, which comfortably fits into any phone's memory today. Compare your (more refined) solution with my admittedly very rough method on 1000 (extreme: 10000) typical contacts, and tell me the difference in query performance. (We ignore for now that performance is not the only requirement; data integrity, consistency, and security matter too, and other apps still need to work, e.g. we should still be able to receive incoming calls.)

> I feel like we are either talking past each other, or there is some
> type of optimization technique that I don't know of.

I hope it's neither ;).

> I actually feel like we are quite far away from each other in our
> proposals on this topic.

I don't think so. We are trying to solve the same problems, and that doesn't leave much room. Because of that, the proposals differ mainly in their formulation. Of the two, I prefer the more generic one, because it leaves more options open while permitting the same (or nearly the same) optimizations.

> Probably at least in part because I don't
> understand how to implement your proposal with the performance
> constraints that I would like to have.

I think this is because we went too far discussing the alleged problems before agreeing on the use cases. What are the performance constraints?

> My proposed approach definitely puts a lot more responsibility on the
> application, but I think this is a necessity due to limitations in
> what we *can* implement with reasonable performance.

What exactly do you intend to provide for applications in the API?

> And I also think
> that the amount of use cases you end up being able to support with
> more complex filtering mechanisms is actually relatively small.

Let's enumerate those use cases, based on the existing proposals. I have the feeling you are imposing that it be small, confining the platforms within that constraint.
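The scale comparison above (1000 to 10000 contacts) is easy to put a rough number on. This is a toy measurement, not a benchmark of either proposal: at this dataset size even a naive full scan over in-memory records completes in well under a millisecond on current hardware, which is the underlying reason the index-count question matters less on a phone than at server scale. All field names and sizes are made up for illustration.

```python
# Rough illustration of the scale argument: a full scan over 10,000
# in-memory contacts. Field names and values are invented.
import time

contacts = [{"first": f"name{i}", "tel": f"555-{i:05d}"}
            for i in range(10_000)]

t0 = time.perf_counter()
hits = [c for c in contacts if c["tel"] == "555-04242"]  # full scan
elapsed = time.perf_counter() - t0

print(len(hits), f"{elapsed * 1000:.2f} ms")
```

The timing will vary by machine, which is why the mail asks for a comparison rather than asserting absolute numbers.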
There are apps which will need complex filtering anyway; see the examples above. Still, since we don't know the use cases of future apps, it would be better to specify generic filters and narrow the scope in the documentation. IMO that would allow pretty good performance too, provided we solve result set iteration properly.

Best regards,
Zoltan
Received on Friday, 18 January 2013 14:33:03 UTC