Re: Public SPARQL endpoints:managing (mis)-use and communicating limits to users. from Jerven Bolleman on 2013-04-18 (public-lod@w3.org from April 2013)

From: Jerven Bolleman <jerven.bolleman@isb-sib.ch>
Date: Thu, 18 Apr 2013 17:55:20 +0200
To: Alan Ruttenberg <alanruttenberg@gmail.com>
Cc: "public-lod@w3.org" <public-lod@w3.org>
Message-Id: <7B570748-622D-4248-819A-EC842A96CC45@isb-sib.ch>

Hi Alan,
On Apr 18, 2013, at 5:33 PM, Alan Ruttenberg wrote:

> 
> On Thu, Apr 18, 2013 at 7:53 AM, Jerven Bolleman <jerven.bolleman@isb-sib.ch> wrote:
> >Last but not least, how can we avoid users needing to run SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
> It's always rather disappointing to me that basic queries like this aren't very fast. I remember that we had a stored procedure for listing the predicates used in the store. It ran in a fraction of a second, while the straightforward query took ages.
It's a good point, and currently they do run too slowly. The problem is the DISTINCT: these are hard to optimize away, and even in current
RDBMSs they take time. Everyone has been so busy making SPARQL 1.1 work that optimizations have taken a back seat for a while.
> I am interested in why queries like this are not optimized. Seems to me that this should be straightforward to optimize by looking at index structures.
That depends very much on your index structures. And even then you have to traverse your entire index.
So let's say that for UniProt this query can be fully answered by scanning only a SPOC index. That index is 40GB in size.
A single hard drive delivers data at about 200MB/s, so the scan will still take 200 seconds at best.[1]
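
As a quick check of that estimate, here is a minimal Python sketch. It assumes decimal units (1GB = 1000MB) and a purely sequential read at the full quoted rate; the function name scan_seconds is only illustrative.

```python
GB = 10**9  # decimal gigabyte (assumption: 1GB = 1000MB)
MB = 10**6  # decimal megabyte


def scan_seconds(index_bytes, throughput_bytes_per_second):
    """Best-case time to read an index once, sequentially, from a single drive."""
    return index_bytes / throughput_bytes_per_second


# Figures from the message: a 40GB SPOC index read at 200MB/s.
best_case = scan_seconds(40 * GB, 200 * MB)
print(best_case)  # 200.0
```

This is a lower bound only: it ignores seeks, decompression, and any work done per triple.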

Currently many implementations do not detect that this query can be answered by an index count alone and does not require materialization
of the triple patterns. So you end up putting every ?s into a set, after which the count operation is done. This is horrifyingly expensive
for a dataset like UniProt with billions of subjects. Even old-fashioned Unix sort -u takes ages here.
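
The difference between the two strategies can be sketched in Python. This is an illustration only, not how any particular triple store works: it assumes the subjects arrive already sorted (as they would from an index ordered by subject, such as SPOC), and both function names are hypothetical.

```python
def count_distinct_materialized(subjects):
    """Naive approach: put every subject into a set, then count.
    Memory grows with the number of distinct subjects."""
    return len(set(subjects))


def count_distinct_sorted(subjects):
    """Index-scan approach: assumes `subjects` is sorted, so equal values
    are adjacent. Only the previous value is kept in memory."""
    count = 0
    previous = object()  # sentinel that compares unequal to any subject
    for subject in subjects:
        if subject != previous:
            count += 1
            previous = subject
    return count


# Subjects as they might appear when scanning a subject-ordered index.
scanned = ["ex:a", "ex:a", "ex:b", "ex:c", "ex:c", "ex:c"]
print(count_distinct_materialized(scanned))  # 3
print(count_distinct_sorted(scanned))        # 3
```

Both functions still read every entry once, so the full index traversal remains; what the sorted scan avoids is holding billions of subjects in memory at the same time.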

Regards,
Jerven
> Rather than struggling to have users avoid basic, useful queries, how about making them work well.
> 
> As use evolves, people reach a level where they do need to be cognizant of how queries are run. At that point, there's not a simple way to say which queries to avoid. 
> 
> The most useful tools to have are those that expose query plans as clearly as possible, highlight which parts of them are taking lots of time, and have a reference page that helps people configure their database or reformulate queries to address the execution problems that arise. A first step towards this, if you are using Virtuoso, is to always ask for the query cost and display it with a link to ask for the query plan. With a little more work you can speculatively run the query for a bit and, if it times out, display the query plan as discussed above alongside the error message (or provide it in the error message). If you want to give your users a little more control and think they will take advantage of it, you could add some way for them to give their guess of whether the query is easy, moderate, or hard, and allocate time to the query accordingly (e.g. have buttons/services easy, moderate, and hard in place of a single execute query button). 
> 
> Here are a couple of pages we had compiled about performance. I expect they are out of date as we haven't tended to them in a few years, but perhaps they will be of use to someone.
> 
> http://neurocommons.org/page/Virtuoso_performance
> 
[1] Please check my maths; it's been a long day.

-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.isb-sib.ch - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------

Received on Thursday, 18 April 2013 15:55:52 UTC