Re: Public SPARQL endpoints:managing (mis)-use and communicating limits to users.

On Thu, Apr 18, 2013 at 11:55 AM, Jerven Bolleman <
jerven.bolleman@isb-sib.ch> wrote:

> Hi Alan,
> On Apr 18, 2013, at 5:33 PM, Alan Ruttenberg wrote:
>
> >
> > On Thu, Apr 18, 2013 at 7:53 AM, Jerven Bolleman <
> jerven.bolleman@isb-sib.ch> wrote:
> > >Last but not least how can we avoid that users need to run SELECT
> (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
> > It's always rather disappointing to me that basic queries like this
> aren't very fast. I remember that we had a stored procedure for listing the
> predicates used in the store. It ran in a fraction of a second, while the
> straightforward query took ages.
> It's a good point, and currently they do run too slow. The problem is the
> DISTINCT; these are hard to optimize away, and even in current
> RDBMSs they take time. Everyone has been so busy making SPARQL 1.1 work
> that optimizations have taken a back seat for a while.
> > I am interested in why queries like this are not optimized. Seems to me
> that this should be straightforward to optimize by looking at index
> structures.
> Depends very much on your index structures. And even then you have to
> traverse your entire index.
> So let's say that for UniProt this query can be fully answered by scanning
> only a SPOC index. That index is 40GB in size.
> A single hard drive delivers data at about 200MB/s, so that will still take 200
> seconds at best.[1]
>

Not if the index is updated with a count field whenever there are inserts.
This would be a matter for Virtuoso to implement.
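
To illustrate the idea of a count field that is kept up to date on insert, here is a minimal sketch in Python. It is a toy in-memory model, not how Virtuoso or any other store is implemented; the class and method names (SubjectCountIndex, insert, delete, distinct_subjects) are made up for this example. The point is only that the distinct-subject count can be read in constant time if it is maintained at write time.

```python
from collections import Counter


class SubjectCountIndex:
    """Toy triple store that maintains the distinct-subject count on write.

    Hypothetical illustration only; real stores keep such statistics
    inside their on-disk index structures.
    """

    def __init__(self):
        self._triples = set()          # all stored (s, p, o) triples
        self._per_subject = Counter()  # subject -> number of triples with it

    def insert(self, s, p, o):
        triple = (s, p, o)
        if triple in self._triples:
            return  # re-inserting an existing triple changes nothing
        self._triples.add(triple)
        self._per_subject[s] += 1

    def delete(self, s, p, o):
        triple = (s, p, o)
        if triple not in self._triples:
            return
        self._triples.remove(triple)
        self._per_subject[s] -= 1
        if self._per_subject[s] == 0:
            # Last triple for this subject is gone, so drop the subject.
            del self._per_subject[s]

    def distinct_subjects(self):
        # No scan needed: the answer was maintained during writes.
        return len(self._per_subject)


idx = SubjectCountIndex()
idx.insert("uniprot:P1", "rdf:type", "up:Protein")
idx.insert("uniprot:P1", "up:mnemonic", "A_HUMAN")
idx.insert("uniprot:P2", "rdf:type", "up:Protein")
print(idx.distinct_subjects())
```

The cost moves from query time to write time, which is the trade-off a vendor would have to weigh.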

>
> Currently many implementations do not detect that this can be answered by
> doing only an index count and does not require materialization
> of the triple patterns. So you end up putting every ?s into a set, after
> which the count operation is done. This is horrifyingly expensive
> for a dataset like UniProt with billions of ?subjects. Even old-fashioned
> unix sort -u takes ages here.
>

I know. As I indicated, the responsibility for implementation of this lies
with the triple store vendors. Kingsley?
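
The difference between the two strategies discussed above can be sketched in a few lines of Python. This assumes, purely for illustration, that the store can hand back subjects in sorted order from an index such as SPOC; the function names are hypothetical. The first version materializes every ?s into a set, the second walks the sorted stream and only remembers the previous value.

```python
def count_distinct_set(subjects):
    """Materialize every subject into a set, then count.

    Memory grows with the number of distinct subjects.
    """
    return len(set(subjects))


def count_distinct_sorted(subjects):
    """Count distinct values in an iterable that is already sorted.

    Only the previous value is kept, so memory use is constant.
    The result is only correct if equal subjects are adjacent.
    """
    count = 0
    previous = object()  # sentinel that compares unequal to any subject
    for s in subjects:
        if s != previous:
            count += 1
            previous = s
    return count


# Subjects as they might come out of a subject-ordered index scan.
index_scan = ["s1", "s1", "s1", "s2", "s3", "s3"]
print(count_distinct_set(index_scan), count_distinct_sorted(index_scan))
```

Both functions still read every entry, so the full index scan remains; what the sorted version avoids is the large in-memory set.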


>
> Regards,
> Jerven
> > Rather than struggling to have users avoid basic, useful queries, how
> about making them work well?
> >
> > As use evolves, people reach a level where they do need to be cognizant
> of how queries are run. At that point, there's not a simple way to say
> which queries to avoid.
> >
> > The most useful tools to have are those that expose query plans as
> clearly as possible, highlight which parts of them are taking lots of time,
> and have a reference page that helps people configure their database, or
> reformulate queries to address the execution problems that arise. A first
> step towards this, if you are using Virtuoso, is to always ask for the
> query cost and display it with a link to ask for the query plan. With a
> little more work you can speculatively run the query for a bit and, if it
> times out, display the query plan with (or inside) the error message,
> as discussed above. If you want to give your users a little
> more control and think they will take advantage of it, you could add some
> way for them to say their guess of whether the query is easy, moderate, or
> hard, and allocate time to the query appropriately (e.g. have
> buttons/services easy, moderate, or hard in place of a single execute query
> button).
> >
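
The easy/moderate/hard suggestion could look roughly like the sketch below. Everything specific here is an assumption made for the example: the budget values, the endpoint URL, the function name, and the idea that the endpoint accepts a `timeout` request parameter in milliseconds. Check your own store's documentation before relying on any of it.

```python
from urllib.parse import urlencode

# Hypothetical time budgets in milliseconds for each self-declared level.
TIME_BUDGET_MS = {
    "easy": 1_000,
    "moderate": 30_000,
    "hard": 300_000,
}


def build_query_url(endpoint, query, difficulty):
    """Build a GET URL that carries the query and a matching time budget.

    Assumes the endpoint understands a `timeout` parameter; the
    parameter name and unit are assumptions, not a documented contract.
    """
    if difficulty not in TIME_BUDGET_MS:
        raise ValueError(f"unknown difficulty: {difficulty!r}")
    params = {"query": query, "timeout": TIME_BUDGET_MS[difficulty]}
    return f"{endpoint}?{urlencode(params)}"


url = build_query_url(
    "http://example.org/sparql",  # placeholder endpoint
    "SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o}",
    "hard",
)
print(url)
```

Three buttons in a query form would then map to the three keys of the budget table.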
> > Here's a couple of pages we had compiled about performance. I expect
> they are out of date as we haven't tended to them in a few years, but
> perhaps they will be of use to someone.
> >
> > http://neurocommons.org/page/Virtuoso_performance
> >
> [1] Please check my maths; it's been a long day.
>
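
A quick check of the arithmetic in footnote [1], using only the two figures given in the message (a 40GB index and 200MB/s of sequential throughput). The function name and the choice between decimal and binary gigabytes are assumptions of this sketch.

```python
def scan_seconds(index_size_gb, throughput_mb_per_s, mb_per_gb=1000):
    """Best-case time to read the whole index once, sequentially."""
    return index_size_gb * mb_per_gb / throughput_mb_per_s


print(scan_seconds(40, 200))        # 200.0 with decimal gigabytes
print(scan_seconds(40, 200, 1024))  # 204.8 with binary gigabytes
```

So the 200 second figure holds: a little over three minutes for one full pass, before any work is done on the data.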
> -------------------------------------------------------------------
> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
> 1211 Geneve 4,
> Switzerland     www.isb-sib.ch - www.uniprot.org
> Follow us at https://twitter.com/#!/uniprot
> -------------------------------------------------------------------
>
>

Received on Thursday, 18 April 2013 16:08:37 UTC