Re: Public SPARQL endpoints:managing (mis)-use and communicating limits to users. from Kingsley Idehen on 2013-04-18 (public-lod@w3.org from April 2013)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Thu, 18 Apr 2013 13:11:15 -0400
To: public-lod@w3.org
Message-ID: <51702933.9020500@openlinksw.com>

On 4/18/13 12:07 PM, Alan Ruttenberg wrote:
>
>
>
> On Thu, Apr 18, 2013 at 11:55 AM, Jerven Bolleman
> <jerven.bolleman@isb-sib.ch> wrote:
>
>     Hi Alan,
>     On Apr 18, 2013, at 5:33 PM, Alan Ruttenberg wrote:
>
>     >
>     > On Thu, Apr 18, 2013 at 7:53 AM, Jerven Bolleman
>     <jerven.bolleman@isb-sib.ch> wrote:
>     > >Last but not least, how can we avoid users needing to run
>     SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends.
>     > It's always rather disappointing to me that basic queries like
>     this aren't very fast. I remember that we had a stored procedure
>     for listing the predicates used in the store. It ran in a fraction
>     of a second, while the straightforward query took ages.
>     It's a good point, and currently they do run too slow. The problem is
>     the DISTINCT; these are hard to optimize away, and even in current
>     RDBMSs they take time. Everyone has been so busy making SPARQL 1.1
>     work that optimizations have taken a back seat for a while.
>     > I am interested in why queries like this are not optimized.
>     Seems to me that this should be straightforward to optimize by
>     looking at index structures.
>     Depends very much on your index structures. And even then you have
>     to traverse your entire index.
>     So let's say that for UniProt this query can be fully answered by
>     scanning only a SPOC index. That index is 40GB in size.
>     A single hard drive delivers data at 200MB/s, so that will still
>     take 200 seconds at best.[1]
>
>
> Not if the index is updated with a count field whenever there are 
> inserts. This would be a matter for virtuoso to implement.
>
>
>     Currently many implementations do not detect that this can be
>     answered by doing only an index count and does not require
>     materialization of the triple patterns. So you end up putting
>     every ?s into a set, after which the count operation is done. This
>     is horrifyingly expensive for a dataset like UniProt with billions
>     of ?subjects. Even old-fashioned Unix sort -u takes ages here.
>
>
> I know. As I indicated, the responsibility for implementing this
> lies with the triple store vendors. Kingsley?

Yes, it does lie with us as engine developers, and that's why we 
implemented the "anytime query" feature years ago, in response to the 
challenges posed by public endpoints that are accessible to anyone 
(human or machine). The goal is to always provide a solution, be it 
complete or partial.

We are in the middle of the official V 7.0 release rollout. Once done, 
we'll do the following:

1. provide an update to material covered in: 
http://neurocommons.org/page/Virtuoso_performance
2. add and update: http://bit.ly/17rYNT3 -- which holds some aggregate 
query results for execution without the limits we apply to the public 
instance
3. shed light on the options for dealing with this matter at the public 
endpoint level, beyond what we already do via the anytime query feature -- 
where you get partial results subject to the "fair use" limits that 
we apply to our public endpoints.
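
To make the two ideas in this thread concrete, here is a rough Python 
sketch. It is not how Virtuoso (or any other store) implements either 
one; the names SubjectCountIndex and anytime_distinct_subjects are 
hypothetical, and the in-memory set stands in for a real on-disk index. 
The first part keeps a per-subject count up to date on every insert, as 
Alan suggests, so the distinct-subject count needs no scan. The second 
part mimics the spirit of an "anytime" answer: scan until a time budget 
runs out, then return whatever was counted plus a flag saying whether 
the result is complete.

```python
import time
from collections import defaultdict


class SubjectCountIndex:
    """Toy triple store that maintains per-subject triple counts on write."""

    def __init__(self):
        self._triples = set()
        self._per_subject = defaultdict(int)

    def insert(self, s, p, o):
        if (s, p, o) in self._triples:
            return  # duplicate triple: counts stay unchanged
        self._triples.add((s, p, o))
        self._per_subject[s] += 1

    def delete(self, s, p, o):
        if (s, p, o) not in self._triples:
            return
        self._triples.remove((s, p, o))
        self._per_subject[s] -= 1
        if self._per_subject[s] == 0:
            del self._per_subject[s]  # subject no longer appears anywhere

    def distinct_subjects(self):
        # Answered from the maintained counts, without scanning triples.
        return len(self._per_subject)

    def triples(self):
        return iter(self._triples)


def anytime_distinct_subjects(triples, budget_seconds, clock=time.monotonic):
    """Scan-based count that gives up when the time budget is spent.

    Returns (count_so_far, complete). The clock parameter is an
    assumption added here so the cutoff can be tested deterministically.
    """
    deadline = clock() + budget_seconds
    seen = set()
    for s, _p, _o in triples:
        if clock() >= deadline:
            return len(seen), False  # partial result
        seen.add(s)
    return len(seen), True
```

The trade-off is the usual one: the maintained count makes every write 
slightly more expensive and only helps for the aggregates you chose in 
advance, while the time-budgeted scan works for any query but may 
return only a lower bound.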



-- 

Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen


Received on Thursday, 18 April 2013 17:11:42 UTC