Re: Public SPARQL endpoints:managing (mis)-use and communicating limits to users. from Andrea Splendiani on 2013-04-18 (public-lod@w3.org from April 2013)

From: Andrea Splendiani <andrea.splendiani@iscb.org>
Date: Thu, 18 Apr 2013 14:23:49 +0100
To: Jerven Bolleman <jerven.bolleman@isb-sib.ch>
Cc: public-lod@w3.org
Message-Id: <4D35E7E7-902C-4284-B22E-A453C71154FF@iscb.org>

Hi,

I think that some caching with a minimum of query rewriting would get rid of 90% of the SELECT ?s ?p ?o WHERE {?s ?p ?o} queries.
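
To make the caching idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names normalize_query and QueryCache are hypothetical, the rewriting is a crude text normalization (it ignores string literals and property paths), and run_query stands in for whatever actually evaluates the query on the store.

```python
import re
from typing import Callable, Dict

# Hypothetical helper: only these two keywords are upper-cased in this sketch.
KEYWORDS = re.compile(r"\b(select|where)\b", re.IGNORECASE)


def normalize_query(query: str) -> str:
    """Crude rewriting so that trivially different spellings share one cache key.

    Assumption: the query has no string literals or property paths that this
    text-level rewriting could damage.
    """
    q = re.sub(r"([{}])", r" \1 ", query)   # pad braces with spaces
    q = re.sub(r"(?<=\w)\?", " ?", q)       # split "?s?p" into "?s ?p"
    q = KEYWORDS.sub(lambda m: m.group(1).upper(), q)
    return " ".join(q.split())              # collapse runs of whitespace


class QueryCache:
    """In-memory cache keyed by the normalized query text (no eviction)."""

    def __init__(self, run_query: Callable[[str], str]) -> None:
        self._run_query = run_query          # assumed backend call
        self._results: Dict[str, str] = {}
        self.hits = 0

    def execute(self, query: str) -> str:
        key = normalize_query(query)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        # The original text is sent to the backend; the rewritten form is
        # used only as the cache key.
        result = self._run_query(query)
        self._results[key] = result
        return result
```

With this, "SELECT ?s ?p ?o WHERE {?s ?p ?o}" and "select ?s ?p ?o  where { ?s?p ?o }" map to the same key, so the second request never reaches the store. A real deployment would also need eviction and invalidation on updates, which this sketch leaves out.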

From a user perspective, I would rather have a clear result code upfront telling me "your query is too heavy", "not enough resources" and so on, than partial results plus extra codes. I can't do much with partial results anyway, so it's time wasted on both sides.

One empirical solution could be to assign a quota per requesting IP (or other form of identification). One could then restrict the total amount of resources per time frame, possibly with smart policies. It would also stop people from breaking big queries into many small ones...
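
A rough sketch of such a quota in Python, under stated assumptions: the class name ClientQuota is hypothetical, the "cost" of a query is an abstract number of resource units that the endpoint would have to measure or estimate somehow (CPU seconds, IO operations, ...), and the policy is the simplest possible one, a fixed budget per client per time window.

```python
import time
from typing import Callable, Dict, Tuple


class ClientQuota:
    """Fixed-window resource budget per client (e.g. per requesting IP)."""

    def __init__(
        self,
        budget: float,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._budget = budget
        self._window = window_seconds
        self._clock = clock
        # client id -> (start of current window, units used in that window)
        self._usage: Dict[str, Tuple[float, float]] = {}

    def charge(self, client_id: str, cost: float) -> bool:
        """Record `cost` units for the client; return False if over budget."""
        now = self._clock()
        start, used = self._usage.get(client_id, (now, 0.0))
        if now - start >= self._window:
            start, used = now, 0.0           # a new window begins
        if used + cost > self._budget:
            self._usage[client_id] = (start, used)
            return False                     # reject upfront, nothing charged
        self._usage[client_id] = (start, used + cost)
        return True
```

Because the budget is summed over all queries from the same client within the window, splitting one big query into many small ones consumes the same quota. A "smart policy" could replace the fixed budget, for instance by raising it when the endpoint is idle.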

But I was wondering: why is resource consumption a problem for SPARQL endpoint providers, and not for other "providers" on the web (say, YouTube, Google, ...)?
Is it the unpredictability of the resources needed?

best,
Andrea

On 18 Apr 2013, at 12:53, Jerven Bolleman <jerven.bolleman@isb-sib.ch> wrote:

> Hi All,
> 
> Managing a public SPARQL endpoint has some difficulties in comparison to managing a simpler REST API.
> Instead of counting API calls or external bandwidth use, we need to look at internal IO and CPU usage as well.
> 
> Many of the current public SPARQL endpoints limit all their users to queries of limited CPU time.
> But this is not enough to really manage (mis)use of an endpoint. Also, the SPARQL API, being HTTP-based,
> suffers from the problem that we first send the status code and may only find out later that we can't
> answer the query after all, leading to a "200 not OK" problem :(
> 
> What approaches can we come up with as a community to embed "resource limit exceeded" exceptions in the
> SPARQL protocol? E.g., we could add an exception element to the SPARQL XML result format. [1]
> 
> The current limits on CPU use are not enough to really avoid misuse, which is why I submitted a patch to
> Sesame that allows limits on memory use as well, although limits on disk seeks or other IO counts may be needed by some too.
> 
> But these are currently hard limits; what I really want is
> "playground limits", i.e. you can use the swing as much as you want if you are the only child in the park.
> Once there are more children, you have to share.
> 
> And how do we communicate this to our users? E.g., "this result set is incomplete because you exceeded your IO
> quota; please break up your queries into smaller blocks."
> 
> For my day job, where I manage a 7.4 billion triple store with public access, some extra tools for managing users would be
> great.
> 
> Last but not least, how can we spare users from having to run SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
> For beta.sparql.uniprot.org I have been moving much of this information into the SPARQL endpoint description, but it's not a place
> where people look for this information.
> 
> Regards,
> Jerven
> 
> [1] Yeah, these ideas are not great timing, coming just after 1.1, but we can always start SPARQL 1.2 ;)
> 
> 
> 
> -------------------------------------------------------------------
> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
> 1211 Geneve 4,
> Switzerland     www.isb-sib.ch - www.uniprot.org
> Follow us at https://twitter.com/#!/uniprot
> -------------------------------------------------------------------
> 
>

Received on Thursday, 18 April 2013 13:24:43 UTC