Re: Public SPARQL endpoints: managing (mis)use and communicating limits to users.

Hi,

On 18 Apr 2013, at 15:21, Jerven Bolleman <jerven.bolleman@isb-sib.ch> wrote:
>> I think that some caching with a minimum of query rewriting would get rid of 90% of the SELECT ?s ?p ?o WHERE {?s ?p ?o} queries.
> We have some caching on the UniProt side. But as nearly all queries are unique, result caching does not seem to work that well.
> And that assumes the first query will ever return to end up in the cache.
> Also, users write the same query in different ways, i.e.
>> SELECT DISTINCT ?type WHERE {[] a ?type}
> or
>> SELECT DISTINCT ?type WHERE {?s rdf:type ?type}
> 
> which leads to different requests and a cache miss.
That's why you need a minimum of query rewriting: at least to trim spaces, normalize variables and syntax.
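A minimal sketch of such syntax-level normalization (whitespace, keyword case, variable renaming), in Python for illustration. Note that it only catches syntactic variants; the `[] a ?type` vs `?s rdf:type ?type` pair above would still produce different cache keys, so real rewriting needs to parse the query:

```python
import re

def normalize(query):
    """Crude normalization of a SPARQL query into a cache key.

    Handles whitespace, keyword case, and variable names only;
    semantically equivalent but syntactically different queries
    still miss the cache.
    """
    # Collapse all runs of whitespace to a single space.
    q = re.sub(r"\s+", " ", query.strip())
    # Upper-case a few keywords (illustrative list only).
    for kw in ("select", "distinct", "where", "limit"):
        q = re.sub(r"\b%s\b" % kw, kw.upper(), q, flags=re.IGNORECASE)
    # Rename variables to ?v0, ?v1, ... in order of first appearance.
    mapping = {}
    def rename(m):
        if m.group(0) not in mapping:
            mapping[m.group(0)] = "?v%d" % len(mapping)
        return mapping[m.group(0)]
    return re.sub(r"\?\w+", rename, q)
```

With this, `select DISTINCT ?type where {[] a ?type}` and `SELECT DISTINCT ?t WHERE {[] a ?t}` map to the same key.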


>> From a user perspective, I would rather have a clear result code upfront telling me: your query is too heavy, not enough resources, and so on, than partial results + extra codes. I won't do much with partial results anyway... so it's time wasted on both sides.
> It's good for the exploratory use cases. And it's really hard to predict how many resources answering a query will take.
For exploratory use cases, you usually use LIMIT anyway. Again, from a user perspective, I would prefer to stay on the safe side with respect to query results (this is a bit closed-world, but that's what you expect if you query an endpoint).

>> One empirical solution could be to assign a quota per requesting IP (or other form of identification). One could then restrict the total amount of resources per time frame, possibly with smart policies. It would also keep people from breaking big queries into many small ones...
> Exactly, that would be a good way to do it.
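Such a per-client quota could be sketched as a sliding-window budget of "cost units" (CPU ms, IO operations, rows, ...). This is a hypothetical illustration, not an existing endpoint feature; because the budget is on cost rather than request count, splitting one big query into many small ones does not evade it:

```python
import time
from collections import defaultdict

class QuotaTracker:
    """Hypothetical per-client resource quota: each client (e.g. an IP)
    may spend at most `budget` cost units per `window` seconds."""

    def __init__(self, budget, window):
        self.budget = budget
        self.window = window
        self.spent = defaultdict(list)  # client -> [(timestamp, cost), ...]

    def try_spend(self, client, cost, now=None):
        """Return True and record the charge if the client is under quota,
        False if the query should be rejected (or queued)."""
        now = time.time() if now is None else now
        # Drop charges that have aged out of the window.
        self.spent[client] = [(t, c) for t, c in self.spent[client]
                              if now - t < self.window]
        used = sum(c for _, c in self.spent[client])
        if used + cost > self.budget:
            return False
        self.spent[client].append((now, cost))
        return True
```

The "playground" idea mentioned below fits on top of this: scale `budget` up when the endpoint is idle and down when many clients are active.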
>> 
>> But I was wondering: why is resource consumption a problem for SPARQL endpoint providers, and not for other "providers" on the web? (say, YouTube, Google, ...)
>> Is it the unpredictability of the resources needed?
> Yes, one small query can generate huge loads. Google and YouTube do not allow you to ask these kinds of queries (a blanket ban, really).
> It's still a problem for them anyway; that is why they do have crawl & query limits. However, any query to Google has practically the same CPU cost.
> That is because Google only allows you to ask simple queries, i.e. "find hotels on Bali", while the SPARQL user can ask
> "hi SPARQL, find me hotels on Bali that are around price ?X, within 25 minutes of the airport but outside the noise perimeter of that airport (check some municipal website),
> have a Michelin-rated kitchen (so check the Michelin website for me too), and were recommended by at least 3 of my friends' friends (please visit Facebook etc...)".
> A much harder question, but one that will lead to a better holiday recommendation.
OK, so unpredictability (coupled with resource needs) is the issue. Indeed there are not many "open resource" equivalents around; providers usually constrain the way you access data (or let you download it).

> This type of question is essential for doing data analysis and actually getting the value out of Big Data instead of just the volume ;) Because SPARQL is about more than finding
> information: it's about querying to generate (NEW) information.
I'm curious how many people think about SPARQL this way, as opposed to, say, something like SQL (from which you wouldn't expect as much). Another view is that some other analytic layer could be deployed on top of SPARQL, although embedding it in SPARQL makes more sense, imho.

> Also, as a data provider I want people to ask interesting (complicated) questions on our data, and make a best effort to answer them.
> The current approach of "here you go, take this 1/4 TB data dump and run it on your own infra" is just not affordable for small research labs (or even large pharma).
> This classic approach has stopped scaling for the bio research community; everyone maintaining their own copy of X public resources is not affordable.
At some point this leads to a discussion of who "pays" for services. So far Linked (Open) Data is very good for public resources, as it's basically like an infrastructure. But in general, you cannot expect everything to be state funded.


> A REST interface that we have been providing for basically the last 20 years is not enough (yes, HTTP access to Swiss-Prot is really old; we were the first web server
> on the web which used the image tag!). And we were big data when big data was still anything that took more than three 320 KB floppy disks.
:)
I think "big data" should be defined in relative terms. Any take on a definition?

ciao,
Andrea


> 
> Regards,
> Jerven
> 
>> 
>> best,
>> Andrea
>> 
>> On 18 Apr 2013, at 12:53, Jerven Bolleman <jerven.bolleman@isb-sib.ch> wrote:
>> 
>>> Hi All,
>>> 
>>> Managing a public SPARQL endpoint has some difficulties in comparison to managing a simpler REST API.
>>> Instead of counting API calls or external bandwidth use, we need to look at internal IO and CPU usage as well.
>>> 
>>> Many of the current public SPARQL endpoints limit all their users to queries of limited CPU time.
>>> But this is not enough to really manage (mis)use of an endpoint. Also, the SPARQL API, being HTTP based,
>>> suffers from the problem that we first send the status code and may only find out later that we can't
>>> answer the query after all, leading to a "200 not OK" problem :(
>>> 
>>> What approaches can we come up with as a community to embed resource-limit-exceeded exceptions in the
>>> SPARQL protocol? E.g. we could add an exception element to the SPARQL XML result format. [1]
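For concreteness, such an exception element might look like the following (purely hypothetical; the element name and code are invented and are not part of any SPARQL spec):

```xml
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="s"/></head>
  <results>
    <!-- ... partial bindings already streamed to the client ... -->
  </results>
  <!-- hypothetical: tells the client the result set above is incomplete -->
  <exception code="resource-limit-exceeded">
    IO quota exceeded; the results above are incomplete.
  </exception>
</sparql>
```

Because the element comes after the results, it can still be emitted when the limit is hit mid-stream, after the 200 status code has already been sent.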
>>> 
>>> The current limits on CPU use are not enough to really avoid misuse, which is why I submitted a patch to
>>> Sesame that allows limits on memory use as well. Limits on disk seeks or other IO counts may be needed by some, too.
>>> 
>>> But these are currently hard limits; what I really want is
>>> "playground limits", i.e. you can use the swing as much as you want if you are the only child in the park.
>>> Once there are more children, you have to share.
>>> 
>>> And how do we communicate this to our users? I.e. "this result set is incomplete because you exceeded your IO
>>> quota; please break up your query into smaller blocks."
>>> 
>>> For my day job, where I manage a 7.4 billion triple store with public access, some extra tools for managing users would be
>>> great.
>>> 
>>> Last but not least, how can we avoid users needing to run SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
>>> For beta.sparql.uniprot.org I have been moving much of this information into the SPARQL endpoint description, but it's not a place
>>> where people look for this information.
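One way to surface such statistics is to publish them as VoID metadata next to (or inside) the endpoint description, so clients can read precomputed counts instead of running expensive COUNT queries. A sketch in Turtle (the dataset URI and the subject count are invented for illustration; the triple count is the 7.4 billion from this thread):

```turtle
@prefix void: <http://rdfs.org/ns/void#> .

# Hypothetical dataset description exposing precomputed statistics.
<http://example.org/sparql#dataset>
    a void:Dataset ;
    void:triples 7400000000 ;
    void:distinctSubjects 500000000 .   # illustrative number only
```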
>>> 
>>> Regards,
>>> Jerven
>>> 
>>> [1] Yeah, the timing of these ideas is not great just after 1.1, but we can always start SPARQL 1.2 ;)
>>> 
>>> 
>>> 
>>> -------------------------------------------------------------------
>>> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
>>> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
>>> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>>> 1211 Geneve 4,
>>> Switzerland     www.isb-sib.ch - www.uniprot.org
>>> Follow us at https://twitter.com/#!/uniprot
>>> -------------------------------------------------------------------
>>> 
>>> 
>> 
> 
> 

Received on Thursday, 18 April 2013 15:03:45 UTC