Re: Public SPARQL endpoints: managing (mis)use and communicating limits to users.

On 4/18/13 9:23 AM, Andrea Splendiani wrote:
> Hi,
>
> I think that some caching with a minimum of query rewriting would get rid of 90% of the select {?s ?p ?o} where {?s ?p ?o} queries.

Sorta.

Client queries are inherently unpredictable. That's always been the 
case, and it predates SPARQL. The same issues exist in the SQL RDBMS 
realm, which is why you don't see public SQL endpoints delivering what 
SPARQL endpoints provide.
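
That said, here is a minimal sketch of Andrea's cache-plus-rewriting 
idea (the interfaces are hypothetical, not any particular engine's 
API), which shows why it only catches the repeat offenders:

    import hashlib
    import re

    # Toy in-memory cache; a real deployment would want a TTL tied to
    # the dataset's update cycle (memcached, Redis, a caching proxy).
    CACHE = {}

    def normalize(query):
        # Collapse insignificant whitespace so trivially different
        # spellings of the same query share one cache entry. Real
        # query rewriting would canonicalize the parsed algebra.
        return re.sub(r"\s+", " ", query).strip()

    def cached_execute(query, execute):
        # Serve repeats of e.g. SELECT ?s ?p ?o WHERE {?s ?p ?o}
        # from cache instead of re-scanning the whole store.
        key = hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()
        if key not in CACHE:
            CACHE[key] = execute(query)
        return CACHE[key]

It flattens the repeats, but a novel pathological query still reaches 
the engine, which is where limits come in.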

>
>  From a user perspective, I would rather have a clear result code upfront telling me "your query is too heavy", "not enough resources", and so on, than partial results + extra codes.

Yes, and you get that in some solutions, e.g., what we provide. 
Basically, our server (subject to capacity) will tell you immediately 
that your query exceeds the query-cost limits (this is different from 
timeout limits). That feature was critical to getting the DBpedia 
SPARQL endpoint going, years ago.
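
Schematically (this is not our actual implementation; the cost 
estimator and the limit are placeholders), the interaction looks like 
this:

    class QueryCostExceeded(Exception):
        """Signalled before execution starts, so the client gets a
        clear error up front instead of truncated results later."""

    MAX_ESTIMATED_COST = 1_000_000  # placeholder cost units

    def guarded_execute(query, estimate_cost, execute):
        # Estimate cost from the query plan *before* touching the
        # data; this is distinct from a timeout, which fires
        # mid-execution.
        cost = estimate_cost(query)
        if cost > MAX_ESTIMATED_COST:
            raise QueryCostExceeded(
                "estimated cost %d exceeds limit %d"
                % (cost, MAX_ESTIMATED_COST))
        return execute(query)

The point is that rejection happens before execution, so the client 
gets a meaningful status code rather than a 200 followed by a partial 
answer.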

> I won't do much with partial results anyway... so it's time wasted on both sides.

Not in a world where you have a public endpoint and zero control over 
the queries issued by clients.
Not in a world where you need to provide faceted navigation over 
entity relations as part of a "precision find" style service atop 
RDF-based Linked Data, etc.
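
To make that concrete, here's a hedged sketch (budgets and names 
invented) of how partial results plus an explicit completeness flag 
stay useful in such a service:

    import time

    def execute_with_budget(rows, max_rows=10_000, max_seconds=5.0):
        # Collect whatever fits in the budget and say honestly whether
        # the result is complete. A faceted-browsing UI can render the
        # partial counts immediately and let the user narrow the query.
        out, start = [], time.monotonic()
        for row in rows:
            out.append(row)
            if len(out) >= max_rows or time.monotonic() - start > max_seconds:
                return out, False  # incomplete, but still usable
        return out, True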

>
> One empirical solution could be to assign a quota per requesting IP (or other form of identification).

That's but one coarse-grained factor. You need to be able to associate 
a user-agent (human or machine) profile with whatever quality of 
service you seek to scope to said profile. Again, this is the kind of 
thing we offer by leveraging WebID, inference, and RDF right inside 
the core DBMS engine.
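
Schematically, with entirely made-up tier names and numbers:

    # Illustrative mapping from an agent profile to a quality-of-
    # service tier; in the setup described above, the profile comes
    # from WebID authentication and the policy data lives in the DBMS
    # as RDF.
    QOS_TIERS = {
        "anonymous":     {"max_cost": 10_000,    "timeout_s": 5},
        "authenticated": {"max_cost": 1_000_000, "timeout_s": 60},
        "partner":       {"max_cost": None,      "timeout_s": None},
    }

    def classify(webid):
        # Placeholder: a real implementation would dereference the
        # WebID profile document and classify the agent from its RDF
        # description.
        return "authenticated" if webid else "anonymous"

    def limits_for(webid):
        return QOS_TIERS[classify(webid)]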

>   Then one could restrict the total amount of resource per time-frame, possibly with smart policies.

"Smart Policies" are the kind of thing you produce by exploiting the 
kind or entity relationship semantics baked into RDF based Linked Data. 
Basically, OWL (which is all about describing entity types and relation 
types semantics) serves this purpose very well. We certainly put it to 
use in our data access policy system which enables us to offer different 
capabilities and resource consumption to different human- or 
machine-agent profiles.
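
For illustration (every IRI below is invented), a policy engine can 
let the store's own subclass semantics decide whether an agent is 
covered by a given policy class:

    # Policy membership as a SPARQL ASK: the agent falls under the
    # high-quota policy if its type is, or is inferable to be, a
    # subclass of the policy class. The rdfs:subClassOf* property path
    # approximates what an OWL reasoner would conclude.
    POLICY_ASK = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/policy#>
    ASK FROM <http://example.org/agents> {
      <%s> a ?type .
      ?type rdfs:subClassOf* ex:HighQuotaAgent .
    }
    """

    def grants_high_quota(ask, agent_iri):
        # ask() is a placeholder for whatever SPARQL client you use;
        # it should return the boolean result of an ASK query.
        return ask(POLICY_ASK % agent_iri)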

> It would also avoid people breaking big queries in many small ones...

You can't avoid bad or challenging queries. What you can do is look to 
fine-grained data-access policies (semantically enhanced ACLs) to 
address this problem. This has always been the challenge, even before 
the emergence of the whole Semantic Web, RDF, etc. The same challenges 
also dogged the RDBMS realm. There is no dancing around this matter 
when dealing with traditional RDBMS or Web-oriented data access.
>
> But I was wondering: why is resource consumption a problem for SPARQL endpoint providers, and not for other "providers" on the web (say, YouTube, Google, ...)?
> Is it the unpredictability of the resources needed?

Good question!

They hide the problem behind airport-sized data centers, and then they 
get you to foot the bill via your profile data, which ultimately 
compromises your privacy.

This is a problem, and it's ultimately the basis for showcasing what 
RDF (an entity-relationship-based data model endowed with *explicit* 
rather than *implicit* human- and machine-readable entity relationship 
semantics) is actually all about.


Kingsley
>
> best,
> Andrea
>
> Il giorno 18/apr/2013, alle ore 12:53, Jerven Bolleman <jerven.bolleman@isb-sib.ch> ha scritto:
>
>> Hi All,
>>
>> Managing a public SPARQL endpoint has some difficulties in comparison to managing a simpler REST API.
>> Instead of counting API calls or external bandwidth use, we need to look at internal IO and CPU usage as well.
>>
>> Many of the current public SPARQL endpoints limit all their users to queries of limited CPU time.
>> But this is not enough to really manage (mis)use of an endpoint. Also, the SPARQL API, being HTTP-based,
>> suffers from the problem that we first send the status code and may only find out later that we can't
>> answer the query after all, leading to a "200 not OK" problem :(
>>
>> What approaches can we come up with as a community to embed resource-limit-exceeded exceptions in the
>> SPARQL protocols? E.g., we could add an exception element to the SPARQL XML result format. [1]
>>
>> The current limits on CPU use are not enough to really avoid misuse, which is why I submitted a patch to
>> Sesame that allows limits on memory use as well. Limits on disk seeks or other IO counts may be needed by some, too.
>>
>> But these are currently hard limits; what I really want is
>> "playground limits", i.e. you can use the swing as much as you want if you are the only child in the park.
>> Once there are more children, you have to share.
>>
>> And how do we communicate this to our users? E.g., "this result set is incomplete because you exceeded your IO
>> quota; please break up your queries into smaller blocks."
>>
>> For my day job, where I manage a 7.4-billion-triple store with public access, some extra tools for managing users would be
>> great.
>>
>> Last but not least, how can we avoid users needing to run SELECT (COUNT(DISTINCT ?s) AS ?sc) WHERE {?s ?p ?o} and friends?
>> For beta.sparql.uniprot.org I have been moving much of this information into the SPARQL endpoint description, but it's not a place
>> where people look for this information.
>>
>> Regards,
>> Jerven
>>
>> [1] Yeah, these ideas are not great timing just after SPARQL 1.1, but we can always start SPARQL 1.2 ;)
>>
>>
>>
>> -------------------------------------------------------------------
>> Jerven Bolleman                        Jerven.Bolleman@isb-sib.ch
>> SIB Swiss Institute of Bioinformatics      Tel: +41 (0)22 379 58 85
>> CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
>> 1211 Geneve 4,
>> Switzerland     www.isb-sib.ch - www.uniprot.org
>> Follow us at https://twitter.com/#!/uniprot
>> -------------------------------------------------------------------
>>
>>
>
>
>
>


-- 

Regards,

Kingsley Idehen	
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen

Received on Thursday, 18 April 2013 15:04:47 UTC