Re: The truth about SPARQL Endpoint availability from Kingsley Idehen on 2011-03-06 (public-lod@w3.org from March 2011)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Sun, 06 Mar 2011 09:53:34 -0500
To: public-lod@w3.org
Message-ID: <4D739FEE.9050900@openlinksw.com>

On 3/6/11 7:56 AM, Richard Cyganiak wrote:
> On 6 Mar 2011, at 12:16, Christopher Gutteridge wrote:
>> Talk of how many triples are in a store puts me in mind of this quote
>>      "Measuring programming progress by lines of code is like measuring aircraft building progress by weight."
> Well, but you know that quality on the Web of Data is measured in million triples! ;-)
>
> Jokes aside, as long as triple store performance is a frequent limiting factor, triple counts are important.
>
> “We can't load that dataset, it would be another 200MT, this would kill our store”
> “Their dataset is only 100kT, so how come their endpoint is so slow?”
> “Well if you have a million triples then you should be ok with any of the major stores on the hardware you already have.”
> “Given the load rate we typically get on our store, loading this dataset should take till Tuesday.”
> “Wow, this new dataset increases the total number of triples in the LOD Cloud by 3%!”
>
> You might object to some, but surely not all, of these uses of triple counts.
>
>> there's very few webmasters out there willing to do extra work just so we can make pretty graphs.
> Aside: As a maker of pretty graphs, I can tell you that you would be surprised.
>
> Enjoy your Sunday!
>
> Richard

In addition to the above, smart SPARQL-FED [1] isn't achievable without 
good stats about SPARQL endpoints. Locality-aware cost optimization 
depends heavily on metadata [2] gleaned from the remote data sources 
behind a SPARQL endpoint. What's good for SQL is well and truly good 
for SPARQL with regard to data virtualization, assuming Triple/Quad 
stores are a sub-category of DBMS. We can leverage voID when producing 
SPARQL endpoint description metadata. That matters a great deal from a 
pragmatic viewpoint, especially if we truly believe in the 
crystallization of the Web as a Global Data Space.
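
To make the point about stats concrete, here is a minimal Python sketch of 
how published triple counts could steer a federated query. It is not any 
store's actual optimizer: the EndpointStats class, the example.org 
endpoints, and the "smallest store first" heuristic are all assumptions 
made for illustration, and a real cost model would need far more than 
triple counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStats:
    """Hypothetical per-endpoint statistics, e.g. read from a voID description."""
    endpoint: str  # SPARQL endpoint URL
    triples: int   # advertised triple count for the dataset behind it


def order_services(patterns, stats):
    """Order (endpoint, pattern) pairs so the smallest known store comes first.

    Endpoints without stats sort last: with no metadata we assume the worst.
    """
    size = {s.endpoint: s.triples for s in stats}
    return sorted(patterns, key=lambda p: size.get(p[0], float("inf")))


def build_federated_query(patterns, stats):
    """Emit a query with one SERVICE block per remote pattern, in cost order."""
    blocks = [
        f"  SERVICE <{endpoint}> {{ {pattern} }}"
        for endpoint, pattern in order_services(patterns, stats)
    ]
    return "SELECT * WHERE {\n" + "\n".join(blocks) + "\n}"


if __name__ == "__main__":
    # Made-up endpoints and counts, for illustration only.
    stats = [
        EndpointStats("http://big.example.org/sparql", 200_000_000),
        EndpointStats("http://small.example.org/sparql", 100_000),
    ]
    patterns = [
        ("http://big.example.org/sparql", "?s ?p ?o"),
        ("http://small.example.org/sparql", "?s a ?type"),
    ]
    print(build_federated_query(patterns, stats))
```

Without the counts, the only available ordering is the one the query 
author happened to write, which is exactly the situation good endpoint 
metadata is meant to avoid.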

I don't expect users or Web developers to write SPARQL-FED, but I do 
expect them to assume and/or demand the Linked Data experience that 
SPARQL-FED, SPARQL Endpoint Metadata, and voID facilitate.

Links:

1. http://www.w3.org/TR/sparql-features/#Basic_federated_query - SPARQL-FED
2. http://www.w3.org/TR/sparql-features/#Service_description - SPARQL 
endpoint metadata

Kingsley
>
>>
>>
>> Ian Davis wrote:
>>> Is the number of triples that important? With all respect to the
>>> people on this list, I think there's a tendency to obsess over triple
>>> counts. Aren't we past that bootstrap phase of being awed when we see
>>> millions of triples being produced?  I thought we'd moved towards
>>> being more focussed on quality and utility of data than sheer numbers?
>>>
>>> Besides, for me the most interesting datasets are those that are
>>> continually changing as they reflect the real world and I'd like to
>>> see us work towards metrics for freshness and coverage.
>>>
>>>
>>> On Sun, Mar 6, 2011 at 11:20 AM, Tim Berners-Lee <timbl@w3.org> wrote:
>>>
>>>
>>>> Maybe the count of triples should be special-cased in the SPARQL server code,
>>>> spotted on input and the store size returned,
>>>> if it is reasonable for the endpoint to keep track of the size of its store.
>>>> (Do they anyway?)
>>>>
>>>> Tim
>>>>
>>>> On 2011-03-05, at 11:58, Bill Roberts wrote:
>>>>
>>>>
>>>>
>>>>> Thanks Hugh - as someone running a couple of SPARQL endpoints, I'd certainly prefer if people don't run a global count too often (or at all). It is indeed something that makes typical SPARQL implementations work very hard.
>>>>>
>>>>> But it's a good reminder that we should provide an alternative, and I'll look into providing triple counts in voiD.
>>>>>
>>>>> Bill
>>>>>
>>>>>
>>>>> On 5 Mar 2011, at 15:14, Hugh Glaser wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi,
>>>>>> On 5 Mar 2011, at 14:22, Andrea Splendiani wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think it depends on the store. I've tried some (from the endpoint list) and some return an answer pretty quickly. Some don't, and some don't support count.
>>>>>>> However, one could have this information only for the stores that answer the count query; there is no need to try all the time.
>>>>>>>
>>>>>>>
>>>>>> I am happy for a store implementor or owner to disagree, but I find it very unlikely that the owner of a store with a decent chunk of data (>  1M triples, say) would be happy for someone to keep issuing such a query, even if they did decide to give enough resources to execute it.
>>>>>> I would quickly blacklist such a site.
>>>>>>
>>>>>>
>>>>>>> VoID:
>>>>>>> is this a good query:
>>>>>>> select * where {?s <http://rdfs.org/ns/void#numberOfTriples> ?o }
>>>>>>>
>>>>>>>
>>>>>> I'm no SPARQL or voiD guru, but I think you need a bit more wrapping in the scovo stuff, so more like:
>>>>>>
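>>>>>> PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>>> PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
>>>>>> PREFIX void:  <http://rdfs.org/ns/void#>
>>>>>> PREFIX scovo: <http://purl.org/NET/scovo#>
>>>>>> SELECT DISTINCT ?endpoint ?uri ?triples ?uris WHERE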
>>>>>>           { ?ds a void:Dataset .
>>>>>>             ?ds void:sparqlEndpoint ?uri .
>>>>>>             ?ds rdfs:label ?endpoint .
>>>>>>             ?ds void:statItem [ scovo:dimension void:numberOfTriples ; rdf:value  ?triples ] .
>>>>>>          }
>>>>>>
>>>>>> Try it at
>>>>>>
>>>>>> http://kwijibo.talis.com/voiD/
>>>>>>
>>>>>> or
>>>>>>
>>>>>> http://void.rkbexplorer.com/
>>>>>>
>>>>>>
>>>>>> I guess Pierre-Yves might like to enhance his page by querying a voiD store to also give basic stats.
>>>>>> Or someone might like to do a store reporter that uses (a) voiD endpoint(s) plus Pierre-Yves's data (he has a SPARQL endpoint), to do so.
>>>>>> And maybe the CKAN endpoint would have extra useful data as well.
>>>>>> A real Semantic Web application that queried more than one SPARQL endpoint - now that would be a novelty!
>>>>>> Fancy the challenge, it is the weekend?! :-)
>>>>>>
>>>>>> ciao
>>>>>> Hugh
>>>>>>
>>>>>>
>>>>>>
>>>>>>> it doesn't seem viable if so.
>>>>>>>
>>>>>>> ciao,
>>>>>>> Andrea
>>>>>>>
>>>>>>>
>>>>>>> Il giorno 05/mar/2011, alle ore 13.49, Hugh Glaser ha scritto:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Nice idea, but... :-)
>>>>>>>>
>>>>>>>> SELECT (count(*) as ?c) WHERE {?s ?p ?o}
>>>>>>>>
>>>>>>>> is a pretty anti-social thing to do to a store.
>>>>>>>> At best, a store of any size will spend a while thinking, and then quite rightly decide they have burnt enough resources, and return some sort of error.
>>>>>>>>
>>>>>>>> For a properly maintained site, of course, the VoiD description will give lots of similar information.
>>>>>>>> Best
>>>>>>>> Hugh
>>>>>>>>
>>>>>>>> On 5 Mar 2011, at 13:06, Andrea Splendiani wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi, very nice!
>>>>>>>>> I have a small suggestion:
>>>>>>>>>
>>>>>>>>> why don't you ask "count(*) where {?s ?p ?o}" to the endpoint ?
>>>>>>>>> Or ask for the number of graphs ?
>>>>>>>>> Both pieces of information, number of triples and number of graphs, if logged and compared over time, can give a practical view of the liveliness of the content of the endpoint.
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> Andrea Splendiani
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Il giorno 28/feb/2011, alle ore 18.55, Pierre-Yves Vandenbussche ha scritto:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> Have you already encountered problems with SPARQL endpoint accessibility?
>>>>>>>>>> Do you feel frustrated that they are never available when you need them?
>>>>>>>>>> Do you develop an application using these services but wonder if it is reliable?
>>>>>>>>>>
>>>>>>>>>> Here is a tool [1] that lets you check the availability of public SPARQL endpoints and monitor it over the last hours/days.
>>>>>>>>>> Stay informed of status changes for a particular endpoint (or all of them) through RSS feeds.
>>>>>>>>>> All availability information generated by this tool is accessible through a SPARQL endpoint.
>>>>>>>>>>
>>>>>>>>>> This tool fetches public SPARQL endpoints from CKAN [2] open data. It tests every endpoint on this list for availability once an hour.
>>>>>>>>>>
>>>>>>>>>> [1] http://labs.mondeca.com/sparqlEndpointsStatus/index.html
>>>>>>>>>> [2] http://ckan.net/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Pierre-Yves Vandenbussche.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Andrea Splendiani
>>>>>>>>> Senior Bioinformatics Scientist
>>>>>>>>> Centre for Mathematical and Computational Biology
>>>>>>>>> +44(0)1582 763133 ext 2004
>>>>>>>>>
>>>>>>>>> andrea.splendiani@bbsrc.ac.uk
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Hugh Glaser,
>>>>>>>>            Intelligence, Agents, Multimedia
>>>>>>>>            School of Electronics and Computer Science,
>>>>>>>>            University of Southampton,
>>>>>>>>            Southampton SO17 1BJ
>>>>>>>> Work: +44 23 8059 3670, Fax: +44 23 8059 3045
>>>>>>>> Mobile: +44 78 9422 3822, Home: +44 23 8061 5652
>>>>>>>>
>>>>>>>> http://www.ecs.soton.ac.uk/~hg/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>> -- 
>> Christopher Gutteridge --
>> http://id.ecs.soton.ac.uk/person/1248
>>
>>
>> You should read the ECS Web Team blog:
>> http://blogs.ecs.soton.ac.uk/webteam/
>
>


-- 

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Sunday, 6 March 2011 14:54:04 UTC