- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Sun, 06 Mar 2011 09:53:34 -0500
- To: public-lod@w3.org
On 3/6/11 7:56 AM, Richard Cyganiak wrote:
> On 6 Mar 2011, at 12:16, Christopher Gutteridge wrote:
>> Talk of how many triples are in a store puts me in mind of this quote
>> "Measuring programming progress by lines of code is like measuring aircraft building progress by weight."
>
> Well, but you know that quality on the Web of Data is measured in million triples! ;-)
>
> Jokes aside, as long as triple store performance is a frequent limiting factor, triple counts are important.
>
> “We can't load that dataset, it would be another 200MT, this would kill our store”
> “Their dataset is only 100kT, so how come their endpoint is so slow?”
> “Well if you have a million triples then you should be ok with any of the major stores on the hardware you already have.”
> “Given the load rate we typically get on our store, loading this dataset should take till tuesday.”
> “Wow, this new dataset increases the total number of triples in the LOD Cloud by 3%!”
>
> You might object to some, but surely not all, of these uses of triple counts.
>
>> there's very few webmasters out there willing to do extra work just so we can make pretty graphs.
>
> Aside: As a maker of pretty graphs, I can tell you that you would be surprised.
>
> Enjoy your Sunday!
>
> Richard

In addition to the above, smart SPARQL-FED [1] isn't achievable without good stats about SPARQL endpoints. Locality-aware cost optimization is very dependent on metadata [2] gleaned from remote data sources associated with a SPARQL endpoint. What's good for SQL is well and truly good for SPARQL re. data virtualization, assuming Triple/Quad stores are a sub-category of DBMS.

We can leverage voID when making SPARQL endpoint description metadata. It's actually very important from a pragmatic viewpoint, especially if we truly believe in the crystallization of the Web as a Global Data Space.

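As a rough illustration of why endpoint stats matter for federation, here is a minimal Python sketch. It assumes each endpoint's triple count has already been read from its published voiD description, and it uses store size as a crude stand-in for query cost. The `EndpointSource` type, the function names, and the example.org URLs are all hypothetical; a real optimizer would use much richer statistics than a single count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointSource:
    """One remote source in a federated query (hypothetical type)."""
    url: str       # SPARQL endpoint URL
    pattern: str   # graph pattern to evaluate at that endpoint
    triples: int   # triple count, assumed to come from the endpoint's voiD description


def order_by_estimated_cost(sources):
    """Crude heuristic: treat smaller stores as cheaper and put them first."""
    return sorted(sources, key=lambda src: src.triples)


def build_federated_query(sources):
    """Emit one SERVICE block per source, in estimated cheapest-first order."""
    blocks = [
        f"  SERVICE <{src.url}> {{ {src.pattern} }}"
        for src in order_by_estimated_cost(sources)
    ]
    return "SELECT * WHERE {\n" + "\n".join(blocks) + "\n}"


if __name__ == "__main__":
    sources = [
        EndpointSource("http://example.org/big/sparql", "?s ?p ?o", 200_000_000),
        EndpointSource("http://example.org/small/sparql", "?s a ?type", 100_000),
    ]
    print(build_federated_query(sources))
```

Without the counts, the planner has no basis for choosing an order at all, and getting them by running a global count against each endpoint is exactly what the thread below argues against.
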
I don't expect users or Web developers to write SPARQL-FED, but I do expect them to assume and/or demand the Linked Data experience that SPARQL-FED, SPARQL Endpoint Metadata, and voID facilitate.

Links:

1. http://www.w3.org/TR/sparql-features/#Basic_federated_query -- SPARQL-FED
2. http://www.w3.org/TR/sparql-features/#Service_description -- SPARQL endpoint metadata.

Kingsley

>
>>
>> Ian Davis wrote:
>>> Is the number of triples that important? With all respect to the
>>> people on this list, I think there's a tendency to obsess over triple
>>> counts. Aren't we past that bootstrap phase of being awed when we see
>>> millions of triples being produced? I thought we'd moved towards
>>> being more focussed on quality and utility of data than sheer numbers?
>>>
>>> Besides, for me the most interesting datasets are those that are
>>> continually changing as they reflect the real world and I'd like to
>>> see us work towards metrics for freshness and coverage.
>>>
>>>
>>> On Sun, Mar 6, 2011 at 11:20 AM, Tim Berners-Lee <timbl@w3.org> wrote:
>>>
>>>> Maybe the count of triples should be special-cased in the sparql server code,
>>>> spotted on input and the store size returned.
>>>> if it is reasonable for the endpoint to keep track of the size of its store.
>>>> (Do they anyway?)
>>>>
>>>> Tim
>>>>
>>>> On 2011-03-05, at 11:58, Bill Roberts wrote:
>>>>
>>>>> Thanks Hugh - as someone running a couple of SPARQL endpoints, I'd certainly prefer if people don't run a global count too often (or at all). It is indeed something that makes typical SPARQL implementations work very hard.
>>>>>
>>>>> But it's a good reminder we should provide an alternative and i'll look into providing triple counts in voiD.
>>>>>
>>>>> Bill
>>>>>
>>>>>
>>>>> On 5 Mar 2011, at 15:14, Hugh Glaser wrote:
>>>>>
>>>>>> Hi,
>>>>>> On 5 Mar 2011, at 14:22, Andrea Splendiani wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think it depends on the store, I've tried some (from the endpoint list) and some returns a answer pretty quickly. Some doesn't and some doesn't support count.
>>>>>>> However, one could have this information only for the stores that answers the count query, no need to try all time.
>>>>>>>
>>>>>> I am happy for a store implementor or owner to disagree, but I find it very unlikely that the owner of a store with a decent chunk of data (> 1M triples, say) would be happy for someone to keep issuing such a query, even if they did decide to give enough resources to execute it.
>>>>>> I would quickly blacklist such a site.
>>>>>>
>>>>>>> VoID:
>>>>>>> is this a good query:
>>>>>>> select * where {?s <http://rdfs.org/ns/void#numberOfTriples> ?o }
>>>>>>>
>>>>>> I'm no SPARQL or voiD guru, but I think you need a bit more wrapping in the scovo stuff, so more like:
>>>>>>
>>>>>> SELECT DISTINCT ?endpoint ?uri ?triples ?uris WHERE
>>>>>> { ?ds a void:Dataset .
>>>>>>   ?ds void:sparqlEndpoint ?uri .
>>>>>>   ?ds rdfs:label ?endpoint .
>>>>>>   ?ds void:statItem [ scovo:dimension void:numberOfTriples ; rdf:value ?triples ] .
>>>>>> }
>>>>>>
>>>>>> Try it at
>>>>>>
>>>>>> http://kwijibo.talis.com/voiD/
>>>>>>
>>>>>> or
>>>>>>
>>>>>> http://void.rkbexplorer.com/
>>>>>>
>>>>>>
>>>>>> I guess Pierre-Yves might like to enhance his page by querying a voiD store to also give basic stats.
>>>>>> Or someone might like to do a store reporter that uses (a) voiD endpoint(s) plus Pierre-Yves's data (he has a SPARQL endpoint), to do so.
>>>>>> And maybe the CKAN endpoint would have extra useful data as well.
>>>>>> A real Semantic Web application that queried more than one SPARQL endpoint - now that would be a novelty!
>>>>>> Fancy the challenge, it is the weekend?! :-)
>>>>>>
>>>>>> ciao
>>>>>> Hugh
>>>>>>
>>>>>>> it doesn't seem viable if so.
>>>>>>>
>>>>>>> ciao,
>>>>>>> Andrea
>>>>>>>
>>>>>>>
>>>>>>> On 5 Mar 2011, at 13:49, Hugh Glaser wrote:
>>>>>>>
>>>>>>>> Nice idea, but,... :-)
>>>>>>>>
>>>>>>>> SELECT (count(*) as ?c) WHERE {?s ?p ?o}
>>>>>>>>
>>>>>>>> is a pretty anti-social thing to do to a store.
>>>>>>>> At best, a store of any size will spend a while thinking, and then quite rightly decide they have burnt enough resources, and return some sort of error.
>>>>>>>>
>>>>>>>> For a properly maintained site, of course, the VoiD description will give lots of similar information.
>>>>>>>> Best
>>>>>>>> Hugh
>>>>>>>>
>>>>>>>> On 5 Mar 2011, at 13:06, Andrea Splendiani wrote:
>>>>>>>>
>>>>>>>>> Hi, very nice!
>>>>>>>>> I have a small suggestion:
>>>>>>>>>
>>>>>>>>> why don't you ask "count(*) where {?s ?p ?o}" to the endpoint ?
>>>>>>>>> Or ask for the number of graphs ?
>>>>>>>>> Both information, number of triples and number of graphs, if logged and compared over time, can give a practical view of the liveliness of the content of the endpoint.
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> Andrea Splendiani
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 28 Feb 2011, at 18:55, Pierre-Yves Vandenbussche wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> you have already encountered problems of SPARQL endpoint accessibility ?
>>>>>>>>>> you feel frustrated they are never available when you need them?
>>>>>>>>>> you develop an application using these services but wonder if it is reliable?
>>>>>>>>>>
>>>>>>>>>> Here is a tool [1] that allows you to know public SPARQL endpoints availability and monitor them in the last hours/days.
>>>>>>>>>> Stay informed of a particular (or all) endpoint status changes through RSS feeds.
>>>>>>>>>> All availability information generated by this tool is accessible through a SPARQL endpoint.
>>>>>>>>>>
>>>>>>>>>> This tool fetches public SPARQL endpoints from CKAN open data. From this list, it runs tests every hour for availability.
>>>>>>>>>>
>>>>>>>>>> [1] http://labs.mondeca.com/sparqlEndpointsStatus/index.html
>>>>>>>>>>
>>>>>>>>>> [2] http://ckan.net/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Pierre-Yves Vandenbussche.
>>>>>>>>>>
>>>>>>>>> Andrea Splendiani
>>>>>>>>> Senior Bioinformatics Scientist
>>>>>>>>> Centre for Mathematical and Computational Biology
>>>>>>>>> +44(0)1582 763133 ext 2004
>>>>>>>>>
>>>>>>>>> andrea.splendiani@bbsrc.ac.uk
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Hugh Glaser,
>>>>>>>> Intelligence, Agents, Multimedia
>>>>>>>> School of Electronics and Computer Science,
>>>>>>>> University of Southampton,
>>>>>>>> Southampton SO17 1BJ
>>>>>>>> Work: +44 23 8059 3670, Fax: +44 23 8059 3045
>>>>>>>> Mobile: +44 78 9422 3822, Home: +44 23 8061 5652
>>>>>>>>
>>>>>>>> http://www.ecs.soton.ac.uk/~hg/
>>>>>>>>
>>
>> --
>> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>>
>> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
>
>

--
Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Received on Sunday, 6 March 2011 14:54:04 UTC