Re: [Ann] LODStats - Real-time Data Web Statistics from Rinke Hoekstra on 2012-02-21 (public-lod@w3.org from February 2012)

From: Rinke Hoekstra <hoekstra@few.vu.nl>
Date: Tue, 21 Feb 2012 15:38:16 +0100
To: Sören Auer <auer@informatik.uni-leipzig.de>
Cc: "public-lod@w3.org" <public-lod@w3.org>, "pedantic-web@googlegroups.com" <pedantic-web@googlegroups.com>
Message-ID: <CAGxZet+2p8eie_tezzoGYaxDkTxt1Yc6vonFMS1ZVYnhcROnXg@mail.gmail.com>

Hi Sören, others,

LODStats is certainly great work. Congratulations!

However... is it just me, or is 'almost 2B triples' a very
disappointing number? If you go through all datasets advertised on the
Data Hub, the advertised number of triples is over 40B! This means
that only about one out of 20 triples in the linked 'open' data cloud
is publicly accessible.

Another thing... it seems as if LODStats is merely checking whether a
SPARQL endpoint is 'up', and not whether the endpoint actually contains
the data that has been advertised on the Data Hub. For instance, my very
own bubble is listed without problems, but I know for a fact that the
triple store no longer contains the data (sorry!). Do you have any
thoughts/ideas on how to detect such problems?
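
One possible check, as a rough sketch only: ask the endpoint for its
triple count and compare that with the number advertised on the Data
Hub. Everything below is an assumption for illustration, not how
LODStats works: the helper names, the 50% tolerance, and the premise
that the endpoint supports SPARQL 1.1 aggregates and returns SPARQL
JSON results.

```python
from urllib.parse import urlencode

# Assumes the endpoint supports SPARQL 1.1 aggregates; many stores do not.
COUNT_QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def count_query_url(endpoint: str) -> str:
    """Build a GET URL that asks the endpoint for its triple count.

    Assumes `endpoint` has no query string of its own.
    """
    return endpoint + "?" + urlencode({"query": COUNT_QUERY})


def parse_count(results: dict) -> int:
    """Read ?n from a SPARQL JSON results document (already parsed)."""
    bindings = results["results"]["bindings"]
    return int(bindings[0]["n"]["value"]) if bindings else 0


def classify(advertised: int, actual: int, tolerance: float = 0.5) -> str:
    """Compare the advertised triple count with what the endpoint reports.

    `tolerance` is a hypothetical threshold: below this fraction of the
    advertised size, the dataset is flagged as only partially available.
    """
    if actual == 0:
        return "empty"
    if advertised > 0 and actual < advertised * tolerance:
        return "partial"
    return "ok"
```

The URL from count_query_url() would be fetched with an Accept header
of application/sparql-results+json, and the parsed response passed to
parse_count(). An endpoint that answers but reports zero triples would
then show up as "empty" rather than as healthy. A count over the
default graph is a crude signal, of course, since one store may host
several datasets.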

Cheers,
Rinke



On 2 February 2012 13:18, Sören Auer <auer@informatik.uni-leipzig.de> wrote:
> On 02.02.2012 12:32, Richard Cyganiak wrote:
>> Congrats, this is awesome.
>
> Thanks Richard, we are happy you like it ;-)
>
>> So you're automatically harvesting 200+ datasets by starting with the LOD Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total of almost 2B triples.
>
> Exactly.
>
>> Also fascinating is the list of 250 datasets that couldn't be automatically harvested due to SPARQL errors or errors in the RDF dumps:
>> http://stats.lod2.eu/rdfdoc/?errors=1
>> This is an excellent interoperability testbed and should be closely studied by anyone who's interested in the state of actual interoperability on the web of linked data (hence a CC to the Pedantic Web Group).
>
> Yes, having an interoperability testbed and a timely view on the current
> state was one of the primary reasons for developing LODStats. Some
> problems might, however, also be related to incorrect CKAN metadata or
> some glitches in LODStats itself - we will try to iron them out as much
> as possible in the next weeks.
>
>> One request: on http://stats.lod2.eu/stats it shows top 5 lists of various sorts (top vocabularies, classes, languages etc). Would it be possible to allow drill-down to see longer lists, let's say top 100 or top 1000? These lists are great, but the really interesting stuff often happens in the midfield.
>
> Indeed, that's a great suggestion and will be implemented soon.
>
>> I see VoID summaries for each individual dataset. Are they aggregated somewhere into a single file that I could SPARQL?
>
> Not yet, but that's planned. For now it should be relatively easy to
> crawl and concat the VoID files, but we will make it more convenient ;-)
>
>> Also, how do I cite your work in publications? Is there a paper (or at least tech report) yet?
>
> We submitted a paper, which you can cite:
>
> Jan Demter, Sören Auer, Michael Martin, Jens Lehmann: LODStats – An
> Extensible Framework for High-performance Dataset Analytics, submitted
> to ESWC2012
>
> http://svn.aksw.org/papers/2011/RDFStats/public.pdf
>
> Best,
>
> Sören
>

Received on Tuesday, 21 February 2012 14:38:48 UTC