Re: [Ann] LODStats - Real-time Data Web Statistics from Sören Auer on 2012-06-21 (public-lod@w3.org from June 2012)

From: Sören Auer <auer@informatik.uni-leipzig.de>
Date: Fri, 22 Jun 2012 01:30:28 +0200
To: Hugh Glaser <hg@ecs.soton.ac.uk>
CC: Linking Open Data <public-lod@w3.org>, SW-forum <semantic-web@w3.org>
Message-ID: <4FE3AE94.4040504@informatik.uni-leipzig.de>

On 21.06.2012 17:08, Hugh Glaser wrote:
> Hi.
> On 21 Jun 2012, at 11:40, Sören Auer wrote:
> 
>> On 21.06.2012 12:03, Hugh Glaser wrote:
>>> Interesting question from Denny.
>>> I guess you don't do http://thedatahub.org/dataset/sameas-org
>>> for the same reason.
>>> And
>>> http://thedatahub.org/dataset/dbpedia-lite
>>> (Or at least I couldn't find them.)
>>>
>>> I'm not sure you should claim "all LOD datasets registered on CKAN"
>>
>> Depends on the definition of dataset - for me a dataset is something
>> available in bulk and not a pointer to a large space of URLs containing
>> some data fragments requiring extensive crawling.
> I can't agree with this.
> To rule out Linked Data that only provides Linked Data without SPARQL or dump and say it is not a "LOD Dataset" seems to be terribly restrictive.

I would distinguish between Linked Data and a LOD dataset:

For me (and I would assume most people) /dataset/ means a set of data,
i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data
repository.

When the data adheres to the RDF data model and dereferenceable IRIs are
used, it's a /Linked Data dataset/.

When licensed under an open license (according to the Open Definition),
it's a /Linked Open Data (LOD) dataset/.
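
As a rough illustration of how these three definitions nest, here is a
minimal Python sketch. It is not LODStats code; the Resource fields and
the classify function are hypothetical names introduced only to spell
out the criteria above.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # All fields are hypothetical flags, used for illustration only.
    bulk_access: bool           # downloadable dump or bulk access, e.g. via SPARQL
    rdf_model: bool             # data adheres to the RDF data model
    dereferenceable_iris: bool  # IRIs can be dereferenced
    open_license: bool          # licensed according to the Open Definition


def classify(r: Resource) -> str:
    """Apply the definitions in order: dataset, Linked Data dataset, LOD dataset."""
    is_linked_data = r.rdf_model and r.dereferenceable_iris
    if not r.bulk_access:
        # Without bulk access it is not a dataset in this sense,
        # even if it is perfectly good Linked Data.
        return "Linked Data, but not a dataset" if is_linked_data else "not a dataset"
    if not is_linked_data:
        return "dataset"
    if not r.open_license:
        return "Linked Data dataset"
    return "Linked Open Data (LOD) dataset"


print(classify(Resource(True, True, True, True)))
print(classify(Resource(False, True, True, True)))
```

Under these assumptions, a crawl-only source with dereferenceable IRIs
comes out as Linked Data, but not as a dataset.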

I agree that /Linked Data/ also comprises individual data resources
(either standalone or integrated into HTML as RDFa), but I would not
call these datasets, and also not open (if not licensed according to
the Open Definition). BTW: the Open Definition also requires bulk data
access! So we already have two reasons why the concept "LOD dataset"
should imply availability of bulk data. This is also what we mention
everywhere when describing LODStats.

If you are interested in statistics about arbitrary Linked Data,
Sindice probably provides the better statistics.

> For example, the eprints (eprints.org) Open Archives have upwards of 100M triples of pretty interesting (to some people) Linked Data.

Maybe interesting, but if I have to crawl it in order to make use of it,
the burden is way too high for most users.

> It is mostly not in thedatahub, but even if it was you would ignore it.
> In fact, anything that is a wrapper around things like dbpedia, twitter, Facebook, or even Facebook itself is ignored, I am assuming from what you say.

For DBpedia you don't need a wrapper - the whole dataset is available in
bulk. All the others are, from my point of view, neither datasets nor
open. Maybe you can call them data services, where you can obtain one
individual data item at a time. And why would you want to call a wrapper
a dataset? A fundamental requirement for datasets would be, from my
point of view, that you can apply set operations like merging, joining
etc. You cannot do that with wrappers, so why should we call them
datasets?
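
To make the point about set operations concrete, here is a small Python
sketch. It assumes two dumps that have already been loaded as sets of
(subject, predicate, object) tuples; the triples and the helper
join_on_object are made up for illustration and do not come from any
real dataset or library.

```python
# Two bulk dumps, each a set of (subject, predicate, object) triples.
# All identifiers are invented for this example.
dump_a = {
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:name", "B"),
}
dump_b = {
    ("ex:b", "ex:name", "B"),
    ("ex:b", "ex:worksAt", "ex:c"),
}

# Merging is a plain set union; the shared triple appears only once.
merged = dump_a | dump_b


def join_on_object(left, right):
    """Join triples where the object on the left is the subject on the right."""
    return {
        (s, p, o, p2, o2)
        for (s, p, o) in left
        for (s2, p2, o2) in right
        if o == s2
    }


joined = join_on_object(dump_a, dump_b)
print(len(merged), len(joined))
```

Both operations need the complete sets up front. A wrapper that hands out
one item per request would first have to be crawled exhaustively before
either could be computed.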

> To publish statistics that claims to collect "statistics from all LOD datasets" using a method that ignores such resources is to seriously underreport the LOD activity (not a Good Thing), and also is to publish what I can only say is misleading statistical reports about LOD in general.
> I leave aside that you also fail to collect statistics from more than half of the datasets you claim to be collecting.

I agree that our figures are quite pessimistic, but in a way they
reflect what people really see: if there is no link to the dump in
thedatahub, the dataset is obviously difficult to find, and if
confusing/non-standard file extensions or dataset package formats are
used, this also makes it very difficult for people to actually use the
data. So I think it's better to be a little more pessimistic in this
case instead of reporting skyrocketing numbers all the time.

Sören

Received on Thursday, 21 June 2012 23:30:54 UTC