Re: [Ann] LODStats - Real-time Data Web Statistics from Sören Auer on 2012-06-22 (semantic-web@w3.org from June 2012)

From: Sören Auer <auer@informatik.uni-leipzig.de>
Date: Fri, 22 Jun 2012 20:31:26 +0200
To: Denny Vrandecic <denny.vrandecic@wikimedia.de>
CC: Hugh Glaser <hg@ecs.soton.ac.uk>, Linking Open Data <public-lod@w3.org>, SW-forum <semantic-web@w3.org>
Message-ID: <4FE4B9FE.1080201@informatik.uni-leipzig.de>
On 22.06.2012 11:30, Denny Vrandecic wrote:
> According to your definition, then LODStats is misnamed.
> It should be LOD Datasets Stats.
> 
> Or am I misunderstanding something?

Maybe you are right, Denny, but there is never a perfect name.
Actually, LODStats is both a tool and a service. The open-source tool
(https://github.com/AKSW/LODStats) can be used for analysing anything.
If you are not happy with our selection criteria in the service, you can
run your own LODStats installation, put a crawler in front and analyse
all the datasets you want. Only our service at stats.lod2.eu is a little
selective ;-)

Best,

Sören

> On 22 Jun 2012, at 01:30, Sören Auer wrote:
> 
>> On 21.06.2012 17:08, Hugh Glaser wrote:
>>> Hi.
>>> On 21 Jun 2012, at 11:40, Sören Auer wrote:
>>>
>>>> On 21.06.2012 12:03, Hugh Glaser wrote:
>>>>> Interesting question from Denny.
>>>>> I guess you don't do http://thedatahub.org/dataset/sameas-org
>>>>> for the same reason.
>>>>> And
>>>>> http://thedatahub.org/dataset/dbpedia-lite
>>>>> (Or at least I couldn't find them.)
>>>>>
>>>>> I'm not sure you should claim "all LOD datasets registered on CKAN"
>>>>
>>>> Depends on the definition of dataset - for me a dataset is something
>>>> available in bulk and not a pointer to a large space of URLs containing
>>>> some data fragments requiring extensive crawling.
>>> I can't agree with this.
>>> To rule out Linked Data that only provides Linked Data without SPARQL or dump and say it is not a "LOD Dataset" seems to be terribly restrictive.
>>
>> I would distinguish between Linked Data and a LOD dataset:
>>
>> For me (and I would assume most people) /dataset/ means a set of data,
>> i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data
>> repository.
>>
>> When the data adheres to the RDF data model and dereferenceable IRIs are
>> used, it's a /Linked Data dataset/.
>>
>> When licensed under an open license (according to the open definition),
>> it's a /Linked Open Data (LOD) dataset/.
>>
>> I agree that /Linked Data/ also comprises individual data resources
>> (either served independently or integrated into HTML as RDFa), but I
>> would not call these datasets, and also not open (if not licensed
>> according to the open definition). BTW: The open definition also
>> requires bulk data access! So we already have two reasons why the
>> concept "LOD dataset" should imply availability of bulk data. This is
>> also what we mention everywhere when describing LODStats.
>>
>> If you are interested in statistics about arbitrary Linked Data,
>> Sindice probably provides the better statistics.
>>
>>> For example, the eprints (eprints.org) Open Archives have upwards of 100M triples of pretty interesting (to some people) Linked Data.
>>
>> Maybe interesting, but if I have to crawl it in order to make use of it
>> the burden is way too high for most users.
>>
>>> It is mostly not in thedatahub, but even if it was you would ignore it.
>>> In fact, anything that is a wrapper around things like dbpedia, twitter, Facebook, or even Facebook itself is ignored, I am assuming from what you say.
>>
>> For DBpedia you don't need a wrapper - the whole dataset is available in
>> bulk. All others are, from my point of view, neither datasets nor open.
>> Maybe you can call them data services, where you can obtain one
>> individual data item at a time. And why would you want to call a wrapper
>> a dataset? A fundamental requirement for datasets would be, from my
>> point of view, that you can apply set operations like merging, joining
>> etc. You cannot do that with wrappers, so why should we call them
>> datasets?
>>
>>> To publish statistics that claims to collect "statistics from all LOD datasets" using a method that ignores such resources is to seriously underreport the LOD activity (not a Good Thing), and also is to publish what I can only say is misleading statistical reports about LOD in general.
>>> I leave aside that you also fail to collect statistics from more than half of the datasets you claim to be collecting.
>>
>> I agree that our figures are quite pessimistic, but in a way they
>> reflect what people really see: if there is no link to the dump in
>> thedatahub, the dataset is obviously difficult to find, and if
>> confusing/non-standard file extensions or dataset package formats are
>> used, this also makes it very difficult for people to actually use the
>> data. So I think it's better to be a little more pessimistic in this
>> case instead of reporting skyrocketing numbers all the time.
>>
>> Sören
>>
> 
>
Received on Friday, 22 June 2012 18:31:46 UTC