Re: [Ann] LODStats - Real-time Data Web Statistics from Denny Vrandecic on 2012-06-22 (semantic-web@w3.org from June 2012)

From: Denny Vrandecic <denny.vrandecic@wikimedia.de>
Date: Fri, 22 Jun 2012 11:30:22 +0200
To: Sören Auer <auer@informatik.uni-leipzig.de>
Cc: Hugh Glaser <hg@ecs.soton.ac.uk>, Linking Open Data <public-lod@w3.org>, SW-forum <semantic-web@w3.org>
Message-Id: <800F0E85-1CFF-48AD-92A4-5522C9376DBF@wikimedia.de>

By your definition, then, LODStats is misnamed.
It should be LOD Datasets Stats.

Or am I misunderstanding something?


On 22 Jun 2012, at 01:30, Sören Auer wrote:

> On 21.06.2012 17:08, Hugh Glaser wrote:
>> Hi.
>> On 21 Jun 2012, at 11:40, Sören Auer wrote:
>> 
>>> On 21.06.2012 12:03, Hugh Glaser wrote:
>>>> Interesting question from Denny.
>>>> I guess you don't do http://thedatahub.org/dataset/sameas-org
>>>> for the same reason.
>>>> And
>>>> http://thedatahub.org/dataset/dbpedia-lite
>>>> (Or at least I couldn't find them.)
>>>> 
>>>> I'm not sure you should claim "all LOD datasets registered on CKAN"
>>> 
>>> Depends on the definition of dataset - for me a dataset is something
>>> available in bulk and not a pointer to a large space of URLs containing
>>> some data fragments requiring extensive crawling.
>> I can't agree with this.
>> To rule out Linked Data that only provides Linked Data without SPARQL or dump and say it is not a "LOD Dataset" seems to be terribly restrictive.
> 
> I would distinguish between Linked Data and a LOD dataset:
> 
> For me (and I would assume most people) /dataset/ means a set of data,
> i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data
> repository.
> 
> When the data adheres to the RDF data model and dereferenceable IRIs are
> used, it's a /Linked Data dataset/.
> 
> When licensed under an open license (according to the open definition),
> it's a /Linked Open Data (LOD) dataset/.
> 
> I agree that /Linked Data/ also comprises individual data resources
> (either standalone or integrated into HTML as RDFa), but I would not
> call these datasets, and also not open (if not licensed according to
> the open definition). BTW: The open definition also requires bulk data
> access! So we already have two reasons why the concept "LOD dataset"
> should imply availability of bulk data. This is also what we mention
> everywhere when describing LODStats.
> 
> If you are interested in statistics about arbitrary Linked Data,
> Sindice probably provides better statistics.
> 
>> For example, the eprints (eprints.org) Open Archives have upwards of 100M triples of pretty interesting (to some people) Linked Data.
> 
> Maybe interesting, but if I have to crawl it in order to make use of it
> the burden is way too high for most users.
> 
>> It is mostly not in thedatahub, but even if it was you would ignore it.
>> In fact, anything that is a wrapper around things like dbpedia, twitter, Facebook, or even Facebook itself is ignored, I am assuming from what you say.
> 
> For DBpedia you don't need a wrapper - the whole dataset is available in
> bulk. All others are from my point of view neither datasets nor open.
> Maybe you can call them data services, where you can obtain an
> individual data item at a time. And why would you want to call a wrapper
> a dataset? A fundamental requirement for datasets would be, from my point
> of view, that you can apply set operations like merging, joining etc. You
> cannot do that with wrappers, so why should we call them datasets?
> 
>> To publish statistics that claim to collect "statistics from all LOD datasets" using a method that ignores such resources is to seriously underreport the LOD activity (not a Good Thing), and also is to publish what I can only say are misleading statistical reports about LOD in general.
>> I leave aside that you also fail to collect statistics from more than half of the datasets you claim to be collecting.
> 
> I agree that our figures are quite pessimistic, but in a way they
> reflect what people really see -- if there is no link to the dump in
> thedatahub, the dataset is obviously difficult to find; if
> confusing/non-standard file extensions or dataset package formats are
> used, this also makes it very difficult for people to actually use the
> data. So I think it's better to be a little more pessimistic in this
> case instead of reporting skyrocketing numbers all the time.
> 
> Sören
>
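
For reference, the three-step definition quoted above (dataset, Linked Data dataset, LOD dataset) can be sketched as a small classifier. This is only an illustration of how I read the definition, not how LODStats works; the `Resource` class, its field names, and the returned labels are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Hypothetical flags describing a published resource (not a LODStats or CKAN schema).
    has_dump: bool = False              # downloadable dump available
    has_sparql: bool = False            # bulk access via a SPARQL endpoint
    uses_rdf: bool = False              # data adheres to the RDF data model
    dereferenceable_iris: bool = False  # IRIs can be dereferenced
    open_license: bool = False          # licensed according to the open definition


def classify(r: Resource) -> str:
    """Apply the quoted definition: bulk access first, then RDF + IRIs, then license."""
    bulk = r.has_dump or r.has_sparql
    linked = r.uses_rdf and r.dereferenceable_iris
    if not bulk:
        # Without bulk access it is not a dataset under this definition,
        # although it may still be Linked Data.
        return "Linked Data (not a dataset)" if linked else "not a dataset"
    if not linked:
        return "dataset"
    if r.open_license:
        return "LOD dataset"
    return "Linked Data dataset"


# A crawl-only Linked Data source and a wrapper-style source, as discussed in the thread.
crawl_only = Resource(uses_rdf=True, dereferenceable_iris=True, open_license=True)
bulk_open = Resource(has_dump=True, uses_rdf=True, dereferenceable_iris=True, open_license=True)
print(classify(crawl_only))
print(classify(bulk_open))
```

Under this reading, a crawl-only source never reaches the "LOD dataset" label regardless of its license, which is exactly the point of disagreement in the thread.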

Received on Friday, 22 June 2012 09:30:57 UTC