- From: Denny Vrandecic <denny.vrandecic@wikimedia.de>
- Date: Fri, 22 Jun 2012 11:30:22 +0200
- To: Sören Auer <auer@informatik.uni-leipzig.de>
- Cc: Hugh Glaser <hg@ecs.soton.ac.uk>, Linking Open Data <public-lod@w3.org>, SW-forum <semantic-web@w3.org>
According to your definition, then, LODStats is misnamed. It should be LOD Datasets Stats. Or am I misunderstanding something?

On 22 Jun 2012, at 01:30, Sören Auer wrote:

> On 21.06.2012 17:08, Hugh Glaser wrote:
>> Hi.
>> On 21 Jun 2012, at 11:40, Sören Auer wrote:
>>
>>> On 21.06.2012 12:03, Hugh Glaser wrote:
>>>> Interesting question from Denny.
>>>> I guess you don't do http://thedatahub.org/dataset/sameas-org
>>>> for the same reason.
>>>> And
>>>> http://thedatahub.org/dataset/dbpedia-lite
>>>> (Or at least I couldn't find them.)
>>>>
>>>> I'm not sure you should claim "all LOD datasets registered on CKAN"
>>>
>>> Depends on the definition of dataset - for me a dataset is something
>>> available in bulk and not a pointer to a large space of URLs containing
>>> some data fragments requiring extensive crawling.
>> I can't agree with this.
>> To rule out Linked Data that only provides Linked Data without SPARQL or dump and say it is not a "LOD Dataset" seems to be terribly restrictive.
>
> I would distinguish between Linked Data and a LOD dataset:
>
> For me (and I would assume most people) /dataset/ means a set of data,
> i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data
> repository.
>
> When the data adheres to the RDF data model and dereferenceable IRIs are
> used, it's a /Linked Data dataset/.
>
> When licensed under an open license (according to the open definition),
> it's a /Linked Open Data (LOD) dataset/.
>
> I agree that /Linked Data/ also comprises individual data resources
> (either independent or integrated into HTML as RDFa), but I would not
> call these datasets then, and also not open (if not licensed according to
> the open definition). BTW: The open definition also requires bulk data
> access! So we already have two reasons why the concept "LOD dataset"
> should imply availability of bulk data. This is also what we mention
> everywhere when describing LODStats.
>
> If you are interested in statistics about arbitrary Linked Data,
> Sindice probably provides the better statistics.
>
>> For example, the eprints (eprints.org) Open Archives have upwards of 100M triples of pretty interesting (to some people) Linked Data.
>
> Maybe interesting, but if I have to crawl it in order to make use of it,
> the burden is way too high for most users.
>
>> It is mostly not in thedatahub, but even if it was you would ignore it.
>> In fact, anything that is a wrapper around things like dbpedia, twitter, Facebook, or even Facebook itself is ignored, I am assuming from what you say.
>
> For DBpedia you don't need a wrapper - the whole dataset is available in
> bulk. All others are, from my point of view, neither datasets nor open.
> Maybe you can call them data services, where you can obtain one
> individual data item at a time. And why would you want to call a wrapper
> a dataset? A fundamental requirement for datasets would be, from my point
> of view, that you can apply set operations like merging, joining etc. You
> cannot do that with wrappers, so why should we call them datasets?
>
>> To publish statistics that claim to collect "statistics from all LOD datasets" using a method that ignores such resources is to seriously underreport the LOD activity (not a Good Thing), and also is to publish what I can only say is misleading statistical reports about LOD in general.
>> I leave aside that you also fail to collect statistics from more than half of the datasets you claim to be collecting.
>
> I agree that our figures are quite pessimistic, but in a way they
> reflect what people really see -- if there is no link to the dump in
> thedatahub, the dataset is obviously difficult to find; if
> confusing/non-standard file extensions or dataset package formats are
> used, this also makes it very difficult for people to actually use this
> data.
> So I think it's better to be a little more pessimistic in this
> case instead of reporting skyrocketing numbers all the time.
>
> Sören
Received on Friday, 22 June 2012 09:30:57 UTC