- From: Sören Auer <auer@informatik.uni-leipzig.de>
- Date: Fri, 22 Jun 2012 01:30:28 +0200
- To: Hugh Glaser <hg@ecs.soton.ac.uk>
- CC: Linking Open Data <public-lod@w3.org>, SW-forum <semantic-web@w3.org>
On 21.06.2012 17:08, Hugh Glaser wrote:
> Hi.
> On 21 Jun 2012, at 11:40, Sören Auer wrote:
>
>> On 21.06.2012 12:03, Hugh Glaser wrote:
>>> Interesting question from Denny.
>>> I guess you don't do http://thedatahub.org/dataset/sameas-org
>>> for the same reason.
>>> And
>>> http://thedatahub.org/dataset/dbpedia-lite
>>> (Or at least I couldn't find them.)
>>>
>>> I'm not sure you should claim "all LOD datasets registered on CKAN"
>>
>> Depends on the definition of dataset - for me a dataset is something
>> available in bulk and not a pointer to a large space of URLs containing
>> some data fragments requiring extensive crawling.
> I can't agree with this.
> To rule out Linked Data that only provides Linked Data without SPARQL or dump and say it is not a "LOD Dataset" seems to be terribly restrictive.

I would distinguish between Linked Data and a LOD dataset:

For me (and, I would assume, most people) /dataset/ means a set of data, i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data repository. When the data adheres to the RDF data model and dereferenceable IRIs are used, it is a /Linked Data dataset/. When it is licensed under an open license (according to the open definition), it is a /Linked Open Data (LOD) dataset/.

I agree that /Linked Data/ also comprises individual data resources, either published independently or integrated into HTML as RDFa, but I would not call these datasets, and also not open (if not licensed according to the open definition).

BTW: The open definition also requires bulk data access! So we already have two reasons why the concept "LOD dataset" should imply availability of bulk data. This is also what we mention everywhere when describing LODStats. If you are interested in statistics about arbitrary Linked Data, Sindice probably provides the better statistics.

> For example, the eprints (eprints.org) Open Archives have upwards of 100M triples of pretty interesting (to some people) Linked Data.
Maybe interesting, but if I have to crawl it in order to make use of it, the burden is way too high for most users.

> It is mostly not in thedatahub, but even if it was you would ignore it.
> In fact, anything that is a wrapper around things like dbpedia, twitter, Facebook, or even Facebook itself is ignored, I am assuming from what you say.

For DBpedia you don't need a wrapper: the whole dataset is available in bulk. All the others are, from my point of view, neither datasets nor open. Maybe you can call them data services, where you can obtain one individual data item at a time.

And why would you want to call a wrapper a dataset? A fundamental requirement for datasets would be, from my point of view, that you can apply set operations like merging, joining etc. You cannot do that with wrappers, so why should we call them datasets?

> To publish statistics that claims to collect "statistics from all LOD datasets" using a method that ignores such resources is to seriously underreport the LOD activity (not a Good Thing), and also is to publish what I can only say is misleading statistical reports about LOD in general.
> I leave aside that you also fail to collect statistics from more than half of the datasets you claim to be collecting.

I agree that our figures are quite pessimistic, but in a way they reflect what people really see. If there is no link to the dump in thedatahub, the dataset is obviously difficult to find. If confusing or non-standard file extensions or dataset package formats are used, this also makes it very difficult for people to actually use the data.

So I think it is better to be a little more pessimistic in this case instead of reporting skyrocketing numbers all the time.

Sören
Received on Thursday, 21 June 2012 23:30:54 UTC