- From: Wouter Beek <w.g.j.beek@vu.nl>
- Date: Wed, 22 Jun 2016 20:43:20 +0200
- To: Harry Halpin <hhalpin@ibiblio.org>
- CC: Semantic Web <semantic-web@w3.org>
- Message-ID: <CAEh2WcPbbF=NHLCB6vBuWi4nkmr1TQ6Q8a1_S4KyY50Efkz+pQ@mail.gmail.com>
Hi Harry,

On Wed, Jun 22, 2016 at 4:55 PM, Harry Halpin <hhalpin@ibiblio.org> wrote:
> Are there any data-sets available that are realistic snapshots of the
> state of linked data in 2016?

The tricky bit is in the word "realistic". We have hundreds of thousands of data documents in LOD Laundromat and LOTUS, and we are often asked what a representative subset of those documents would be for, e.g., running evaluations over. We have no idea. We know that some services generate tens of thousands of data documents all by themselves, but we do not know whether that is 'fair' or 'unfair'.

When we reran (part of) the experiment of Schmachtenberg et al. ("Adoption of the linked data best practices in different topical domains", ISWC 2014), we got very different results. One reason for this is that the 'original' LOD Cloud contains a lot of vocabularies and human-crafted knowledge, while LOD Laundromat picks up a lot of statistical and machine-generated data documents that very much skew the statistics. An illustration of this is the list of most frequently occurring namespaces: http://wouterbeek.github.io/pres/2015-10-15-LOD-Lab.html#/10

A good solution to this problem would IMO require a more stringent definition of some of the basic concepts that we currently use only intuitively. E.g., what is a dataset? "A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider." (VoID <https://www.w3.org/TR/void/>) This makes it an inherently social concept, but we are not using methods from sociology to study this phenomenon ATM.

---

Cheers!,
Wouter.
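For readers curious how a namespace-frequency ranking like the one linked above might be computed, here is a minimal sketch over N-Triples input. This is an illustration only, not the LOD Lab code: the function names and the heuristic of cutting an IRI at its last '#' or '/' to obtain its namespace are my assumptions.

```python
import re
from collections import Counter

def namespace(iri):
    """Heuristic namespace of an IRI: everything up to and
    including the last '#' or '/' (assumption, not a standard rule)."""
    cut = max(iri.rfind('#'), iri.rfind('/'))
    return iri[:cut + 1] if cut != -1 else iri

def namespace_frequencies(ntriples):
    """Count namespace occurrences across all <...> IRIs
    in an N-Triples document given as a string."""
    counts = Counter()
    for iri in re.findall(r'<([^>]*)>', ntriples):
        counts[namespace(iri)] += 1
    return counts

# Hypothetical two-triple document for demonstration.
doc = """
<http://example.org/data/s1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/vocab#Thing> .
<http://example.org/data/s2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/vocab#Thing> .
"""
counts = namespace_frequencies(doc)
print(counts.most_common(3))
```

Run over a corpus of machine-generated documents, a handful of namespaces from one generating service can dominate such a ranking, which is exactly the skew described above.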
Received on Wednesday, 22 June 2016 18:44:31 UTC