Re: Semantic Web Search engines/Billion Triple Challenge or other data-sets? from Wouter Beek on 2016-06-22 (semantic-web@w3.org from June 2016)

From: Wouter Beek <w.g.j.beek@vu.nl>
Date: Wed, 22 Jun 2016 20:43:20 +0200
To: Harry Halpin <hhalpin@ibiblio.org>
CC: Semantic Web <semantic-web@w3.org>
Message-ID: <CAEh2WcPbbF=NHLCB6vBuWi4nkmr1TQ6Q8a1_S4KyY50Efkz+pQ@mail.gmail.com>

Hi Harry,

On Wed, Jun 22, 2016 at 4:55 PM, Harry Halpin <hhalpin@ibiblio.org> wrote:

> Are there any data-sets available that are realistic snapshots of the
> state of linked data in 2016?
>
The tricky bit is in the word "realistic".  We have hundreds of thousands
of data documents in LOD Laundromat and LOTUS.  We are often asked what
would be a representative subset of those documents to, e.g., run
evaluations over.  We have no idea.  We know that some services generate
tens of thousands of data documents all by themselves, but we do not know
whether including all of them in a sample is 'fair' or 'unfair'.

When we reran (part of) the experiment of Schmachtenberg et al. ("Adoption
of the linked data best practices in different topical domains", ISWC
2014), we got very different results.  One of the reasons for this is that
the 'original' LOD Cloud contains a lot of vocabularies and human-crafted
knowledge, while LOD Laundromat picks up a lot of statistical and
machine-generated data documents that very much skew the statistics.  An
illustration of this is the list of most frequently occurring namespaces:
http://wouterbeek.github.io/pres/2015-10-15-LOD-Lab.html#/10
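
To make the skew concrete, here is a minimal sketch.  It is not how LOD
Laundromat computes its statistics.  It assumes documents are already
parsed into (subject, predicate, object) string triples, and it uses a
simple heuristic for the namespace of an IRI: everything up to the last
'#' or '/'.  The function names and the toy data are made up for
illustration.

```python
from collections import Counter


def namespace(iri: str) -> str:
    """Heuristic namespace: the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1] if cut >= 0 else iri


def namespace_counts(documents):
    """Count predicate namespaces per triple and per document.

    `documents` maps a document name to a list of (s, p, o) string triples.
    """
    per_triple = Counter()    # every triple counts once
    per_document = Counter()  # every document counts once per namespace
    for triples in documents.values():
        seen = set()
        for _s, p, _o in triples:
            ns = namespace(p)
            per_triple[ns] += 1
            seen.add(ns)
        per_document.update(seen)
    return per_triple, per_document


# Toy data (hypothetical): one small hand-crafted vocabulary document and
# one large machine-generated statistical document.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
documents = {
    "vocab.ttl": [
        ("http://example.org/v#A", RDFS_LABEL, "A"),
        ("http://example.org/v#B", RDFS_LABEL, "B"),
    ],
    "stats-1.nt": [
        (f"http://example.org/obs/{i}", "http://example.org/stat/value", str(i))
        for i in range(1000)
    ],
}

per_triple, per_document = namespace_counts(documents)
```

Counted per triple, the machine-generated namespace dominates by a factor
of 500; counted per document, the two namespaces tie.  Which of the two
counts is the 'right' one is exactly the open question.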

A good solution to this problem would IMO require a more stringent
definition of some of the basic concepts that we are currently using only
intuitively.  E.g., what is a dataset?  "A dataset is a set of RDF triples
that are published, maintained or aggregated by a single provider." (VoID
<https://www.w3.org/TR/void/>)  This makes it an inherently social concept,
but we are not using methods from sociology to study this phenomenon ATM.
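
As a rough illustration of why this is hard to operationalize, the sketch
below groups document URLs by host name.  Treating one host as one
"single provider" is purely an assumption made for this example, not
something VoID prescribes: one provider can publish from many hosts, and
one host can serve many providers.  The function name and URLs are
hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(doc_urls):
    """Group document URLs by host name, a crude stand-in for 'provider'."""
    groups = defaultdict(list)
    for url in doc_urls:
        # hostname is lower-cased by urlsplit; it is None for URLs without one
        groups[urlsplit(url).hostname or ""].append(url)
    return dict(groups)


urls = [
    "http://stats.example.org/dump/a.nt",
    "http://stats.example.org/dump/b.nt",
    "http://vocab.example.com/v.ttl",
]
groups = group_by_host(urls)
```

Under this proxy the three documents collapse into two candidate
datasets, so a service that emits tens of thousands of documents would
count once rather than tens of thousands of times.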

---
Cheers,
Wouter.

Received on Wednesday, 22 June 2016 18:44:31 UTC