Re: any standard ways to detect void dataset descriptions? from Wouter Beek on 2018-04-26 (semantic-web@w3.org from April 2018)

From: Wouter Beek <w.g.j.beek@vu.nl>
Date: Thu, 26 Apr 2018 20:50:37 +0200
To: Axel Polleres <axel@polleres.net>
CC: Heiko Paulheim <heiko@informatik.uni-mannheim.de>, "semantic-web@w3.org" <semantic-web@w3.org>
Message-ID: <CAEh2WcMJEEkd-aU463EgwgCp6tUD8HCR31oK3TfeurNoQpTN+Q@mail.gmail.com>

Hi Axel, others,

Three years ago, I did a crawl based on Datahub metadata records and
VoID files from VoID store.  The results were pretty good at the time:
I encountered many errors, but also lots of data, resulting in the LOD
Laundromat dataset of 38B triples (http://lodlaundromat.org).

Unfortunately, when I tried to do the same scrape again one month ago,
I encountered _much_ less data in the LOD Cloud collection.  I was
disappointed, because the LOD Cloud picture has become _bigger_ in the
last two years.  But then again, the LOD Cloud picture is based on
human-entered metadata; the data itself is not always there (or it
is there, but it cannot be found by automated means).

I now believe that the best way forward is to manually create a list
of URLs from which data can be downloaded.  This may seem extreme, but
it is the last option I see after trying CKAN APIs, VoID, DCAT,
dereferencing IRIs, etc.  E.g., this is how I am able to find the
download locations of the BIO2RDF datasets:
https://github.com/wouterbeek/LOD-Index/blob/master/data/bio2rdf.ttl

Finally, when I tried to represent these download locations in VoID
and DCAT, I noticed that there are very common configurations that
cannot be described by these two vocabularies, e.g., it is not
possible to describe a distribution that consists of multiple files in
DCAT, nor is it possible to describe the RDF serialization format of
individual files in VoID.  These are pretty basic configurations,
e.g., DBpedia has distributions that consist of very many files, some
of which are in different serialization formats.  (To be clear: I
think it is great that people have invested time in creating these
vocabularies, and having them today is better than having nothing at
all, but they need several more iterations/revisions before they can
be used to model real-world data download locations.)
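
To make the configuration concrete, here is a minimal Python sketch of
the information a hand-made index would need to hold: one distribution
with many files, each with its own serialization format.  It is not
the format used in LOD-Index and it is not VoID or DCAT; the names
`Distribution` and `guess_media_type`, the extension table, and the
example.org URLs are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from urllib.parse import urlparse
import posixpath

# Assumed mapping from file extension to RDF media type (illustrative, not exhaustive).
RDF_MEDIA_TYPES = {
    ".nt": "application/n-triples",
    ".nq": "application/n-quads",
    ".ttl": "text/turtle",
    ".trig": "application/trig",
    ".rdf": "application/rdf+xml",
}
# Compression suffixes that are stripped before looking at the RDF extension.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}


def guess_media_type(url):
    """Guess the RDF media type from a URL's file extension, or return None."""
    root, ext = posixpath.splitext(urlparse(url).path)
    if ext.lower() in COMPRESSION_SUFFIXES:
        root, ext = posixpath.splitext(root)
    return RDF_MEDIA_TYPES.get(ext.lower())


@dataclass
class Distribution:
    """Hypothetical record: one distribution, many files, one format per file."""
    name: str
    files: dict = field(default_factory=dict)  # download URL -> media type (or None)

    def add(self, url):
        self.files[url] = guess_media_type(url)

    def by_format(self):
        """Group the download URLs by their guessed media type."""
        groups = defaultdict(list)
        for url, media_type in self.files.items():
            groups[media_type].append(url)
        return dict(groups)


if __name__ == "__main__":
    dist = Distribution("example-dump")
    dist.add("http://example.org/dump/labels.nt.bz2")
    dist.add("http://example.org/dump/ontology.ttl")
    for media_type, urls in dist.by_format().items():
        print(media_type, urls)
```

Guessing the format from the file extension is only a heuristic; a
real crawl would also have to look at the HTTP Content-Type header
and at the content itself.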

---
Cheers!,
Wouter.

Received on Thursday, 26 April 2018 18:51:58 UTC