- From: Axel Polleres <axel@polleres.net>
- Date: Thu, 26 Apr 2018 12:01:05 -0700
- To: Wouter Beek <w.g.j.beek@vu.nl>
- Cc: Heiko Paulheim <heiko@informatik.uni-mannheim.de>, "semantic-web@w3.org" <semantic-web@w3.org>, Andreas Harth <andreas@harth.org>, Giovanni Tummarello <g.tummarello@gmail.com>, "Dumontier, Michel (IDS)" <michel.dumontier@maastrichtuniversity.nl>
> I now believe that the best way forward is to manually create a list
> of URLs from which data can be downloaded. This may seem extreme, but
> it is the last option I see after trying CKAN APIs, VoID, DCAT,
> dereferencing IRIs, etc. E.g., this is how I am able to find the
> download locations of the BIO2RDF datasets:
> https://github.com/wouterbeek/LOD-Index/blob/master/data/bio2rdf.ttl

Needless to say, this approach of (essentially) seed-based crawling is how efforts like SWSE [1] and Sindice [2] started out (unfortunately both have been discontinued; cc'ing the leaders of those projects back then), pre-dating the LOD cloud diagram... I agree that we need to get back to that. This is exactly why we need a discussion about shared open common infrastructures; otherwise we will re-develop these things over and over again with little gain.

As for Bio2RDF: FWIW, I recently used a headless browser (Selenium) to crawl the datasets [3], as the download page is JavaScript-based. Not ideal, but it works (a minimal sketch follows right below). Michel (also cc'ed) will have more to say here, but I guess they are working on relaunching/improving the accessibility.
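For illustration only; this is not the actual crawler that was run. The wait time, the choice of Firefox, and the file-extension filter are assumptions that would need tuning per page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

# Run the browser without a display; Chrome works the same way.
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get("http://download.bio2rdf.org/#/release/4/")
    driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
    # The download links only exist in the DOM after the JavaScript
    # has run, which is why a plain HTTP fetch of the page fails here.
    for a in driver.find_elements(By.TAG_NAME, "a"):
        href = a.get_attribute("href")
        if href and href.endswith((".nq.gz", ".nt.gz", ".ttl.gz")):
            print(href)
finally:
    driver.quit()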
cheers,
Axel

1. Andreas Harth, Aidan Hogan, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker: Searching and Browsing Linked Data with SWSE. Semantic Search over the Web 2012: 361-414
2. Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello: Sindice.com: a document-oriented lookup index for open linked data. IJMSO 3(1): 37-52 (2008)
3. http://download.bio2rdf.org/#/release/4/

--
Dr. Axel Polleres
url: http://www.polleres.net/
twitter: @AxelPolleres

> On 26.04.2018, at 11:50, Wouter Beek <w.g.j.beek@vu.nl> wrote:
>
> Hi Axel, others,
>
> Three years ago, I did a crawl based on Datahub metadata records and
> VoID files from the VoID store. The results were pretty good at the
> time: I encountered many errors, but also lots of data, resulting in
> the LOD Laundromat dataset of 38B triples (http://lodlaundromat.org).
>
> Unfortunately, when I tried to do the same scrape again one month ago,
> I encountered _much_ less data in the LOD Cloud collection. I was
> disappointed, because the LOD Cloud picture has become _bigger_ in the
> last two years. But then again, the LOD Cloud picture is based on
> human-entered metadata; the data itself is not always there... (or it
> is there, but it cannot be found by automated means).
>
> I now believe that the best way forward is to manually create a list
> of URLs from which data can be downloaded. This may seem extreme, but
> it is the last option I see after trying CKAN APIs, VoID, DCAT,
> dereferencing IRIs, etc. E.g., this is how I am able to find the
> download locations of the BIO2RDF datasets:
> https://github.com/wouterbeek/LOD-Index/blob/master/data/bio2rdf.ttl
>
> Finally, when I tried to represent these download locations in VoID
> and DCAT, I noticed that there are very common configurations that
> cannot be described by these two vocabularies: e.g., it is not
> possible to describe a distribution that consists of multiple files
> in DCAT, nor is it possible to describe the RDF serialization format
> of individual files in VoID. These are pretty basic configurations;
> e.g., DBpedia has distributions that consist of very many files, some
> of which are in different serialization formats.
>
> (To be clear: I think it is great that people have invested time in
> creating these vocabularies, and having them today is better than
> having nothing at all, but they need several more iterations/revisions
> before they can be used to model real-world data download locations.)
>
> ---
> Cheers!,
> Wouter.
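As a sketch of how such a hand-curated seed list could be consumed downstream: NB, this does not reflect the vocabulary LOD-Index actually uses; dcat:downloadURL is an assumption here, purely to illustrate the idea.

from rdflib import Graph, Namespace

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
# Raw-content URL corresponding to the GitHub page linked above.
g.parse("https://raw.githubusercontent.com/wouterbeek/LOD-Index/"
        "master/data/bio2rdf.ttl", format="turtle")
# Every dcat:downloadURL object becomes a seed URL for the crawler.
for _, _, url in g.triples((None, DCAT.downloadURL, None)):
    print(url)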
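And to make the DCAT limitation concrete, a minimal rdflib sketch with invented IRIs: one dcat:Distribution may list several dcat:downloadURL values, but dct:format describes the distribution as a whole, so per-file serialization formats cannot be expressed.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
EX = Namespace("http://example.org/")  # invented, for illustration

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

dist = EX["dbpedia-dump"]
g.add((dist, RDF.type, DCAT.Distribution))
# Two files that together form one logical distribution ...
g.add((dist, DCAT.downloadURL, EX["labels.ttl.gz"]))
g.add((dist, DCAT.downloadURL, EX["abstracts.nt.gz"]))
# ... but only one format slot for both, although one file is Turtle
# and the other N-Triples. (Bracket access avoids the clash with
# str.format on older rdflib Namespace objects.)
g.add((dist, DCTERMS["format"], Literal("text/turtle")))

print(g.serialize(format="turtle"))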
Received on Thursday, 26 April 2018 19:01:36 UTC