- From: Wouter Beek <w.g.j.beek@vu.nl>
- Date: Thu, 26 Apr 2018 20:50:37 +0200
- To: Axel Polleres <axel@polleres.net>
- CC: Heiko Paulheim <heiko@informatik.uni-mannheim.de>, "semantic-web@w3.org" <semantic-web@w3.org>
Hi Axel, others,

Three years ago, I did a crawl based on Datahub metadata records and VoID files from the VoID store. The results were pretty good at the time: I encountered many errors, but also lots of data, resulting in the LOD Laundromat dataset of 38B triples (http://lodlaundromat.org).

Unfortunately, when I tried to do the same scrape again one month ago, I encountered _much_ less data in the LOD Cloud collection. I was disappointed, because the LOD Cloud picture has become _bigger_ in the last two years. But then again, the LOD Cloud picture is based on human-entered metadata; the data itself is not always there (or it is there, but it cannot be found by automated means).

I now believe that the best way forward is to manually create a list of URLs from which data can be downloaded. This may seem extreme, but it is the last option I see after trying CKAN APIs, VoID, DCAT, dereferencing IRIs, etc. E.g., this is how I am able to find the download locations of the BIO2RDF datasets: https://github.com/wouterbeek/LOD-Index/blob/master/data/bio2rdf.ttl

Finally, when I tried to represent these download locations in VoID and DCAT, I noticed that there are very common configurations that cannot be described by these two vocabularies. For example, it is not possible to describe a distribution that consists of multiple files in DCAT, nor is it possible to describe the RDF serialization format of individual files in VoID. These are pretty basic configurations: DBpedia, for instance, has distributions that consist of very many files, some of which are in different serialization formats.

(To be clear: I think it is great that people have invested time in creating these vocabularies, and having them today is better than having nothing at all, but they need several more iterations/revisions before they can be used to model real-world data download locations.)

---
Cheers!,
Wouter.
Received on Thursday, 26 April 2018 18:51:58 UTC