- From: Axel Polleres <axel@polleres.net>
- Date: Thu, 26 Apr 2018 12:01:05 -0700
- To: Wouter Beek <w.g.j.beek@vu.nl>
- Cc: Heiko Paulheim <heiko@informatik.uni-mannheim.de>, "semantic-web@w3.org" <semantic-web@w3.org>, Andreas Harth <andreas@harth.org>, Giovanni Tummarello <g.tummarello@gmail.com>, "Dumontier, Michel (IDS)" <michel.dumontier@maastrichtuniversity.nl>
> I now believe that the best way forward is to manually create a list
> of URLs from which data can be downloaded. This may seem extreme, but
> it is the last option I see after trying CKAN APIs, VoID, DCAT,
> dereferencing IRIs, etc. E.g., this is how I am able to find the
> download locations of the BIO2RDF datasets:
> https://github.com/wouterbeek/LOD-Index/blob/master/data/bio2rdf.ttl

Needless to say, this approach of (essentially) seed-based crawling is how efforts like SWSE [1] and Sindice [2] started out (unfortunately both have been discontinued; cc'ing the leaders of those projects back then), pre-dating the LOD cloud diagram... I agree that we need to get back to that. This is exactly why we need a discussion about shared open common infrastructures; otherwise we will re-develop these things over and over again with little gain.

As for Bio2RDF: FWIW, I recently used a headless browser (Selenium) to crawl the datasets [3], as the download page is JavaScript-based. Not ideal, but it works (a minimal sketch follows right below). Michel (also cc'ed) will have more to say here, but I guess they are working on relaunching/improving the accessibility.
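For illustration only; this is not the actual crawler that was run. The wait time, the choice of Firefox, and the file-extension filter are assumptions that would need tuning per page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

# Run the browser without a display; Chrome works the same way.
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get("http://download.bio2rdf.org/#/release/4/")
    driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
    # The download links only exist in the DOM after the JavaScript
    # has run, which is why a plain HTTP fetch of the page fails here.
    for a in driver.find_elements(By.TAG_NAME, "a"):
        href = a.get_attribute("href")
        if href and href.endswith((".nq.gz", ".nt.gz", ".ttl.gz")):
            print(href)
finally:
    driver.quit()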
cheers,
Axel

1. Andreas Harth, Aidan Hogan, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker: Searching and Browsing Linked Data with SWSE. Semantic Search over the Web 2012: 361-414
2. Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello: Sindice.com: a document-oriented lookup index for open linked data. IJMSO 3(1): 37-52 (2008)
3. http://download.bio2rdf.org/#/release/4/

--
Dr. Axel Polleres
url: http://www.polleres.net/
twitter: @AxelPolleres

> On 26.04.2018, at 11:50, Wouter Beek <w.g.j.beek@vu.nl> wrote:
>
> Hi Axel, others,
>
> Three years ago, I did a crawl based on Datahub metadata records and
> VoID files from the VoID store. The results were pretty good at the
> time: I encountered many errors, but also lots of data, resulting in
> the LOD Laundromat dataset of 38B triples (http://lodlaundromat.org).
>
> Unfortunately, when I tried to do the same scrape again one month ago,
> I encountered _much_ less data in the LOD Cloud collection. I was
> disappointed, because the LOD Cloud picture has become _bigger_ in the
> last two years. But then again, the LOD Cloud picture is based on
> human-entered metadata; the data itself is not always there... (or it
> is there, but it cannot be found by automated means).
>
> I now believe that the best way forward is to manually create a list
> of URLs from which data can be downloaded. This may seem extreme, but
> it is the last option I see after trying CKAN APIs, VoID, DCAT,
> dereferencing IRIs, etc. E.g., this is how I am able to find the
> download locations of the BIO2RDF datasets:
> https://github.com/wouterbeek/LOD-Index/blob/master/data/bio2rdf.ttl
>
> Finally, when I tried to represent these download locations in VoID
> and DCAT, I noticed that there are very common configurations that
> cannot be described by these two vocabularies: e.g., it is not
> possible to describe a distribution that consists of multiple files
> in DCAT, nor is it possible to describe the RDF serialization format
> of individual files in VoID. These are pretty basic configurations;
> e.g., DBpedia has distributions that consist of very many files, some
> of which are in different serialization formats.
>
> (To be clear: I think it is great that people have invested time in
> creating these vocabularies, and having them today is better than
> having nothing at all, but they need several more iterations/revisions
> before they can be used to model real-world data download locations.)
>
> ---
> Cheers!,
> Wouter.
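As a sketch of how such a hand-curated seed list could be consumed downstream: NB, this does not reflect the vocabulary LOD-Index actually uses; dcat:downloadURL is an assumption here, purely to illustrate the idea.

from rdflib import Graph, Namespace

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
# Raw-content URL corresponding to the GitHub page linked above.
g.parse("https://raw.githubusercontent.com/wouterbeek/LOD-Index/"
        "master/data/bio2rdf.ttl", format="turtle")
# Every dcat:downloadURL object becomes a seed URL for the crawler.
for _, _, url in g.triples((None, DCAT.downloadURL, None)):
    print(url)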
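And to make the DCAT limitation concrete, a minimal rdflib sketch with invented IRIs: one dcat:Distribution may list several dcat:downloadURL values, but dct:format describes the distribution as a whole, so per-file serialization formats cannot be expressed.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
EX = Namespace("http://example.org/")  # invented, for illustration

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

dist = EX["dbpedia-dump"]
g.add((dist, RDF.type, DCAT.Distribution))
# Two files that together form one logical distribution ...
g.add((dist, DCAT.downloadURL, EX["labels.ttl.gz"]))
g.add((dist, DCAT.downloadURL, EX["abstracts.nt.gz"]))
# ... but only one format slot for both, although one file is Turtle
# and the other N-Triples. (Bracket access avoids the clash with
# str.format on older rdflib Namespace objects.)
g.add((dist, DCTERMS["format"], Literal("text/turtle")))

print(g.serialize(format="turtle"))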
Received on Thursday, 26 April 2018 19:01:36 UTC