- From: Christian Bizer <chris@bizer.de>
- Date: Mon, 8 Sep 2014 10:03:20 +0200
- To: "'Olaf Hartig'" <ohartig@uwaterloo.ca>
- Cc: <max@informatik.uni-mannheim.de>, <chris@informatik.uni-mannheim.de>, <heiko@informatik.uni-mannheim.de>, <public-lod@w3.org>
Hi Olaf,

you asked about the number of Linked Datasets on the Web reported in our paper [1] and in the new LOD cloud diagram [2]. As the numbers might also confuse other people, I have put the LOD mailing list in cc.

We seeded our crawl with a large number of URIs from the BTC2012 crawl and the datahub.io catalog, plus some URIs from datasets mentioned on the LOD list. Our crawler retrieved RDF data from 1014 data sources [3]. It was blocked by 77 Linked Data sources via robots.txt. Together, these two numbers give the 1091 Linked Datasets that we report as the overall number in our paper.

Unfortunately, only 397 of the crawled datasets were linked to each other via RDF links (that our crawler discovered), and we thus included only these datasets [4] in the "Crawlable LOD Cloud 2014" [5].

Please note that this does not mean that there are no other crawlable Linked Datasets: we did not do an extensive crawl, so our crawler might have missed some datasets. As our crawler only gathers a data sample from each source, it might also have missed some RDF links between datasets.

We thus asked via the mailing list for pointers to additional datasets that we had missed so far, and for meta-information about these datasets to be entered into the datahub.io catalog. This call resulted in quite some feedback, and we drew the LOD cloud 2014 [2] taking this feedback into account. The 570 datasets contained in the new version thus include

1. datasets that we did crawl,
2. datasets that our crawler discovered but did not crawl due to robots.txt,
3. additional datasets that resulted from our call for feedback,
4. additional datasets that became linked by adding the datasets from items 2 and 3.

As with the previous versions of the cloud, we only included datasets that are connected to other datasets in the cloud.
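
The counting and the inclusion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset names and links are made up, the function name connected_datasets is hypothetical, and "connected" is assumed to mean having at least one RDF link to or from a different dataset. It is not the code that was used to produce the diagram.

```python
# Counts reported in the message above.
CRAWLED_SOURCES = 1014       # sources the crawler retrieved RDF data from
BLOCKED_BY_ROBOTS_TXT = 77   # sources that blocked the crawler via robots.txt
TOTAL_REPORTED = CRAWLED_SOURCES + BLOCKED_BY_ROBOTS_TXT


def connected_datasets(links):
    """Return datasets with at least one RDF link to or from a different dataset.

    `links` is an iterable of (source_dataset, target_dataset) pairs.
    Links that stay within a single dataset are ignored (an assumption).
    """
    connected = set()
    for source, target in links:
        if source != target:
            connected.add(source)
            connected.add(target)
    return connected


# Hypothetical toy data, not taken from the crawl.
all_datasets = {"a", "b", "c", "d"}
links = [("a", "b"), ("c", "c")]  # "c" links only to itself, "d" has no links

in_cloud = connected_datasets(links) & all_datasets
print(TOTAL_REPORTED)    # 1091
print(sorted(in_cloud))  # ['a', 'b']
```

In the toy data, "c" and "d" are dropped because neither has a link to another dataset, which mirrors why only a subset of the crawled datasets appears in the diagram.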

After finishing the diagram, we checked for how many of the 570 datasets the datahub.io catalog contains meta-information. It turned out that 374 datasets are described in the catalog, while 196 datasets were not yet described there. For these datasets, we added the meta-information that we extracted from the crawled data to the catalog using the lodcloud2014 organization [6], in order to keep human- and machine-generated data separate.

Parallel to our efforts, Tobias Käfer and Andreas Harth from KIT have conducted a much larger crawl of the Linked Data web and now offer the resulting dataset for download [7]. They are currently analyzing their data, and it will be interesting to see to what extent their results confirm our findings and how many additional datasets their crawler discovered.

Cheers,

Chris

[1] http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/
[2] http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/
[3] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset?tags=LinkedDataCrawl2014
[4] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset?tags=crawledLinkedDataCloud2014
[5] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/LODCloudDiagram.html
[6] http://datahub.io/organization/lodcloud2014
[7] http://km.aifb.kit.edu/projects/btc-2014/

-----Original Message-----
From: Olaf Hartig [mailto:ohartig@uwaterloo.ca]
Sent: Saturday, 6 September 2014 15:43
To: max@informatik.uni-mannheim.de; chris@informatik.uni-mannheim.de; heiko@informatik.uni-mannheim.de
Subject: # of datasets in LOD cloud diagram

Hi Max, Chris, Heiko,

According to your web page on the current LOD cloud diagram (http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/), the diagram shows 570 datasets. However, the predecessor version, which you produced as part of your ISWC'14 paper, contains (according to your paper, Table 1) 1091 datasets.

But on the web page mentioned above, there is suddenly only mention of 196 datasets that your crawl discovered. How are these numbers to be understood? Why are not all 1091 datasets included in the current diagram?

Best regards,
Olaf

Received on Monday, 8 September 2014 08:03:46 UTC