Re: # of datasets in LOD cloud diagram

Hi Olaf,

you asked about the number of Linked Datasets on the Web reported in our
paper [1] and shown in the new LOD cloud diagram [2].
As the numbers might also confuse other people, I have put the LOD mailing
list in cc.

We seeded our crawl with a large number of URIs from the BTC2012 crawl, the
datahub.io catalog, and some URIs from datasets mentioned on the LOD list.

Our crawler retrieved RDF data from 1014 data sources [3]. It was blocked
by 77 Linked Data sources via robots.txt. Together, these two numbers give
the 1091 Linked Datasets that we report as the overall number in our paper.
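
For illustration only, the robots.txt check and the resulting tally can be
sketched as follows. This is a minimal Python sketch and not the actual
crawler code: the user-agent string, the example URL and the two robots.txt
contents are made-up assumptions; only the counts 1014 and 77 come from the
text above.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str,
               user_agent: str = "example-ld-crawler") -> bool:
    """Return True if the given robots.txt content disallows fetching url.

    The user agent name is a hypothetical placeholder.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that disallows all crawlers.
ROBOTS_DISALLOW_ALL = "User-agent: *\nDisallow: /\n"
# Hypothetical robots.txt that allows everything (empty Disallow).
ROBOTS_ALLOW_ALL = "User-agent: *\nDisallow:\n"

crawled = 1014  # sources from which RDF data was retrieved
blocked = 77    # sources that blocked the crawler via robots.txt
total = crawled + blocked  # overall number reported in the paper
```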

Unfortunately, only 397 of the crawled datasets were linked to each other
via RDF links (that our crawler discovered), and we thus included only these
datasets [4] in the "Crawlable LOD Cloud 2014" [5].

Please note that this does not mean that there are no other crawlable Linked
Datasets, as we did not do an extensive crawl and our crawler might thus
have missed some datasets. As our crawler only gathers a data sample from
each source, it might also have missed some RDF links between datasets.

We thus asked on the mailing list for pointers to additional datasets that
we had missed so far and for meta-information about these datasets to be
entered into the datahub.io catalog. This call resulted in quite some
feedback, and we drew the LOD cloud 2014 [2] taking this feedback into
account. The 570 datasets contained in the new version thus include:

1. datasets that we did crawl,
2. datasets that our crawler discovered but did not crawl due to robots.txt,
3. additional datasets that resulted from our call for feedback, and
4. additional datasets that became linked by adding the datasets from items
2 and 3.

As with the previous versions of the cloud, we only included datasets that
are connected to other datasets in the cloud.
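
The "connected" criterion can be illustrated with a small sketch. It is an
assumption-laden simplification, not the procedure used for the diagram: it
treats a dataset as connected if it shares at least one RDF link (in either
direction) with another dataset in the set, and the dataset names A to E are
placeholders. It also shows how a dataset can become linked only after
another dataset is added.

```python
def connected_datasets(datasets, links):
    """Keep datasets sharing at least one RDF link with another dataset in the set.

    links is an iterable of (source_dataset, target_dataset) pairs.
    The inclusion rule here is an assumed simplification.
    """
    known = set(datasets)
    connected = set()
    for source, target in links:
        # Ignore self-links and links to datasets outside the set.
        if source != target and source in known and target in known:
            connected.add(source)
            connected.add(target)
    return connected


# Placeholder link data: A links to B, C links to D, E only links to itself.
links = [("A", "B"), ("C", "D"), ("E", "E")]

# Without D, dataset C has no link partner in the set and is dropped.
crawled_only = connected_datasets({"A", "B", "C", "E"}, links)
# Once D is added (e.g. via feedback), C becomes linked as well.
with_feedback = connected_datasets({"A", "B", "C", "D", "E"}, links)
```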

After finishing the diagram, we checked for how many of the 570 datasets the
datahub.io catalog contains meta-information. It turned out that 374
datasets are described in the catalog, while 196 datasets were not yet
described there. For these datasets, we added the meta-information that we
extracted from the crawled data to the catalog using the lodcloud2014
organization [5] in order to keep human- and machine-generated data
separate.
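
As a rough illustration of this check, the split into described and
not-yet-described datasets amounts to a set partition. The function name and
the identifiers "a", "b", "c", "x" below are hypothetical; only the figures
374, 196 and 570 are taken from the text.

```python
def split_by_catalog(cloud_datasets, catalog_ids):
    """Partition cloud datasets into those described in the catalog and those missing."""
    cloud = set(cloud_datasets)
    described = cloud & set(catalog_ids)
    missing = cloud - described
    return described, missing


# Toy example with placeholder identifiers.
described, missing = split_by_catalog({"a", "b", "c"}, {"a", "c", "x"})

# The reported figures are consistent with such a partition.
reported_total = 374 + 196
```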

In parallel to our efforts, Tobias Käfer and Andreas Harth from KIT have
conducted a much larger crawl of the Linked Data web and now offer the
resulting dataset for download [6]. They are currently analyzing their data,
and it will be interesting to see to what extent their results confirm our
findings and how many additional datasets their crawler discovered.

Cheers,

Chris


[1] http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/
[2] http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/
[3] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset?tags=LinkedDataCrawl2014
[3] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset?tags=crawledLinkedDataCloud2014
[4] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/LODCloudDiagram.html
[5] http://datahub.io/organization/lodcloud2014
[6] http://km.aifb.kit.edu/projects/btc-2014/





-----Original Message-----
From: Olaf Hartig [mailto:ohartig@uwaterloo.ca]
Sent: Saturday, 6 September 2014 15:43
To: max@informatik.uni-mannheim.de; chris@informatik.uni-mannheim.de;
heiko@informatik.uni-mannheim.de
Subject: # of datasets in LOD cloud diagram

Hi Max, Chris, Heiko,

According to your web page on the current LOD cloud diagram
(http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/), the diagram
shows 570 datasets. However, the predecessor version, which you created as
part of your ISWC'14 paper, contains (according to your paper, Table 1) 1091
datasets. But on the web page mentioned above, there is suddenly only talk
of 196 datasets that your crawl discovered. How are these numbers to be
understood? Why are not all 1091 datasets included in the current diagram?

Best regards,
Olaf

Received on Monday, 8 September 2014 08:03:46 UTC