Re: # of datasets in LOD cloud diagram

From: Olaf Hartig <ohartig@uwaterloo.ca>
Date: Mon, 8 Sep 2014 09:13:18 -0400
To: <public-lod@w3.org>
CC: Christian Bizer <chris@bizer.de>
Message-ID: <1502546.WMIbcxSXXi@porty2>
Chris, thanks for the explanation!
Olaf


On Monday 08 September 2014 10:03:20 Christian Bizer wrote:
> Hi Olaf,
> 
> you asked about the number of Linked Datasets on the Web reported in our
> paper [1] and for the new LOD cloud diagram [2].
> As the numbers might also confuse other people, I have put the LOD mailing
> list in cc.
> 
> We seeded our crawl with a large number of URIs from the BTC2012 crawl, the
> datahub.io catalog plus some URIs from datasets mentioned on the LOD list.
> 
> Our crawler retrieved RDF data from 1014 data sources [3]. It was blocked
> by 77 Linked Data sources via robots.txt. These two numbers together result
> in the 1091 Linked Datasets that we report as overall number in our paper.
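The robots.txt blocking and the resulting tally can be illustrated with a minimal sketch. It is not the crawler used for the paper: the user-agent string, the example robots.txt texts, and the helper name `is_blocked` are assumptions made up for illustration. Only the counts 1014 and 77 come from the message.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, user_agent: str = "example-ld-crawler") -> bool:
    """Return True if the given robots.txt text forbids user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt files: one that blocks all crawlers, one that allows them.
blocking = "User-agent: *\nDisallow: /\n"
permissive = "User-agent: *\nDisallow:\n"

# Counts reported in the message: sources crawled and sources blocked via robots.txt.
crawled = 1014
blocked = 77
total = crawled + blocked  # overall number of Linked Datasets reported in the paper
```

A source whose robots.txt resembles `blocking` would count toward the 77 blocked sources. One that resembles `permissive` could be crawled.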
> 
> Unfortunately, only 397 of the crawled datasets were linked to each other
> via RDF links (that our crawler discovered), and we thus included only these
> datasets [4] in the "Crawlable LOD Cloud 2014" [5].
> 
> Please note that this does not mean that there are no other crawlable Linked
> Datasets, as we did not do an extensive crawl and our crawler might thus
> have missed some datasets. As our crawler only gathers a data sample from
> each source, it might also have missed some RDF links between datasets.
> 
> We thus asked via the mailing list for pointers to additional datasets that
> we had missed so far and for meta-information about these datasets to be
> entered into the datahub.io catalog. This call resulted in quite a bit of
> feedback, and we drew the LOD cloud 2014 [2] taking this feedback into
> account. The 570 datasets contained in the new version thus include
> 
> 1. datasets that we crawled
> 2. datasets that our crawler discovered but did not crawl due to robots.txt
> 3. additional datasets that resulted from our call for feedback
> 4. additional datasets that became linked by adding the datasets from items
> 2 and 3.
> 
> As with the previous versions of the cloud, we only included datasets that
> are connected to other datasets in the cloud.
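The inclusion rule (keep only datasets connected to other datasets) can be sketched as follows. This is one illustrative reading of the rule, not the authors' code: the function name `connected_datasets` and the dataset names are hypothetical, and links are assumed to be available as (source, target) pairs.

```python
def connected_datasets(links):
    """Return datasets that have at least one RDF link to or from a different dataset."""
    connected = set()
    for source, target in links:
        # A link from a dataset to itself does not connect it to the cloud.
        if source != target:
            connected.add(source)
            connected.add(target)
    return connected


# Hypothetical example: A links to B, C only links to itself, D has no links.
all_datasets = {"A", "B", "C", "D"}
links = [("A", "B"), ("C", "C")]

included = connected_datasets(links) & all_datasets
excluded = all_datasets - included
```

Under this reading, adding a new dataset that links to a previously isolated one brings both into the diagram, which matches item 4 of the list above.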
> 
> After finishing the diagram, we checked how many of the 570 datasets have
> meta-information in the datahub.io catalog, and it turned out that 374
> datasets are described in the catalog. 196 datasets were not yet described
> in the catalog. For these datasets, we added the meta-information that we
> extracted from the crawled data to the catalog using the lodcloud2014
> organization [5] in order to keep human- and machine-generated data separate.
> 
> In parallel to our efforts, Tobias Käfer and Andreas Harth from KIT have
> conducted a much larger crawl of the Linked Data web and now offer the
> resulting dataset for download [6]. They are currently analyzing their data,
> and it will be interesting to see to what extent their results confirm our
> findings and how many additional datasets their crawler discovered.
> 
> Cheers,
> 
> Chris
> 
> 
> [1] http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/
> [2] http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/
> [3] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset?tags=LinkedDataCrawl2014
> [3] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset?tags=crawledLinkedDataCloud2014
> [4] http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/LODCloudDiagram.html
> [5] http://datahub.io/organization/lodcloud2014
> [6] http://km.aifb.kit.edu/projects/btc-2014/
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Olaf Hartig [mailto:ohartig@uwaterloo.ca]
> Sent: Saturday, 6 September 2014 15:43
> To: max@informatik.uni-mannheim.de; chris@informatik.uni-mannheim.de;
> heiko@informatik.uni-mannheim.de
> Subject: # of datasets in LOD cloud diagram
> 
> Hi Max, Chris, Heiko,
> 
> According to your web page on the current LOD cloud diagram
> (http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/), the diagram
> shows 570 datasets. However, the predecessor version, which you produced as
> part of your ISWC'14 paper, contains (according to your paper, Table 1) 1091
> datasets. But the web page mentioned above suddenly speaks of only 196
> datasets that your crawl discovered. How are these numbers to be
> understood? Why are not all 1091 datasets included in the current diagram?
> 
> Best regards,
> Olaf
Received on Monday, 8 September 2014 13:13:53 UTC
