- From: <ahogan@dcc.uchile.cl>
- Date: Fri, 25 Jul 2014 14:14:57 -0400
- To: public-lod@w3.org
On 25/07/2014 06:04, Christian Bizer wrote:
> These problems are also the reason why we ask people on the list to point us at additional data sources, so that we upcoming cloud diagram can be as comprehensive as possible and it would be great if you could also point us at your sites.

Just to mention that we did a similar comparison between what was captured by the BTC 2011 crawl and the 2011 LOD cloud (stats sourced from CKAN, as it was at the time). The comparison is discussed in detail in [1, Section 4].

To summarise the results, the BTC-11 dataset (a "real-world crawl") was not what would be expected based on the statistics claimed by publishers in CKAN/LOD cloud. Granted, the BTC-11 dataset can only sample and was only accessing RDF/XML at the time, but based on the public access logs, we found that the crawl encountered many problems accessing the various datasets in the catalogue: robots.txt, 401s, 502s, bad conneg, 404/dead, etc. The result was that of the 25 largest Linked Datasets claimed in CKAN (from 9 billion down to 93 million triples), BTC-11 could access a sample of more than 1 million triples from only 9 of these. 11/25 yielded zero data, and 13/25 returned fewer than a thousand triples.

It seems similar problems are now being encountered. And seeing them being acknowledged is great!

I don't wish to diminish the great work put into piecing together the previous versions of the LOD Cloud -- which has been the friendly face for the ongoing work in the Linked Data community for quite a number of years -- but in my experience, there has always been a very large gap between what is/was promised by the LOD Cloud we've all been using in our talks and paper introductions and what was available in reality (as an analysis of the publicly accessible crawl logs of the BTC datasets down through the years will attest). As such, I very much welcome this empirically validated version of the cloud and congratulate those involved!

Likewise, I believe that the current version defines a dataset as referring to a given PLD (pay-level domain). I think this is great since it gives a much better sense of the diversity of Linked Data publishing, rather than having 20 mini-bubbles referring to one site. It also provides a clear technical definition of a dataset (as opposed to something that was added to a catalogue).

However, I agree with Sarven that this may require some thinking on how best to represent the LOD cloud now. For example, the representation of statusnet-platform datasets may need some consideration in the current draft, maybe in the form of a sub-cloud or cluster.

Finally, I wanted to raise one other troubling observation re: the LOD Cloud, which was that *only one new dataset was added to the LOD Cloud group in datahub over a period of twelve months* [2, 3]. Jerven just added the first dataset in 8 months, presumably due to this ongoing discussion. One can scroll back in time arbitrarily far in the activity log of the group to see precisely how much activity there has (not) been [3] (e.g., Ctrl+F for "created").

It's not comfortable reading, but I think that we, as a community, should seriously ask ourselves: why has there been so little new activity?

Best,
Aidan

[1] Tobias Käfer, Jürgen Umbrich, Aidan Hogan, Axel Polleres. "Towards a Dynamic Linked Data Observatory". LDOW 2012. - http://aidanhogan.com/docs/dyldo_ldow12.pdf
[2] Aidan Hogan, Claudio Gutierrez. "Paths towards the Sustainable Consumption of Semantic Data on the Web". In the Proceedings of the Alberto Mendelzon Workshop (AMW), Cartagena, Colombia, 4–6 June, 2014. - http://aidanhogan.com/docs/amw_2014.pdf
[3] http://datahub.io/group/activity/lodcloud/0

Received on Friday, 25 July 2014 18:15:23 UTC