Re: Updated LOD Cloud Diagram - Missed data sources. from ahogan@dcc.uchile.cl on 2014-07-25 (public-lod@w3.org from July 2014)

From: <ahogan@dcc.uchile.cl>
Date: Fri, 25 Jul 2014 14:14:57 -0400
To: public-lod@w3.org
Message-ID: <7dceb8d0a74dee064ec109ac3af37cd8.squirrel@webmail.dcc.uchile.cl>
On 25/07/2014 06:04, Christian Bizer wrote:
> These problems are also the reason why we ask people on the list to
> point us at additional data sources, so that the upcoming cloud diagram
> can be as comprehensive as possible and it would be great if you could
> also point us at your sites.

Just to mention that we did a similar comparison between what was captured
by the BTC 2011 crawl and the 2011 LOD cloud (stats sourced from CKAN, as
it was at the time). The comparison is discussed in detail in [1, Section
4].

To summarise the results, the BTC-11 dataset (a "real-world crawl") was
not what would be expected based on the statistics claimed by publishers
in CKAN/LOD cloud. Granted, the BTC-11 crawl could only sample and was only
accessing RDF/XML at the time, but based on the public access logs, we
found that the crawl encountered many problems accessing the various
datasets in the catalogue: robots.txt, 401s, 502s, bad conneg, 404/dead,
etc.
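
To make those failure categories concrete, here is a minimal sketch of how
crawl attempts could be tallied. It is not the code used for BTC-11 or for
[1]; the input tuples (status, content type, robots flag) and the names
classify and tally are hypothetical, chosen only for illustration.

```python
from collections import Counter

RDF_XML = "application/rdf+xml"  # the only format the crawl accepted


def classify(status, content_type, robots_blocked=False):
    """Map one (hypothetical) crawl attempt to a failure category."""
    if robots_blocked:
        return "robots.txt"
    if status is None or status == 404:
        # No response at all, or Not Found: treat both as dead.
        return "404/dead"
    if status in (401, 502):
        return str(status)
    if 200 <= status < 300:
        # Strip parameters such as "; charset=utf-8" before comparing.
        mime = (content_type or "").split(";")[0].strip().lower()
        return "ok" if mime == RDF_XML else "bad conneg"
    return "other"


def tally(attempts):
    """Count categories over (status, content_type, robots_blocked) tuples."""
    return Counter(classify(*attempt) for attempt in attempts)
```

Running tally over the attempts for a single dataset gives a quick picture
of why it yielded few or no triples.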

The result was that of the 25 largest Linked Datasets claimed in CKAN
(from 9 billion down to 93 million triples), BTC-11 could access a sample
of more than 1 million triples from only 9. Of the 25, 11 yielded no data
at all and 13 returned fewer than a thousand triples.

It seems similar problems are now being encountered. And seeing them being
acknowledged is great!


I don't wish to diminish the great work put into piecing together the
previous versions of the LOD Cloud -- which has been the friendly face of
the ongoing work in the Linked Data community for a good number of
years -- but in my experience, there has always been a very large
gap between what is/was promised by the LOD Cloud we've all been using in
our talks and paper introductions and what was available in reality (as an
analysis of the publicly accessible crawl logs of the BTC datasets down
through the years will attest).

As such, I very much welcome this empirically validated version of the
cloud and congratulate those involved!

Likewise, I believe that the current version uses a definition of a dataset
as referring to a given PLD (pay-level domain). I think this is great since
it gives a much
better sense of the diversity of Linked Data publishing, rather than
having 20 mini-bubbles referring to one site. Also it provides a clear
technical definition of a dataset (as opposed to something that was added
to a catalogue). However, I agree with Sarven that this may require some
thinking on how best to represent the LOD cloud now. For example, the
representation of statusnet-platform datasets may need some consideration
in the current draft, maybe in the form of a sub-cloud or cluster.
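
As a rough illustration of what "one dataset per PLD" means in practice,
the following sketch groups URLs by pay-level domain. It is only a sketch
under stated assumptions, not the method used for the cloud diagram: the
suffix set below is a tiny hypothetical stand-in for the full Public
Suffix List, and the function names are made up for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny stand-in for the Public Suffix List.
# A real analysis would need the complete list.
MULTI_LABEL_SUFFIXES = {"co.uk", "ac.uk", "org.uk", "com.au"}


def pay_level_domain(url):
    """Return the pay-level domain of a URL (naive; ignores IP hosts)."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    if len(labels) < 2:
        return host
    last_two = ".".join(labels[-2:])
    if last_two in MULTI_LABEL_SUFFIXES and len(labels) >= 3:
        # e.g. data.bbc.co.uk -> bbc.co.uk
        return ".".join(labels[-3:])
    return last_two


def group_by_pld(urls):
    """Group URLs into one bucket per pay-level domain."""
    groups = defaultdict(list)
    for url in urls:
        groups[pay_level_domain(url)].append(url)
    return dict(groups)
```

Under this definition, several catalogue entries hosted on subdomains of
one site collapse into a single bucket, while sites that merely share the
same software remain separate datasets.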



Finally I wanted to raise one other troubling observation re: the LOD
Cloud, which was that *only one new dataset was added to the LOD Cloud
group in datahub over a period of twelve months* [2,3]. Jerven just added
the first dataset in 8 months, presumably due to this ongoing discussion.
One can scroll back in time arbitrarily far in the activity log of the
group to see precisely how much activity there has (not) been [3] (e.g.,
Ctrl+F for "created"). It's not comfortable reading, but I think that we,
as a community, should seriously ask ourselves: why has there been so
little new activity?

Best,
Aidan



[1] Tobias Käfer, Jürgen Umbrich, Aidan Hogan, Axel Polleres. "Towards a
Dynamic Linked Data Observatory". LDOW 2012.
 - http://aidanhogan.com/docs/dyldo_ldow12.pdf

[2] Aidan Hogan, Claudio Gutierrez. "Paths towards the Sustainable
Consumption of Semantic Data on the Web". In the Proceedings of the
Alberto Mendelzon Workshop (AMW), Cartagena, Colombia, 4–6 June, 2014.
 - http://aidanhogan.com/docs/amw_2014.pdf

[3] http://datahub.io/group/activity/lodcloud/0
Received on Friday, 25 July 2014 18:15:23 UTC