Re: Updated LOD Cloud Diagram - what is the message? from Christian Bizer on 2014-08-18 (public-lod@w3.org from August 2014)

From: Christian Bizer <chris@bizer.de>
Date: Mon, 18 Aug 2014 11:06:50 +0200
To: "'Giovanni Tummarello'" <g.tummarello@gmail.com>
Cc: "'Linking Open Data'" <public-lod@w3.org>
Message-ID: <15a901cfbac3$bef8a8f0$3ce9fad0$@bizer.de>
Hi Giovanni and all,

 

Our goals in creating the new diagram were mostly empirical: we wanted to know for ourselves, and to report, how the Linked Open Data cloud has evolved since 2011.

 

So we don’t plan to push a specific message with the diagram. But I agree with you that the release of the diagram could be a good occasion for the community to discuss the possible messages/conclusions that one could draw from it, and I would be happy if more people commented on this.

 

My two cents to this discussion:

 

I think that it is hard to draw a single conclusion from the diagram; the conclusion depends on the specific requirements of each data consumer.

 

If you build an application that requires DBpedia/YAGO/Freebase/UMBEL/Cyc-style general knowledge about entities, or an application that requires geographic, life science, or linguistic data, the datasets can be quite useful for you, and the fact that they are partly interlinked can save you quite some work, as you need to invest less effort into integrating them yourself.

 

On the other hand, if you expect complete coverage of all datasets that are relevant to your domain of interest, or perfect data quality and currency, the Web of Linked Data obviously does not deliver this yet, and the question is of course whether it will deliver this in the future.

 

Personally, I think it is quite interesting to compare the deployment of Microdata/RDFa/Microformats and Linked Data on the Web. We also investigated the deployment of Microdata/RDFa/Microformats [1][2], and the comparison currently looks like this:

 

1.       The overall number of websites publishing Microdata/RDFa/Microformats is three orders of magnitude larger than the number of websites publishing Linked Data.

2.       Topic-wise, Microdata/RDFa/Microformats markup covers products, reviews, businesses, addresses, events, people, job postings and recipes, while Linked Data covers much more specific data from domains such as e-government, libraries, life science, linguistics or geography. So there is not too much overlap between the data that is published using the two technologies.

3.       In the context of Microdata/RDFa/Microformats, data providers do not set links pointing at data items in other datasets. In the Linked Data context, data providers do set such links to a certain extent. Not setting links of course reduces the effort required from data publishers (you just need to add some semantic markup to the PHP template that renders your website and you are done). On the other hand, without such links, using the data within applications is much more painful. For an example of how much effort it took to integrate some Microdata describing products from different websites, see [3] (we needed sophisticated information extraction techniques to generate features from the product names and descriptions, and then sophisticated identity resolution techniques to guess which descriptions refer to the same product).

4.       The Microdata/RDFa/Microformats data is very shallow, with usually only 3 or 4 attributes used to describe an entity and the most interesting semantics provided only as free text (long product or job descriptions). In contrast, the data that is published as Linked Data is often much more structured (e-government, life science data, general-purpose KBs), its entities are described with more attributes (which have reasonably well-defined semantics), and it is thus likely to enable more sophisticated applications.
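
As an illustration of point 3, here is a minimal Python sketch of the difference between consuming data with and without links. Everything in it is hypothetical: the example.org-style URIs, the product names, the token-overlap (Jaccard) measure and the 0.5 threshold are assumptions made for this sketch and are not the techniques or data used in [3].

```python
# Hypothetical Linked Data case: publishers already set owl:sameAs-style links.
SAME_AS_LINKS = [
    ("http://shop-a.example/item/1", "http://shop-b.example/p/42"),
]

# Hypothetical markup case: no links, only free-text product names.
MARKUP_ITEMS = {
    "http://shop-a.example/item/1": "Apple iPhone 5s 16GB Space Gray",
    "http://shop-b.example/p/42": "iPhone 5s (16 GB, space gray) by Apple",
    "http://shop-c.example/x/7": "Samsung Galaxy S5 16GB Black",
}


def matches_from_links(links):
    """With explicit links, integration is a lookup: each link is a match."""
    return {frozenset(pair) for pair in links}


def tokens(name):
    """Lower-case the name, replace punctuation by spaces, return the token set."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return set(cleaned.split())


def jaccard(a, b):
    """Share of tokens that two token sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matches_from_names(items, threshold=0.5):
    """Without links, the consumer has to guess matches from name similarity."""
    uris = sorted(items)
    found = set()
    for i, left in enumerate(uris):
        for right in uris[i + 1:]:
            if jaccard(tokens(items[left]), tokens(items[right])) >= threshold:
                found.add(frozenset((left, right)))
    return found


print(matches_from_links(SAME_AS_LINKS))
print(matches_from_names(MARKUP_ITEMS))
```

Both calls find the same pair in this toy data, but the second result depends on the chosen threshold (raising it to 0.6 loses the match), which hints at why guessing identity from text is the more painful route.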

 

Looking at this comparison, I think the empirical results nicely reflect the strengths of both technologies. Microdata/RDFa/Microformats aim at being a simple technology for annotating webpages that demands very little effort from webmasters in order to achieve widespread deployment (Guha made this point rather clear in his LDOW2014 keynote [4]).

Linked Data, on the other hand, is a technology for sharing the data integration effort between data publishers and data consumers (the more effort publishers put into setting RDF links, the easier it becomes for data consumers to use the data).

 

Thus, it makes sense that we see Linked Data adoption within communities that have an interest in making their data easy to use and thus are willing to invest effort into this, like libraries, government and science (with life science and language processing being the first communities adopting the technologies) and social networking.

 

On the other hand, it makes sense that we see widespread adoption of Microdata/RDFa/Microformats by communities that mostly want to push their data into Google applications in order to get more traffic/turnover for their sites/businesses and are thus not interested in linking to others (which are also likely their competitors).

 

Concerning your question of who published the datasets in the cloud (the data producers themselves or some third parties like interested hackers and other data enthusiasts), we did not investigate this in detail and I would be very happy if somebody else would do this. But my general feeling is that, compared to 2011, more datasets are published by the actual data producers or parties close to them (for instance in the domains of e-government, libraries, or cross-domain knowledge bases).

 

These are my two cents to the overall discussion and I would be very happy to hear what others think about the message that can be drawn from the new diagram.

 

Cheers,

 

Chris

 

 

[1] http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf

[2] http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Meusel-etal-TheWDCMicrodataRdfaMicroformatsDataSeries-ISWC2014-rbds.pdf

[3] http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/petrovski_bryl_bizer_deos2014.pdf

[4] http://events.linkeddata.org/ldow2014/slides/ldow2014_keynote_guha_schema_org.pdf

 

 

 

 

From: Giovanni Tummarello [mailto:g.tummarello@gmail.com]
Sent: Sunday, 17 August 2014 16:43
To: Christian Bizer
Cc: Linking Open Data
Subject: Updated LOD Cloud Diagram - what is the message?

 

Chris hi, 

 

I would be interested in discussing what message will accompany this new version.

 

If I am not wrong, there appear to be more bubbles than "last time here", so I wonder: is the message that's going out with this diagram that "adoption has increased" (e.g. as there were 200 and now there are 500)?

 

If so, I do wonder if that is not misleading, based on this diagram alone.

 

For example how many of these are published by independent individuals or organizations (some IP technique might be handy here also)?  

 

That statusnet, gov.uk, bio2rdf etc. have gone a bit more industrial and published plenty of datasets is good, but is that significant in evaluating a general data publishing technology?

 

More interesting would be: how many of these are from private companies, not in the context of publicly funded research projects? Are there many that are just created by "hackers" or students just making a point?

 

So many of the old datasets seem to have disappeared, what happened to them?

 

Are those that stayed really alive and used? (I see http://revyu.com, whose biggest tag is "good beers from 2007", the year when it was used by people at the Banff conference)

Is the usage really significant? (I see "apache" "o'reilly" - really?)

 

So. bottom line. 

 

Sure one can say "hey we gave a definition and we're following it to create this diagram, everything else is out of the question".  

 

.. and sure, it doesn't have to be YOU answering all those questions above (I guess your list of sites is public for others to investigate?).

 

I would however think it important that the message sent with this new diagram did its best to avoid being possibly misleading :) 

 

What are your thoughts?

Gio

 

 

 

On Fri, Aug 15, 2014 at 9:07 AM, Christian Bizer <chris@bizer.de> wrote:

Hi all,

on July 24th, we published a Linked Open Data (LOD) Cloud diagram containing
"crawlable" linked datasets and asked the community to point us at further
datasets that our crawler has missed [1].

Lots of thanks to everybody that did respond to our call and did enter
missing datasets into the DataHub catalog [2].

Based on your feedback, we have now drawn a draft version of the LOD cloud
containing:
1.      the datasets that our crawler discovered
2.      the datasets that did not allow crawling
3.      the datasets you pointed us at.

The new version of the cloud altogether contains 558 linked datasets which
are connected by altogether 2883 link sets. As we were pointed at quite a
number of linguistic datasets [3], we added linguistic data as a new
category to the diagram.

The current draft version of the LOD Cloud diagram is found at:

http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/extendedLODCloud/extendedCloud.png

Please note that we only included datasets that are accessible via
dereferenceable URIs and are interlinked with other datasets.
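
The inclusion rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog entries and field names are hypothetical, and treating a dataset as "interlinked" when it has at least one incoming or outgoing link to another dataset is an assumption of this sketch, not a description of the actual crawler.

```python
# Hypothetical catalog entries; names and fields are made up for illustration.
CATALOG = [
    {"name": "dataset-a", "dereferenceable": True, "links_to": ["dataset-b"]},
    {"name": "dataset-b", "dereferenceable": True, "links_to": []},
    {"name": "dataset-c", "dereferenceable": True, "links_to": []},
    {"name": "dataset-d", "dereferenceable": False, "links_to": ["dataset-a"]},
]


def included_datasets(catalog):
    """Return names of datasets that are dereferenceable and interlinked."""
    dereferenceable = {d["name"] for d in catalog if d["dereferenceable"]}
    interlinked = set()
    for dataset in catalog:
        for target in dataset["links_to"]:
            if target != dataset["name"]:  # ignore links to itself
                interlinked.add(dataset["name"])  # has an outgoing link
                interlinked.add(target)  # has an incoming link
    return sorted(dereferenceable & interlinked)


print(included_datasets(CATALOG))
```

In this toy catalog, dataset-c drops out because nothing links to or from it, and dataset-d drops out because its URIs are not dereferenceable.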

It would be great if you could check if we correctly included your datasets
into the diagram and whether we missed some link sets pointing from your
datasets to other datasets.

If we did miss something, it would be great if you could point us at what we
have missed and update your entry in the DataHub catalog [2] accordingly.

Please send us feedback until August 20th. Afterwards, we will finalize the
diagram and publish the final August 2014 version.

Cheers,

Chris, Max and Heiko

--
Prof. Dr. Christian Bizer
Data and Web Science Research Group
Universität Mannheim, Germany 
chris@informatik.uni-mannheim.de
www.bizer.de
Received on Monday, 18 August 2014 09:07:14 UTC