Re: Updated LOD Cloud Diagram - what is the message? from Giovanni Tummarello on 2014-08-18 (public-lod@w3.org from August 2014)

From: Giovanni Tummarello <g.tummarello@gmail.com>
Date: Mon, 18 Aug 2014 12:52:05 +0200
To: Christian Bizer <chris@bizer.de>
Cc: Linking Open Data <public-lod@w3.org>
Message-ID: <CAHHRs7jMpcDt-gf+L-TgDcKK2rqByvXXYkQpZeyc=b_xUKUDvg@mail.gmail.com>
Hi Chris, this is interesting, and it's great that you're also looking at the
world of marked-up data.

My 2c, briefly:



> If you build an application that requires
> DBpedia/YAGO/Freebase/UMBEL/Cyc-style general knowledge about entities, or
> you build an application that requires geographic, life science, or
> linguistic data, the datasets can be quite useful for you and the fact that
> they are partly interlinked can save you quite some work as you need to
> invest less effort into integrating them yourself.
>
>
Sure, under the assumption that the interlinks (which are provided as best
effort by the producers) are of good enough quality for your
application. As we know, the quality required depends strongly on the
application: some applications need more precision, others more recall, etc.
What is there certainly provides a starting point, however, courtesy again
of the best efforts of those few.
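
To make the precision/recall point concrete, here is a minimal sketch in
Python. The link pairs and the `link_quality` helper are made up for
illustration and do not come from any of the datasets discussed; it also
assumes a trusted reference set of links exists, which in practice is often
the hard part.

```python
def link_quality(proposed, gold):
    """Return (precision, recall) of proposed links against a reference set.

    Both arguments are iterables of (source_uri, target_uri) pairs.
    """
    proposed, gold = set(proposed), set(gold)
    correct = proposed & gold
    # Precision: share of the proposed links that are actually right.
    precision = len(correct) / len(proposed) if proposed else 0.0
    # Recall: share of the true links that were actually proposed.
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical identity links between two datasets "a" and "b".
gold = {
    ("a:Berlin", "b:Berlin"),
    ("a:Paris", "b:Paris"),
    ("a:Rome", "b:Rome"),
    ("a:Oslo", "b:Oslo"),
}
proposed = {
    ("a:Berlin", "b:Berlin"),
    ("a:Paris", "b:Paris"),
    ("a:Rome", "b:Roma_TX"),  # a wrong link published "as best effort"
}

precision, recall = link_quality(proposed, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

An application that merges records automatically would care most about the
first number, while one that only suggests candidates to a human could
tolerate lower precision in exchange for higher recall.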

It is to be asked how linked data (dereferenceable URIs etc.) really helps
in fostering the quality of such interlinkage, e.g. do people really have
mechanisms in place that resolve such URIs to check the entity on the other
end, or do they just download the dataset, convert it to something way
flatter, use disambiguation/interlinking processes and then publish back?

...  but in fairness, the fact that you can look at a single "Record" and
somehow see as a human that it has a link to another dataset is per se
likely to have some positive effect on the willingness of people to indeed
go and do such interlinking.


> Personally, I think it is quite interesting to compare the deployment of
> Microdata/RDFa/Microformats and Linked Data on the Web. We also
> investigated the deployment of Microdata/RDFa/Microformats  [1][2] and the
> comparison currently looks like this:
>

>
> 1.       The overall number of websites publishing
> Microdata/RDFa/Microformats is three orders of magnitude larger than the
> number of websites publishing Linked Data.
>
> 2.       Topic wise, Microdata/RDFa/Microformats markup covers products,
> reviews, businesses, addresses, events, people, job postings and recipes.
> While Linked Data covers much more specific data from domains such as
> e-government, libraries, life science, linguistics or geography. So there
> is not too much overlap between the data that is published using the two
> technologies.
>
> 3.       In the context of Microdata/RDFa/Microformats, data providers do
> not set links pointing at data items in other datasets. In the Linked Data
> context, data providers do set such links to a certain extent. Not setting
> links of course reduces the effort required for data publishers (you just
> need to add some semantic markup to the PHP template that renders your
> website and you are done). On the other hand without such links, using the
> data within applications is much more painful. For an example on how much
> effort it took to integrate some Microdata describing products from
> different websites, see [3] (we needed sophisticated information extraction
> techniques to generate features from the product names and descriptions and
> then sophisticated identity resolution techniques to guess which
> descriptions refer to the same product).
>
> 4.       The Microdata/RDFa/Microformats are very shallow with usually
> only 3 or 4 attributes used to describe an entity and most interesting
> semantics only provided as free text (long product or job descriptions as
> text). In contrast, the data that is published as Linked Data is often much
> more structured (e-government, life science data, general-purpose KBs) and
> entities are described with more attributes (having kind of well-defined
> semantics) and is thus likely to enable more sophisticated applications.
>
>
>
> Looking at this comparison, I think the empirical results nicely reflect
> the strengths of both technologies. Microdata/RDFa/Microformats aim at
> being
>


This is quite interesting, but isn't this conclusion neglecting a huge fact?

How many people who professionally work with "e-government, libraries,
life science, linguistics or geography" use linked data technology formats
vs. other formats that are of relevance in that world? Could it be again
around 3 orders of magnitude?

Let's take the simplest format, CSV.
How would your answers 1, 2, 3 and 4 look with CSV included?

Wouldn't we say that 2 orders of magnitude more datasets are published in
CSV, that they are also much more complete than those of microdata/microformats,
definitely not less complete than those published in RDF, and that they might
or might not include identifiers that can link them to others?



> a simple technology for annotating webpages that puts very little effort
> on webmasters in order to find wide-spread deployment (Guha made this point
> rather clear in his LDOW2014 keynote [4]).
>



>  Linked Data on the other hand is a technology for sharing the data
> integration effort between data publishers and data consumers (the more
> effort publishers put into setting RDF links, the easier it becomes for
> data consumers to use the data).
>


This is the core point of what we should be "demonstrating": whether "linked
data" as it is does indeed work, or whether we should be "working on" it, e.g.
improving the standard to increase adoption.

So, I think we might want to compare linked data effectiveness and adoption
vs. the other ways in which, in these very fields and by these very people,
data has traditionally been shared and consumed.

It would be more relevant than comparing with microformats/microdata, which
as you say start with a different idea in mind.



>
>
> Thus, it makes sense that we see Linked Data adoption within communities
> that have an interest in making their data easy to use and thus are willing
> to invest effort into this, like libraries, government and science (with
> life science and language processing being the first communities adopting
> the technologies) and social networking.
>


It might make sense to see, but it would have to be proven/compared against
their previous standard, as above.



>
>
> Concerning your questions who did publish the datasets in the cloud (the
> data producers themselves or some third parties like interested hackers and
> other data enthusiasts), we did not investigate this in detail and I would
> be very happy if somebody else would do this. But my general feeling is
> that compared to 2011 more datasets are published by the actual data
> producers or parties close to them (for instance in the domains of
> e-government, libraries, or cross-domain knowledge bases).
>
>

Sure, there certainly was some amount of successful selling, especially
in public bodies. Good thing or red herring..?


>
>
> These are my two cents to the overall discussion and I would be very happy
> to hear what others think about the message that can be drawn from the new
> diagram.
>
>
>


Thanks again for the effort and your reply!
Gio
Received on Monday, 18 August 2014 10:52:53 UTC