Re: Next version of the LOD cloud diagram. Please provide input, so that your dataset is included. from Alan Ruttenberg on 2010-09-05 (public-lod@w3.org from September 2010)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Sun, 5 Sep 2010 11:00:55 -0400
To: Chris Bizer <chris@bizer.de>
Cc: Anja Jentzsch <anja@anjeve.de>, public-lod@w3.org, Leigh Dodds <leigh.dodds@talis.com>, Jonathan Gray <jonathan.gray@okfn.org>, info@okfn.org
Message-ID: <AANLkTin8_X8UYeGQ13zBJvoWU4EZMvh1Q9cAQN30RTka@mail.gmail.com>
On Sun, Sep 5, 2010 at 5:08 AM, Chris Bizer <chris@bizer.de> wrote:
> Hi Alan,
>
>> I have just spent some time evaluating one source and reported to you
>> the result. Perhaps you might act on this investment in time and thank
>> me for doing so. You might find that the result was myself and more
>> people doing such quality control.
>
> Sorry that my reply yesterday might have been a bit too harsh.
>
> I have looked up the CAS license (http://www.cas.org/legal/infopolicy.html)
> and added a reference to the description of the CAS dataset at
>
> http://ckan.net/package/bio2rdf-cas
>
> Please also note that CKAN provides a rating function for the datasets and
> also provides for commenting and discussing the datasets.
>
> Maybe people could use these features as a start to collect quality-related
> meta-information about the datasets.
>
> CKAN also provides a link to the http://www.isitopendata.org/ service, which
> might be used for license inquiries.

Dear Chris,

As I said, the first line on the CKAN home page says: "CKAN is a
registry of open data and content packages." Therefore I think there
is a reasonable expectation that the packages registered there are
open. I maintain that CKAN should either change how it explains itself
to make clear that it is a registry of packages that may or may not be
open, or it should remove the packages that are not known to be open.
I'm not taking a position one way or another which they should do
(that's their business), but they should say what they do, and do what
they say.

Thank you for your pointers to further information on how to find
licenses. I'm fairly familiar with this area given that I work for
Creative Commons.

> I agree with you that the quality of Linked Data published on the Web is
> crucial, but we also have to take into account that much of the data in the
> LOD cloud is currently still published by research projects in order to
> demonstrate the technologies.
>
> As the Web of Data is evolving and more and more actual owners of the
> datasets start to provide them as Linked Data, I hope that the quality will
> also increase and the datasets will be kept current. Encouraging
> developments in this direction are currently happening in the libraries,
> eGovernment, and eCommerce domains.

I agree that these are good examples. I would suggest that you focus
on including the good examples in the LOD cloud, or at a minimum
remove those, like CAS, that fall below the minimal standard of
supplying *some* data and being *open*, so that "linked open data"
means something coherent.

> On the other hand, the Web is an open system and we will thus always see
> people publishing low-quality, wrong and misleading data. Google handles
> this fact rather successfully using PageRank. As the Web of Data provides
> more structure than the classic Web, I think we might even be able to apply
> more sophisticated data-quality assessment heuristics to decide which data
> we want to use in our applications and which to ignore. Some of these
> methods are listed in [1].

Look, Chris, I just did a "manual page rank" on the CAS dataset. The
dataset is meaningless. That is a high-quality assessment. If the movement
can't act on known good-quality information, I (and others) will doubt
that automatic algorithms will be credible.

Moreover, the LOD cloud diagram is an advertisement. There are enough
data sets now that inclusion in the diagram can become a reward for
good work. It's not good advertising for Google when junk sites come
up at the top of search results and they do their best to minimize
this occurrence. The LOD cloud is your front page, and to a certain
extent mine as well, since I invest all my time in work towards
building the web of data in the Sciences.

Regards,
Alan

>
> Best,
>
> Chris
>
> [1] Christian Bizer, Richard Cyganiak: Quality-driven information filtering
> using the WIQA policy framework. Journal of Web Semantics: Science, Services
> and Agents on the World Wide Web, Volume 7, Issue 1, January 2009, Pages
> 1-10.
> http://dx.doi.org/10.1016/j.websem.2008.02.005
>
>
> -----Original Message-----
> From: Alan Ruttenberg [mailto:alanruttenberg@gmail.com]
> Sent: Saturday, 4 September 2010 22:20
> To: Chris Bizer
> Cc: Anja Jentzsch; public-lod@w3.org; Leigh Dodds; Jonathan Gray
> Subject: Re: Next version of the LOD cloud diagram. Please provide input, so
> that your dataset is included.
>
> On Sat, Sep 4, 2010 at 3:43 PM, Chris Bizer <chris@bizer.de> wrote:
>> So rather than to criticize the work that other people do on collecting
>> meta-information about the datasets in the LOD cloud
>
> Did you read what I wrote? I made no comment on the adequacy of
> metainformation. In fact I *used* that metainformation to point out
> that the data source in question did not satisfy the "open" provision
> of linked *open* data. In addition I criticized the *inclusion* of the
> data set in the *lod cloud diagram* because of this lack of openness
> and because the actual content of that resource didn't resemble any
> data in the resource that it was derived from (a registry of
> information about chemical compounds), suggesting that it would hurt
> the LOD effort as inclusion would be a kind of "false advertising".
>
> -Alan
>
>
Received on Sunday, 5 September 2010 15:01:45 UTC