AW: AW: ANN: LOD Cloud - Statistics and compliance with best practices

From: Chris Bizer <chris@bizer.de>
Date: Thu, 21 Oct 2010 13:14:08 +0200
To: "'Martin Hepp'" <martin.hepp@ebusiness-unibw.org>, "'Thomas Steiner'" <tsteiner@google.com>
Cc: "'Semantic Web'" <semantic-web@w3.org>, <public-lod@w3.org>, "'Anja Jentzsch'" <anja@anjeve.de>, <semanticweb@yahoogroups.com>, "'Kingsley Idehen'" <kidehen@openlinksw.com>
Message-ID: <00cd01cb7111$14caf860$3e60e920$@bizer.de>
Hi Martin, Thomas and Kingsley,

> >> First, I think it is pretty funny that you list Denny's April's fool
> >> dataset of creating triples for numbers as an acceptable part of the cloud,

Why? 

As I said, we include all datasets that fulfill the minimal technical
requirements. Denny's dataset does, so it is included. The same would of
course be true for BestBuy and other GoodRelations datasets if they were
connected by RDF links to other datasets in the cloud.
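
To make the link requirement concrete, here is a minimal Python sketch of how one might test it over a list of triples. It rests on assumptions that are not part of the requirements themselves: a dataset is identified by the host name of its URIs, a triple counts as an RDF link when subject and object are URIs on different hosts, and the names host, rdf_links and qualifies are made up for illustration. Only the threshold of 50 comes from the requirements.

```python
from collections import Counter
from urllib.parse import urlparse

MIN_LINKS = 50  # threshold stated in the inclusion requirements


def host(uri):
    """Return the host part of a URI (assumption: one host = one dataset)."""
    return urlparse(uri).netloc


def rdf_links(triples):
    """Yield (source_host, target_host) for every triple whose subject and
    object are URIs on different hosts. Literals and blank nodes are skipped."""
    for s, _p, o in triples:
        if not o.startswith(("http://", "https://")):
            continue  # object is a literal or blank node, not a link
        src, dst = host(s), host(o)
        if src and dst and src != dst:
            yield src, dst


def qualifies(dataset_host, triples, min_links=MIN_LINKS):
    """True if the dataset sets at least min_links outbound links in total,
    or if at least one other dataset sets min_links links pointing at it."""
    outbound = 0
    inbound = Counter()
    for src, dst in rdf_links(triples):
        if src == dataset_host:
            outbound += 1
        elif dst == dataset_host:
            inbound[src] += 1
    return outbound >= min_links or any(n >= min_links for n in inbound.values())
```

Whether "50 links" should be counted per target dataset or in total is not spelled out in the requirements; the sketch counts outbound links in total and inbound links per source dataset.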

> >> The fundamental mistake of what you say is that linked open e-commerce
> >> data is not "a dataset" but a wealth of smaller datasets. Asking me to create
> >> CKAN entries for each store or business in the world that provides
> >> GoodRelations data is as if Google was asking any site owner in the world to
> >> register his or her site manually via CKAN.
> >>
> >> That is 1990s style and does not have anything to do with a "Web" of data.

I agree with you that it would be much better if somebody set up a
crawler, properly crawled the Web of Data and then provided a catalog of all
datasets. As long as nobody does this, I think it is useful to have the
manually maintained CKAN catalog as a first step.

An interesting step in this direction is the profiling work done by Felix
Naumann's group for the BTC dataset. See
http://www.cs.vu.nl/~pmika/swc/submissions/swc2010_submission_3.pdf

> >> Is HTML + RDFa with hash fragments, available via HTTP GET
> >> "dereferencable" for you? E.g.

Absolutely!
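
For context on why a hash URI served as HTML + RDFa counts as dereferenceable: an HTTP client never sends the fragment to the server, so it fetches the enclosing document and then looks up the fragment in the data it gets back. A minimal Python sketch using only the standard library; the example.org URIs and the helper name group_by_document are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_document(uris):
    """Map each document URL (the URI without its fragment) to the
    fragment identifiers that a client would look up inside it."""
    documents = defaultdict(list)
    for uri in uris:
        url, fragment = urldefrag(uri)
        documents[url].append(fragment)
    return dict(documents)


# Hypothetical hash URIs; both identify things described in the same page.
uris = [
    "http://example.org/shop/#store",
    "http://example.org/shop/#offer1",
]
print(group_by_document(uris))
```

Both entities are retrieved with a single HTTP GET on the page itself, which is why many small things can be described in one RDFa document.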

> >> To be frank, I think the bubbles diagram fundamentally misses the point in
> >> the sense that the power of linked data is in integrating a huge amount of
> >> small, specific data sources, and not in linking a manually maintained blend of
> >> ca. 100 monolithic datasets.

Valid point. I agree with you that the power of the Linked Data
architecture is that it provides for building a single global dataspace,
which of course may contain small as well as big data sources.

The goal of the LOD diagram is not to visualize every small chunk of RDF on
the Web, as this would be impossible for obvious reasons, including the
size of your screen.

We restrict the diagram to bigger datasets, hoping that these may be
especially relevant to data consumers.

Of course, you may disagree with this restriction.

From Thomas:
> > How about handling GoodRelations the same way as FOAF, representing it
> > as a somewhat existing bubble without exactly specifying where it
> > links to and from where inbound links come from

We also don't do this for FOAF anymore in the new version of the diagram.

From Thomas:
> > In the end, the idea of a Web catalogue was mostly abandoned at some
> > point due to being unmanageable, maybe the same happens to the Web
> > /data/ "catalogue", aka. LOD cloud (the metaphor doesn't work
> > perfectly, but you get the point).

Yes. But I personally think that the Yahoo catalog was rather useful in the
early days of the Web.

In the same way, I think that the CKAN catalog is rather useful at the
current stage of development of the Web of Data, and I'm looking forward to
the time when the Web of Data has grown to a point where such a catalog
becomes unmanageable.

But again: I agree that crawling the Web of Data and then deriving a dataset
catalog as well as meta-data about the datasets directly from the crawled
data would be clearly preferable and would also scale way better.
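
As a rough illustration of what deriving a catalog from crawled data could look like, here is a minimal Python sketch. It is not an existing tool: it assumes the crawl is already available as (subject, predicate, object) string triples, it treats each host name as one dataset, and the function name build_catalog is made up.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_catalog(triples):
    """Group crawled triples by the host of their subject and record, per
    host, the number of triples and the number of links to each other host."""
    catalog = defaultdict(lambda: {"triples": 0, "links_to": defaultdict(int)})
    for s, _p, o in triples:
        src = urlparse(s).netloc
        if not src:
            continue  # skip blank nodes and malformed subjects
        entry = catalog[src]
        entry["triples"] += 1
        # Only URI objects can be links; literals have no host.
        dst = urlparse(o).netloc if o.startswith(("http://", "https://")) else ""
        if dst and dst != src:
            entry["links_to"][dst] += 1
    return catalog
```

A real crawler would also need politeness rules, RDFa and RDF/XML parsing, and a better notion of dataset boundaries than host names; the sketch only covers the aggregation step.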

Thus: Could somebody please start a crawler and build such a catalog?

As long as nobody does this, I will keep on using CKAN.

Cheers,

Chris
 

> -----Original Message-----
> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
> On Behalf Of Kingsley Idehen
> Sent: Wednesday, 20 October 2010 20:30
> To: Thomas Steiner
> Cc: Martin Hepp; Chris Bizer; Semantic Web; public-lod@w3.org; Anja
> Jentzsch; semanticweb@yahoogroups.com
> Subject: Re: AW: ANN: LOD Cloud - Statistics and compliance with best
> practices
> 
> On 10/20/10 2:13 PM, Thomas Steiner wrote:
> > Hi all,
> >
> > How about handling GoodRelations the same way as FOAF, representing it
> > as a somewhat existing bubble without exactly specifying where it
> > links to and from where inbound links come from (on the road right
> > now, so can't check for sure whether it is already done this way)? The
> > individual datasets are too small to be entered manually into CKAN (+1
> > for Martin's arguments here).
> > In the end, the idea of a Web catalogue was mostly abandoned at some
> > point due to being unmanageable, maybe the same happens to the Web
> > /data/ "catalogue", aka. LOD cloud (the metaphor doesn't work
> > perfectly, but you get the point).
> >
> > Martin's point as I get it is that GR forms part of the Web of data.
> > Currently this is (about to be) honored by search engines and the
> > like, GR-enabled price/product comparison engines etc. are probably
> > being worked on (or are already live?), so Linked Open Commerce (well,
> > an aspect of it) will be/is real soon/now. Whether/how GR forms part
> > of the LOD cloud is a secondary, if at all, question in my humble
> > opinion.
> >
> > All this is my private point of view, my Google hat completely off.
> > Sorry for the many slashes/alternative sentence endings.
> 
> This is why we opted to make a LOC (Linked Open Commerce) pictorial [1]
> that connects to LOD. In short, I would encourage all Linked Data
> publishers and curators to embark upon similar endeavors, as long as
> they accurately depict their specific Linked Data slant and
> contributions. Remember, this is about the Web, LOD is just one of many
> Linked Data clusters within the burgeoning Web of Linked Data :-)
> 
> Links:
> 
> 1. http://linkedopencommerce.com -- this space includes a variety of
> purpose-specific Linked Data pictorials.
> 
> Kingsley
> > Best,
> > Tom
> >
> > Thank God not sent from a BlackBerry, but from my iPhone
> >
> > On 20.10.2010, at 19:16, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:
> >
> >> Hi Chris:
> >>
> >> First, I think it is pretty funny that you list Denny's April's fool
> >> dataset of creating triples for numbers as an acceptable part of the cloud,
> >>
> >>     http://ckan.net/package/linked-open-numbers
> >>
> >> <Picture 39.png>
> >> (right next to WordNet)
> >>
> >> The fundamental mistake of what you say is that linked open e-commerce
> >> data is not "a dataset" but a wealth of smaller datasets. Asking me to create
> >> CKAN entries for each store or business in the world that provides
> >> GoodRelations data is as if Google was asking any site owner in the world to
> >> register his or her site manually via CKAN.
> >>
> >> That is 1990s style and does not have anything to do with a "Web" of data.
> >>
> >>> 1. Data items are accessible via dereferencable URIs (providing only access
> >>> via SPARQL is not enough, as linked data browsers and search engines cannot
> >>> work with SPARQL endpoints)
> >> Is HTML + RDFa with hash fragments, available via HTTP GET
> >> "dereferencable" for you? E.g.
> >>
> >>    http://stores.bestbuy.com/10/
> >>
> >> If yes, fine. If not - why? IMO, HTML with RDFa payload does not break
> >> any fundamental principles of the Web architecture.
> >>
> >>
> >>> 2. The dataset sets at least 50 RDF links pointing at other datasets or at
> >>> least one other dataset is setting 50 RDF links pointing at your dataset.
> >>
> >> This is often hard to meet and seems like a very artificial requirement to me.
> >>
> >> First, many small datasets may be just 50 triples in total. Why should a
> >> hairdresser in Kentucky, exposing its description in GoodRelations + RDFa
> >> have 50 outbound links? What should this beauty store in CA exposing 800
> >> triples do to qualify as linked data?
> >>
> >> http://www.plushbeautybar.com/services.html
> >>
> >> Second, what kind of links to core LOD entities do you expect from shop
> >> operators? For example, take
> >>
> >>     http://semantic.eurobau.com/
> >>
> >> That dataset contains some 30 million triples of construction-materials
> >> information. Which links to DBpedia would you reasonably expect? Is this
> >> Linked Data in your opinion or not? If not, why?
> >>
> >> To be frank, I think the bubbles diagram fundamentally misses the point in
> >> the sense that the power of linked data is in integrating a huge amount of
> >> small, specific data sources, and not in linking a manually maintained blend of
> >> ca. 100 monolithic datasets.
> >>
> >> Integrating 100 datasets does not have anything to do with Web-scale
> >> information integration. Note that Google estimated back in 2008 that there
> >> were ca. 1 trillion URIs in their index alone. So what are 100 manually
> >> converted datasets in comparison to that?
> >>
> >> Best
> >>
> >> Martin
> >>
> >> On 20.10.2010, at 08:49, Chris Bizer wrote:
> >>
> >>> Hi Martin,
> >>>
> >>> we are not ignoring anything.
> >>>
> >>> I personally think that http://linkedopencommerce.com/ is a quite exciting
> >>> effort and would love to see more e-commerce data in the LOD cloud.
> >>>
> >>> We have asked the community repeatedly to provide information on CKAN about
> >>> datasets that they would like to have included in the LOD cloud.
> >>>
> >>> You did not do this. And at that time, we had not yet heard about
> >>> http://linkedopencommerce.com/.
> >>>
> >>> It would be great if you would add information about your dataset(s) to
> >>> CKAN, so that we can include it in the next version of the cloud diagram.
> >>>
> >>> Of course given that they fulfill the minimal requirements for inclusion,
> >>> which are:
> >>>
> >>> 1. Data items are accessible via dereferencable URIs (providing only access
> >>> via SPARQL is not enough, as linked data browsers and search engines cannot
> >>> work with SPARQL endpoints)
> >>> 2. The dataset sets at least 50 RDF links pointing at other datasets or at
> >>> least one other dataset is setting 50 RDF links pointing at your dataset.
> >>>
> >>> Cheers,
> >>>
> >>> Chris
> >>>
> >>> -----Original Message-----
> >>> From: Martin Hepp [mailto:martin.hepp@ebusiness-unibw.org]
> >>> Sent: Tuesday, 19 October 2010 22:09
> >>> To: Anja Jentzsch; Chris Bizer
> >>> Cc: Semantic Web; semanticweb@yahoogroups.com
> >>> Subject: Re: ANN: LOD Cloud - Statistics and compliance with best practices
> >>>
> >>> Hi Anja, Chris:
> >>>
> >>> It's kind of a joke that you ignore the 1 billion triples of
> >>> GoodRelations data on the Web, e.g. available at
> >>>
> >>>   http://linkedopencommerce.com/
> >>>
> >>> or
> >>>
> >>>   http://www.ebusiness-unibw.org/wiki/GoodRelations#Examples_in_the_Wild
> >>>
> >>> Martin
> >>>
> >>>
> >>> On 19.10.2010, at 17:56, Anja Jentzsch wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> over the last few weeks, we have analyzed which data sources in the new
> >>>> version of the LOD cloud comply with various best practices that are
> >>>> recommended by W3C or have emerged within the LOD community.
> >>>>
> >>>> We have checked the implementation of the following nine best
> >>>> practices:
> >>>>
> >>>> 1. Provide dereferencable URIs
> >>>> 2. Set RDF links pointing at other data sources
> >>>> 3. Use terms from widely deployed vocabularies
> >>>> 4. Make proprietary vocabulary terms dereferencable
> >>>> 5. Map proprietary vocabulary terms to other vocabularies
> >>>> 6. Provide provenance metadata
> >>>> 7. Provide licensing metadata
> >>>> 8. Provide data-set-level metadata
> >>>> 9. Refer to additional access methods
> >>>>
> >>>> The compliance with the best practices was either checked manually
> >>>> or by using scripts that downloaded and analyzed some data from the
> >>>> data sources.
> >>>> We have added the results of the evaluation in the form of tags to
> >>>> the LOD data set catalog on CKAN [1].
> >>>>
> >>>> We are now happy to release the first statistics about the structure
> >>>> of the LOD cloud as well as the compliance of the datasets with the
> >>>> best practices.
> >>>> The statistics can be found here:
> >>>>
> >>>> http://www4.wiwiss.fu-berlin.de/lodcloud/state/
> >>>>
> >>>> The document contains an initial, preliminary release of the
> >>>> statistics. If you spot any errors in the data describing the LOD
> >>>> data sets on CKAN, it would be great if you would correct them
> >>>> directly on CKAN.
> >>>>
> >>>> For information on how to describe datasets on CKAN please refer to
> >>>> the Guidelines for Collecting Metadata on Linked Datasets in CKAN [2].
> >>>>
> >>>> After your feedback and corrections, we will then move the corrected
> >>>> version of the statistics to http://www.lod-cloud.net/ (around
> >>>> October 24th).
> >>>>
> >>>> Have fun with the statistics and the encouraging as well as
> >>>> disappointing insights that they provide.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Chris Bizer, Anja Jentzsch and Richard Cyganiak
> >>>>
> >>>> [1] http://www.ckan.net/group/lodcloud
> >>>> [2] http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation
> >>>>
> >>>
> >
> 
> 
> --
> 
> Regards,
> 
> Kingsley Idehen
> President & CEO
> OpenLink Software
> Web: http://www.openlinksw.com
> Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca: kidehen
> 
> 
> 
> 
Received on Thursday, 21 October 2010 11:12:14 UTC
