Re: AW: ANN: LOD Cloud - Statistics and compliance with best practices from Giovanni Tummarello on 2010-10-21 (public-lod@w3.org from October 2010)

From: Giovanni Tummarello <giovanni.tummarello@deri.org>
Date: Thu, 21 Oct 2010 13:12:10 +0100
To: Chris Bizer <chris@bizer.de>
Cc: Martin Hepp <martin.hepp@ebusiness-unibw.org>, Thomas Steiner <tsteiner@google.com>, Semantic Web <semantic-web@w3.org>, public-lod <public-lod@w3.org>, Anja Jentzsch <anja@anjeve.de>, semanticweb <semanticweb@yahoogroups.com>, Kingsley Idehen <kidehen@openlinksw.com>
Message-ID: <AANLkTi=0VTwhMS+BJ=s+q9n5+oboDjwVey=HgUvHaV4T@mail.gmail.com>

> But again: I agree that crawling the Web of Data and then deriving a dataset
> catalog as well as meta-data about the datasets directly from the crawled
> data would be clearly preferable and would also scale way better.
>
> Thus: Could please somebody start a crawler and build such a catalog?
>
> As long as nobody does this, I will keep on using CKAN.
>

Hi Chris, all

I can only restate that within Sindice we're very open to anyone who
wants to develop data analysis apps that create catalogs automatically.
A map reduce job a couple of weeks ago gave in excess of 100k
independent datasets. How many are interlinked etc. remains to be analyzed.

Our interest (and the interest of the Semantic Web vision I want to
sponsor) is to make sure RDFa sites are fully included, and so are those
that provide markup which can be translated in an
automatic/agreeable way (so no scraping or "sponging") into RDF (that
is, anything that any23.org can turn into triples).

If you were indeed interested in running or developing your
algorithms on our running dataset, no problem; the code can be made
open source so it would run on other similarly structured datasets.

That said, yes, I too think that in this phase a CKAN-like repository
can be an interesting aggregation point, why not.

But I do think the diagram, which made great sense as an example when
Richard started it, is now at risk of doing a disservice,
which is in line with what Martin is pointing out.

The diagram as it is now kinda implicitly conveys the sense that if
something is so large then all that matters must be there and that's
absolutely not the case.

a) there are plenty of extremely useful datasets in RDF/RDFa etc. which
are not there
b) the usefulness of being linked is anything but a proven fact, so on the
one hand people might want to "be there", on the other you'd have to
push serious commercial entities (for example) to "link to
DBpedia" for reasons that aren't clear, and that hurts your credibility.

So Danny Ayers has fun linking to DBpedia so he is in there with his
joke dataset, but you can't credibly bring that argument to large
retailers so they're left out?

This would be ok if the diagram were just "hey, it's my own thing, I set
my rules" - fine, but the fanfare around it gives it a different
meaning and thus the controversy above.

.. just tried to put in words what might be a general unspoken feeling..

Short message recap
a) CKAN - nice, why not, might be useful but..
b) generated diagram: we have the data or can collect it, so whoever
is interested in analytics please let us know and we can work it out
(as a matter of fact, it turns out most of us in here are paid by the EU
for doing this in collaborative projects :-) )

cheers
Giovanni

Received on Thursday, 21 October 2010 12:12:39 UTC