Re: [HELP] Can you please update information about your dataset? from Richard Cyganiak on 2009-08-12 (public-lod@w3.org from August 2009)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Wed, 12 Aug 2009 13:10:22 +0100
To: Hugh Glaser <hg@ecs.soton.ac.uk>
Cc: Aldo Bucchi <aldo.bucchi@gmail.com>, Kingsley Idehen <kidehen@openlinksw.com>, Leigh Dodds <leigh.dodds@talis.com>, Jun Zhao <jun.zhao@zoo.ox.ac.uk>, "public-lod@w3.org" <public-lod@w3.org>, Anja Jentzsch <anja@anjeve.de>
Message-Id: <7E80E8B1-03BF-4FFE-806D-816ECB66BAC1@cyganiak.de>
The problem at hand is: How to get reasonably accurate and up-to-date  
statistics about the LOD cloud?

I see three workable methods for this.

1. Compile the statistics from voiD descriptions published by
individual dataset maintainers. This is what Hugh proposes below.
Enabling this is one of the main reasons why we created voiD. There
have to be better tools for creating voiD before this happens. The
tools could be, for example, manual entry forms that spit out voiD
(voiD-o-matic?), or analyzers that read a dump and spit out a skeleton
voiD file.

2. Hand-compile the statistics by watching public-lod, trawling  
project home pages, emailing dataset maintainers, and fixing things  
when dataset maintainers complain. This is how I created the original  
LOD cloud diagram in Berlin, and after I left Berlin, Anja has done a  
great job keeping it up to date despite its massive growth. We will  
continue to update it on a best-effort basis for the foreseeable  
future. A voiD version of the information underlying the diagram is in  
the pipeline. Others can do as we did.

3. Anyone who has a copy of a big part of the cloud (e.g. OpenLink and
we at Sindice) can potentially calculate the statistics. This is
non-trivial for three reasons: we just have triples and need to
reverse-engineer datasets and linksets from them; it involves
computation over quite serious amounts of data; and in the end you
still won't have good labels or homepages for the datasets. While this
approach is possible, it seems to me that there are better uses of
engineering and research resources.
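
To make the "analyzer that reads a dump" idea from method 1 and the
reverse-engineering step from method 3 a little more concrete, here is
a minimal sketch. It assumes the dump is in N-Triples, that a dataset
can be approximated by the host name of its subject URIs, and that a
link is any triple whose object URI lives on a different host. The
names analyze, skeleton_void, home_host, <#dataset> and <#linksetN> are
hypothetical, and the void: terms used are those of the voiD vocabulary
as later published, which may differ from the draft current at the
time of this thread.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches simple N-Triples lines whose subject and predicate are IRIs.
# Blank-node subjects, comments and malformed lines are skipped
# (a simplifying assumption of this sketch, not a full parser).
TRIPLE_RE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')
IRI_RE = re.compile(r'^<([^>]*)>$')


def analyze(lines, home_host):
    """Count triples, distinct subjects, and outgoing links per foreign host."""
    triples = 0
    subjects = set()
    links = Counter()
    for line in lines:
        match = TRIPLE_RE.match(line)
        if not match:
            continue
        subj, _pred, obj = match.groups()
        triples += 1
        subjects.add(subj)
        obj_match = IRI_RE.match(obj)
        # Only IRI objects on another host count as links out of the dataset.
        if obj_match and urlparse(subj).netloc == home_host:
            target = urlparse(obj_match.group(1)).netloc
            if target and target != home_host:
                links[target] += 1
    return triples, len(subjects), links


def skeleton_void(triples, n_subjects, links):
    """Render a skeleton voiD description in Turtle, to be completed by hand."""
    out = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "",
        "<#dataset> a void:Dataset ;",
        f"    void:triples {triples} ;",
        f"    void:distinctSubjects {n_subjects} .",
    ]
    for i, (host, count) in enumerate(sorted(links.items()), start=1):
        out += [
            "",
            f"# Links into {host}; the target dataset must be filled in by hand.",
            f"<#linkset{i}> a void:Linkset ;",
            "    void:subjectsTarget <#dataset> ;",
            f"    void:triples {count} .",
        ]
    return "\n".join(out)
```

Calling skeleton_void(*analyze(open("dump.nt"), "example.org")) on a
hypothetical dump would give a file with counts filled in. Labels,
homepages and the identity of the link targets are still missing, which
is the limitation described in method 3.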

There is a fourth process that, IMO, does NOT work:

4. Send an email to public-lod asking "Everyone please enter your  
dataset in this wikipage/GoogleSpreadsheet/fancyAppOfTheWeek."

Best,
Richard


On 11 Aug 2009, at 22:07, Hugh Glaser wrote:
> If any more work is to be put into generating this picture, it  
> really should be from voiD descriptions, which we already make  
> available for all our datasets.
> And for those who want to do it by hand, a simple system to allow  
> them to specify the linkage using voiD would get the entry into a  
> format for the voiD processor to use (I'm happy to host the data if  
> need be).

> Or Aldo's system could generate its RDF using the voiD ontology,  
> thus providing the manual entry system?
>
> I know we have been here before, and almost got to the voiD  
> processor thing:- please can we try again?
>
> Best
> Hugh
>
> On 11/08/2009 19:00, "Aldo Bucchi" <aldo.bucchi@gmail.com> wrote:
>
> Hi,
>
> On Aug 11, 2009, at 13:46, Kingsley Idehen <kidehen@openlinksw.com>
> wrote:
>
>> Leigh Dodds wrote:
>>> Hi,
>>>
>>> I've just added several new datasets to the Statistics page that
>>> weren't previously listed. It's not really a great user experience
>>> editing the wiki markup and manually adding up the figures.
>>>
>>> So, thinking out loud, I'm wondering whether it might be more
>>> appropriate to use a Google spreadsheet and one of their submission
>>> forms for the purposes of collecting the data. A little manual
>>> editing to remove duplicates might make managing this data a little
>>> easier. Especially as there are also pages that separately list
>>> the available SPARQL endpoints and RDF dumps.
>>>
>>> I'm sure we could create something much better using voiD, etc., but
>>> for now, maybe using a slightly better tool would give us a little
>>> more progress? It'd be a snip to dump out the Google Spreadsheet data
>>> programmatically too, which'd be another improvement on the current
>>> situation.
>>>
>>> What does everyone else think?
>>>
>> Nice Idea! Especially as Google Spreadsheet to RDF is just about
>> RDFizers for the Google Spreadsheet API :-)
>
> Hehe. I have this in my todo (literally). A website that exposes a
> Google spreadsheet as a SPARQL endpoint. Internally we use it as a UI
> to quickly create config files et al.
> But it will remain in my todo forever...;)
>
> Kingsley, this could be sponged. The trick is that the spreadsheet
> must have an accompanying page/sheet/book with metadata (the NS or
> explicit URIs for cols).
>
>>
>> Kingsley
>>> Cheers,
>>>
>>> L.
>>>
>>> 2009/8/7 Jun Zhao <jun.zhao@zoo.ox.ac.uk>:
>>>
>>>> Dear all,
>>>>
>>>> We are planning to produce an updated data cloud diagram based on the
>>>> dataset information on the esw wiki page:
>>>> http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics
>>>>
>>>> If you have not published your dataset there yet and you would like your
>>>> dataset to be included, can you please add your dataset there?
>>>>
>>>> If you have an entry there for your dataset already, can you please update
>>>> the information about your dataset on the wiki?
>>>>
>>>> If you cannot edit the wiki page any more because of the recent update of
>>>> the esw wiki editing policy, you can send the information to me or Anja,
>>>> who is cc'ed. We can update it for you.
>>>>
>>>> If you know friends who have datasets on the wiki but are not on the
>>>> mailing list, can you please kindly forward this email to them? We would
>>>> like to get the data cloud as up-to-date as possible.
>>>>
>>>> For this release, we will use the above wiki page as the information
>>>> gathering point. We do apologize if you have published information about
>>>> your dataset on other web pages and this request would mean extra work for
>>>> you.
>>>>
>>>> Many thanks for your contributions!
>>>>
>>>> Kindest regards,
>>>>
>>>> Jun
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>>
>> Regards,
>>
>> Kingsley Idehen          Weblog: http://www.openlinksw.com/blog/~kidehen
>> President & CEO OpenLink Software     Web: http://www.openlinksw.com
>>
>>
>>
>>
>>
>
>
>
Received on Wednesday, 12 August 2009 12:19:03 UTC