Re: Contd: Visualizing LOD Linkage from Aldo Bucchi on 2008-08-02 (public-lod@w3.org from August 2008)

From: Aldo Bucchi <aldo.bucchi@gmail.com>
Date: Sat, 2 Aug 2008 17:35:22 -0400
To: "Giovanni Tummarello" <giovanni.tummarello@deri.org>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <7a4ebe1d0808021435y47caf431ya4843e979c5bce90@mail.gmail.com>

Hi Giovanni,

On Sat, Aug 2, 2008 at 2:27 PM, Giovanni Tummarello
<giovanni.tummarello@deri.org> wrote:
>
> Hi guys,
>
> I think visualizing the linkage between datasets is an interesting
> goal. To this end we're going to be starting some experiments within
> Sindice to try to come up with a complete and automatic map of such
> linkings.
>
> For once it's not just the "cool" factor we're after: such a map
> should help application implementers understand what kind of queries
> they can perform and what they can hope to obtain.

Sounds great. Sindice is going to become a key piece in moving this forward.

IMO, even better than a static visualization (which would not scale
well and would probably become too complex to interpret), what is
needed is a tool.
Are you planning on opening some sort of VoiD introspection API for
the indexed content in Sindice?

This "tool" sounds like a place for community work and a lot of trial
and error, so it would be nice to have a solid API to start from.

Moving forward a bit, I imagine that SPARQL query builders like
OpenLink's could eventually become integrated with such an API to allow
some introspection into the available datasets. Off the top of my
head: autocomplete when composing, activate/deactivate datasets when
executing, etc.

And just a guess... this API shouldn't belong to Sindice. It should be
part of the semweb... introspection. Agents like Sindice should
announce themselves and be discoverable (this is obviously something
old). One new DNS-ish layer. And it should be RDF too.


>
> The technologies involved in such analysis are pretty cool: a Hadoop
> job reads over our HBase repositories of RDF and microformats and
> should compose the map. Think of how much more could be done in terms
> of analysis with such technologies and the 28 core cluster we
> have now (and hopefully with the 800 core cluster we might have in 2-3
> months, more news later).
>
> Fancy writing such Hadoop jobs and delivering cool large-scale APIs
> to the linked data communities? Why not come over for an internship or
> a visiting research period? The infrastructure is available and
> we're open to sharing it. More stable positions are also available.
>
> Giovanni
>
> P.S. In the meantime we should soon be able to show a query that
> lists all the sameAs connections from one dataset to another,
> hopefully in a few days.
>
>
> On Sat, Aug 2, 2008 at 5:23 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>
>> Oktie Hassanzadeh wrote:
>>>
>>> Yves Raimond wrote:
>>>>
>>>> Hello!
>>>>
>>>>
>>>>>
>>>>> I would like to suggest that publishers of new linked data spaces
>>>>> that plug into the growing LOD include the following:
>>>>>
>>>>> 1. cross-link information
>>>>>
>>>>
>>>> I would also suggest we find a better measure for interlinkage than a
>>>> raw number of triples linking one dataset to another.
>>>> For example, http://dbtune.org/musicbrainz/ creates its own identifier
>>>> for languages (http://dbtune.org/musicbrainz/directory/language),
>>>> which are owl:sameAs'ed to the corresponding languages in Lingvoj when
>>>> applicable, whereas linkedmdb directly links to the Lingvoj
>>>> identifiers. In the latter case, the raw number of interlinks will be
>>>> higher, but could be reduced a lot by creating identifiers for
>>>> languages and using sameAs.
>>>>
>>>> The same applies for geographic locations, for example. Some datasets
>>>> use foaf:based_near to link to Geonames, some others create their own
>>>> identifiers, and then link to the corresponding Geonames locations
>>>> through owl:sameAs. For the same dataset, these two methodologies will
>>>> lead to completely different numbers.
>>>>
>>>> To boost the statistics of a dataset, we could simply link each person
>>>> or group in them to http://dbpedia.org/class/yago/Entity100001740
>>>> through rdf:type :-D
>>>>
>>>> So I think we should agree on what we count as "interlinks" before
>>>> publishing such statistics, so that we can actually use these values?
>>>>
>>>> My recommendation would be to always go for the lowest value - the one
>>>> you'd obtain by creating your own identifiers and using owl:sameAs
>>>> (which would be equivalent to the number of distinct external URIs
>>>> mentioned in your dataset).
>>>>
>>>> What do you think?
>>>>
>>>> Cheers!
>>>> y
>>>>
>>>>
>>>
>>> I totally agree! Some interlinks are not as valuable as others. That's why
>>> we report the number of links based on their type and target and also we
>>> store and publish data about the linkage methodology. I also believe we
>>> should be honest about the value of the interlinks.
>>>
>>> Apart from the links to languages and geographic locations, another
>>> example of such "easy" links is the links we have in LinkedMDB to the
>>> authors of books in RDF Book Mashup, which are based only on the names
>>> of the authors, compared with the links to the books related to the
>>> movies, for which we have to match the titles and find the ISBNs of the
>>> books. I just changed LinkedMDB's statistics [1] to show two different
>>> numbers for these links.
>>>
>>> Regarding languages, I was not sure which is the right way, to link
>>> directly to Lingvoj or to have our own entities for languages, but after
>>> reading some discussions like [2], we decided to link directly to Lingvoj.
>>>
>>>
>>> Regards,
>>> Oktie
>>>
>>> [1] http://www.linkedmdb.org:8080/Main/Statistics
>>> [2] http://esw.w3.org/topic/Languages_as_RDF_Resources
>>>
>> Oktie,
>>
>> Re. sample entities, could you sprinkle out a few sample entity URIs from
>> your data space?  For instance, a third column with a drop down should do
>> the trick.
>>
>> --
>>
>>
>> Regards,
>>
>> Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
>> President & CEO OpenLink Software     Web: http://www.openlinksw.com
>>
>>
>>
>>
>
>

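To make Yves's suggestion quoted above concrete (count distinct
external URIs rather than raw linking triples), here is a minimal
Python sketch. It is only an illustration: the example.org and
example.net URIs, the choice of dcterms:language as the predicate, and
the "different host means external" rule are all assumptions, not
anything taken from the datasets discussed in this thread.

```python
from urllib.parse import urlparse

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
LANG = "http://purl.org/dc/terms/language"  # illustrative predicate


def interlink_counts(triples, local_host):
    """Return (raw_links, distinct_external_uris) for one dataset.

    A triple counts as a link when its subject is on local_host and
    its object is a URI on some other host (an assumed definition).
    """
    raw_links = 0
    external_uris = set()
    for subject, _predicate, obj in triples:
        if not obj.startswith(("http://", "https://")):
            continue  # literal value, not a link
        if urlparse(subject).netloc != local_host:
            continue  # subject belongs to another dataset
        if urlparse(obj).netloc != local_host:
            raw_links += 1
            external_uris.add(obj)
    return raw_links, len(external_uris)


# Hypothetical data: three films, all in the same language.
FILMS = [f"http://data.example.org/film/{i}" for i in (1, 2, 3)]
EXTERNAL_EN = "http://lang.example.net/en"
LOCAL_EN = "http://data.example.org/language/en"

# Style A: every film points straight at the external identifier.
direct = [(film, LANG, EXTERNAL_EN) for film in FILMS]

# Style B: films point at a local identifier, linked once via sameAs.
indirect = [(film, LANG, LOCAL_EN) for film in FILMS]
indirect.append((LOCAL_EN, OWL_SAME_AS, EXTERNAL_EN))

print(interlink_counts(direct, "data.example.org"))    # (3, 1)
print(interlink_counts(indirect, "data.example.org"))  # (1, 1)
```

The raw count differs between the two modelling styles (3 versus 1),
while the distinct-URI count is 1 in both cases, which is the
stability Yves is asking for.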


-- 
:::: Aldo Bucchi ::::
+56 9 7623 8653
skype:aldo.bucchi
http://aldobucchi.com/
Received on Saturday, 2 August 2008 21:35:59 UTC