Re: Auto-generated owl:SameAs links between the RDF Book Mashup and the DBLP database from Chris Bizer on 2006-12-08 (semantic-web@w3.org from December 2006)

From: Chris Bizer <chris@bizer.de>
Date: Fri, 8 Dec 2006 09:38:53 +0100
To: "T.Heath" <T.Heath@open.ac.uk>, <semantic-web@w3.org>
Message-ID: <001e01c71aa4$4b3c9a10$c4e84d57@named4gc1asnuj>
Hi Tom,

> [off-list reply, though happy to take it on list ;)]

Very interesting question, which should certainly be taken to the list.

>Hey Chris,
>
>This is cool, and obviously the start of what's going to be a hugely 
>important aspect of the SW.
> However, I have a reservation about the heuristic you're using to generate 
> the owl:sameAs links,
> which primarily comes down to the assumption that Amazon and DBLP cover 
> sufficiently
> similar domains.
>
> Once the WikiSym2006 proceedings [1] get added to DBLP, I'll exist 
> uniquely in that database
> as the only Tom Heath. "Tom Heath" also exists uniquely as an author on 
> Amazon [2], but this
> is not me. According to the current heuristic, the bookmashup would say 
> that
> <http://kmi.open.ac.uk/people/tom/uri> owl:sameAs 
> <http://that-tom-heath-on-amazon/>,
> which really isn't the case.
>
> I have no idea how common this situation would be, but I think a more 
> sophisticated
> approach is needed if we're going to avoid littering the Semantic Web with 
> sameAs links
> that don't stand up.
>
> Interesting stuff. What dya reckon? :)

I think that as the Semantic Web moves from toy examples to real-world 
data sources, auto-generated links will become very important for gluing 
instances in separate data sources together and for realizing the Semantic 
Web as a single interlinked information space instead of separate data 
islands. I also think that the ability to set typed links between data 
sources at the instance level is one of the most important factors 
distinguishing the Semantic Web from the current Web 2.0 information 
ecosystem.

But yes, you are right, our heuristic is way too simple (which I think is OK 
for a first prototype).

Your question thus triggers two interesting problems: Better heuristics and 
trust.

There is a lot of interesting work in the database community on object 
identification and duplicate detection which could be used to implement 
better heuristics. There was also a workshop on ontology matching at ISWC, 
http://www.om2006.ontologymatching.org/ (which I didn't attend), and I 
guess these guys should also have some good solutions in their drawers. Yes?

I think for our Book Mashup/DBLP use case, a better heuristic could rely on 
the similarity of book and paper titles or on co-author relationships. 
Any further suggestions? Preferably in the form of PHP code ;-)
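
A minimal sketch of such a heuristic, in Python rather than PHP. It assumes 
the titles and co-author names for both candidates have already been 
fetched. The function names and the 0.3 threshold are hypothetical choices 
for illustration, not part of the Book Mashup.

```python
import re


def tokens(title):
    """Lower-case a title and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_title_similarity(book_titles, paper_titles):
    """Highest token overlap between any book title and any paper title."""
    return max(
        (jaccard(tokens(b), tokens(p)) for b in book_titles for p in paper_titles),
        default=0.0,
    )


def likely_same_person(book_titles, paper_titles,
                       book_coauthors, paper_coauthors,
                       title_threshold=0.3):
    """Accept a name match only if there is supporting evidence.

    title_threshold is an arbitrary illustrative value, not a tuned one.
    """
    # A shared co-author is treated as sufficient evidence on its own.
    if set(book_coauthors) & set(paper_coauthors):
        return True
    # Otherwise fall back to the overlap between book and paper titles.
    return best_title_similarity(book_titles, paper_titles) >= title_threshold
```

A real implementation would probably drop stop words and weight rare terms 
more heavily; plain token overlap is only the simplest starting point.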

Radek, a colleague of mine, developed SemMF, a semantic matching framework 
which might also be useful in this context 
(http://sites.wiwiss.fu-berlin.de/suhl/radek/semmf/), though I think the 
framework assumes that all data to be matched sits in a single repository, 
which might be unrealistic when you talk about huge data sources like Google 
Base or Amazon. Thus, a new requirement for matching techniques: identify a 
corresponding instance when you only have a SPARQL endpoint and have to 
avoid database dumps.
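
As a rough illustration of endpoint-only matching, the sketch below builds 
one small SPARQL query per name instead of loading a dump. It assumes the 
remote data describes people as foaf:Person with a foaf:name. The helper 
names are hypothetical, and sending the query to the endpoint is left out.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal.

    Simplified: control characters such as newlines are not handled.
    """
    return text.replace("\\", "\\\\").replace('"', '\\"')


def candidate_query(name, limit=2):
    """Build a SPARQL query for persons with the given foaf:name.

    LIMIT 2 is enough to tell "exactly one match" from "ambiguous"
    without pulling more data than needed from the endpoint.
    """
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT DISTINCT ?person WHERE {\n"
        f'  ?person a foaf:Person ; foaf:name "{escape_literal(name)}" .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def unique_match(result_uris):
    """Return the single URI if exactly one distinct URI came back, else None."""
    distinct = set(result_uris)
    return next(iter(distinct)) if len(distinct) == 1 else None
```

The point of the design is that each lookup costs one bounded query, so the 
matcher never needs a local copy of the remote data set.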

The other aspect triggered by your question is trust. We cannot expect all 
information on the Semantic Web to be true, nor all information providers to 
use sophisticated algorithms to set links. Thus our client tools have to be 
capable of dealing with different kinds of junk (including my owl:sameAs 
links).

We did some work on policy-based information filtering 
(http://sites.wiwiss.fu-berlin.de/suhl/bizer/wiqa/browser/index.htm) which 
might be useful in this context. For instance, one could imagine a client 
using a policy like "Trust the Book Mashup about books and reviews, but 
forget about its owl:sameAs links".
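
To make the example policy concrete, here is a small sketch of a client-side 
filter. It is not the WIQA API: the tuple layout (subject, predicate, object, 
source) and the policy dictionary are assumptions made up for this 
illustration.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
BOOK_MASHUP = "http://www4.wiwiss.fu-berlin.de/bookmashup/"

# Hypothetical policy format: trusted source prefix -> predicates to ignore.
POLICY = {BOOK_MASHUP: {OWL_SAME_AS}}


def accept(statement, policy):
    """Decide whether to keep one (subject, predicate, object, source) tuple."""
    _subject, predicate, _object, source = statement
    for source_prefix, blocked_predicates in policy.items():
        if source.startswith(source_prefix):
            # Trusted source, except for the predicates the policy blocks.
            return predicate not in blocked_predicates
    # Sources not named in the policy are not trusted in this sketch.
    return False


def filter_statements(statements, policy):
    """Keep only the statements that the policy accepts."""
    return [s for s in statements if accept(s, policy)]
```

With this policy, a foaf:name statement from the Book Mashup is kept, while 
an owl:sameAs statement from the same source is dropped.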

> Tom.
>
> PS. On a related note, I'd be really interested to try hooking up 
> Revyu.com book reviews [3] to
> the BookMashup. It would take some syntactic parsing tricks, but using 
> your RDF/XML would
> enable me to just use RAP to handle the data, rather than rolling my own 
> Amazon parser. Nice.

Sure, do you have a SPARQL endpoint for Revyu.com? Then the Book Mashup 
could query your site when it generates a book description and integrate 
your reviews into the description.

> PPS. As an aside to this issue of disambiguation, even reliable background 
> knowledge sources
> with which to disambiguate names may be hard to find. The books on Amazon 
> by the other Tom
> Heath are about "Crosby, Seaforth and Waterloo", parts of Liverpool. I 
> happened to live in
> Liverpool at the time the books were published, and there is plenty of 
> info on the web linking
> me to Liverpool, so even a human being casually browsing could have a 
> guess that I wrote
> those books.

As I guess "Crosby, Seaforth and Waterloo" doesn't appear in the title of 
your paper, maybe comparing book and paper titles could be an OK heuristic 
for our use case.

That's what I like about publishing lots of real-world data on the Semantic 
Web instead of the usual toy examples: The data lets you discover the real 
problems and ask the real questions that we have to solve in order to make 
the Semantic Web work.

Cheers

Chris


> [1] 
> <http://portal.acm.org/toc.cfm?id=1149453&type=proceeding&coll=Portal&dl=ACM&CFID=8309673&CFTOKEN=96281626>
> [2] 
> <http://www.amazon.co.uk/exec/obidos/search-handle-url/203-4998033-8521554?%5Fencoding=UTF8&search-type=ss&index=books-uk&field-author=Tom%20Heath>
> [3] http://revyu.com/tags/book
>
>
> -----Original Message-----
> From: semantic-web-request@w3.org
> [mailto:semantic-web-request@w3.org] On Behalf Of Chris Bizer
> Sent: 05 December 2006 11:09
> To: semantic-web@w3.org
> Subject: Auto-generated owl:SameAs links between the RDF Book
> Mashup and the DBLP database
>
>
>
>
> Hi,
>
> a central strength of the Semantic Web is that it allows you
> to set links between information about the same object within
> multiple data sources.
>
> Our RDF book mashup [1] generates RDF descriptions about
> books and their authors. A second publicly available
> bibliographic data source is the DBLP database containing
> journal articles and conference papers. The DBLP database is
> published as linked data by a D2R Server at
> http://www4.wiwiss.fu-berlin.de/dblp/.
>
> In order to demonstrate links between different data sources,
> we have added another feature to the RDF book mashup: The
> mashup now automatically generates owl:sameAs links between
> book authors and paper authors in the DBLP database. Using
> Tabulator, these links allow you to navigate from the
> description of the author of a book to his papers in the DBLP
> database.
>
> The links are generated by asking the SPARQL-endpoint of the
> DBLP database for URIs identifying book authors. If the query
> for a foaf:Person with a specific name returns only one
> result, then, as both domains are related, we assume it is
> likely enough that we have hit the right person to set the
> owl:sameAs link.
>
> An example of such an auto-generated owl:sameAs link is found
> in the data about Tim Berners-Lee:
> http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee
>
> <foaf:Person rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee">
>    <owl:sameAs rdf:resource="http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007"/>
>    <foaf:name>Tim Berners-Lee</foaf:name>
> </foaf:Person>
>
>
> Cheers,
>
> Chris
>
> [1] http://sites.wiwiss.fu-berlin.de/suhl/bizer/bookmashup/index.html
>
>
> --
> Chris Bizer
> Freie Universität Berlin
> Phone: +49 30 838 54057
> Mail: chris@bizer.de
> Web: www.bizer.de
>
>
>
>
>
Received on Friday, 8 December 2006 08:39:04 UTC