RE: Auto-generated owl:SameAs links between the RDF Book Mashup and the DBLP database

Hi Chris,

I think you're right on all counts: we need these links, we need (at least most of) them to be autogenerated, we won't be able to trust a lot of them, but we won't always know which ones. The trust issue needs ongoing work for sure, but I reckon we can start to have a go at the heuristics.

> I think for our Book Mashup/DBLP use case, a better heuristic could rely on 
> the similarity of book and paper titles or on co-author relationships. 
> Any further suggestions? Preferably in the form of PHP code ;-)

Agreed. Duncan McRae-Spencer from Southampton did some work on this <http://eprints.ecs.soton.ac.uk/12704/>. I'm sure there is plenty more out there.
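
Not PHP, I'm afraid, but here is a rough Python sketch of the kind of check you describe: accept a candidate author match only if the two records share a co-author name or have a sufficiently similar title. The function names, the use of difflib for string similarity, and the 0.6 threshold are all my own placeholders rather than anything the Book Mashup or DBLP actually does.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two titles, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_title_match(book_titles, paper_titles) -> float:
    """Highest pairwise similarity between any book title and any paper title."""
    return max(
        (title_similarity(b, p) for b in book_titles for p in paper_titles),
        default=0.0,
    )


def same_author(book_titles, paper_titles, book_coauthors, paper_coauthors,
                title_threshold=0.6) -> bool:
    """Guess whether a book author and a paper author are the same person.

    Accepts on any shared co-author name, or on a sufficiently similar title.
    The threshold is an arbitrary placeholder, not a tuned value.
    """
    shared = ({n.lower() for n in book_coauthors}
              & {n.lower() for n in paper_coauthors})
    if shared:
        return True
    return best_title_match(book_titles, paper_titles) >= title_threshold
```

With no shared co-authors and unrelated titles this returns False, which is the outcome we would want for the two Tom Heaths.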

> Sure, do you have a SPARQL endpoint for Revyu.com? Then the Book Mashup could query your site when it 
> generates a book description and integrate your reviews into the description.

Absolutely: <http://revyu.com/sparql/welcome>. This would be very cool. The challenge is to match books in the Book Mashup with things in Revyu.com, as the data in Revyu is deliberately very light touch. Perhaps a good place to start would be looking for occurrences of the title and the author strings in the rdfs:label of things that have been reviewed. Have a look at <http://revyu.com/things/running-with-scissors-by-augusten-burroughs> for an example. Things having the tag <http://revyu.com/tags/book> may also help narrow the field, as would the rdfs:seeAlso links to Amazon domains which are associated with some things (in fact in these cases we can parse out the ISBN from the URL). Then we just(!) have to decide on the matching threshold beyond which we assert owl:sameAs. (BTW, the underlying structure of the data isn't fully documented in the pages accompanying the SPARQL endpoint yet, so I'll get that updated on Monday.)
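
To make that a bit more concrete, here is a minimal Python sketch of the matching step, assuming the label, seeAlso links, and book details have already been fetched into plain dicts. The dict keys, the Amazon URL patterns in the regular expression, and the 0.85 threshold are assumptions on my part for illustration; the real seeAlso URLs may well look different.

```python
import re
from difflib import SequenceMatcher

# Assumed URL shapes (/dp/, /ASIN/, /gp/product/ followed by a 10-character
# ISBN); the rdfs:seeAlso links in Revyu may not follow any of these.
_ISBN_IN_URL = re.compile(r"/(?:dp|ASIN|gp/product)/(\d{9}[\dXx])(?:[/?]|$)")


def isbn_from_amazon_url(url):
    """Return the ISBN-10 embedded in an Amazon URL, or None if none is found."""
    m = _ISBN_IN_URL.search(url)
    return m.group(1).upper() if m else None


def label_score(label, title, author):
    """Score 0..1 for how well a label matches a title/author pair."""
    label_l, title_l, author_l = label.lower(), title.lower(), author.lower()
    if title_l in label_l and author_l in label_l:
        return 1.0  # both strings occur verbatim in the label
    return SequenceMatcher(None, label_l, f"{title_l} by {author_l}").ratio()


def is_match(thing, book, threshold=0.85):
    """Decide whether a reviewed thing and a book describe the same item.

    thing: {"label": str, "see_also": [url, ...]}    (hypothetical shape)
    book:  {"title": str, "author": str, "isbn": str} (hypothetical shape)
    """
    # A shared ISBN is the strongest evidence, so check that first.
    for url in thing.get("see_also", []):
        if isbn_from_amazon_url(url) == book["isbn"].upper():
            return True
    # Otherwise fall back to the label heuristic and the chosen threshold.
    return label_score(thing["label"], book["title"], book["author"]) >= threshold
```

A differing ISBN is deliberately not treated as a mismatch here, since different editions of the same book carry different ISBNs.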

On a final note, do you have any plans to do the same with non-book Amazon items? Things such as DVDs and CDs should have an EAN13 identifier, and I'd certainly be interested in trying to mash up information on those with Revyu reviews.
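
As an aside on EAN13 as a join key: a 10-digit ISBN maps onto an EAN-13 by prefixing 978 and recomputing the check digit, so books and other items could in principle share one identifier scheme. A small Python sketch of the standard check-digit arithmetic follows; the function names are mine.

```python
def ean13_check_digit(first12: str) -> int:
    """Check digit for the first 12 digits of an EAN-13 (weights 1,3,1,3,...)."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """True if code is 13 ASCII digits and its last digit is the right check digit."""
    return (len(code) == 13 and code.isascii() and code.isdigit()
            and ean13_check_digit(code[:12]) == int(code[12]))


def isbn10_to_ean13(isbn10: str) -> str:
    """Prefix 978, drop the ISBN-10 check character, recompute the check digit."""
    core = "978" + isbn10.replace("-", "")[:9]
    return core + str(ean13_check_digit(core))
```

Validating the check digit before using an identifier as a match key is a cheap way to filter out mistyped or truncated codes.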

Cheers,

Tom.

> -----Original Message-----
> From: Chris Bizer [mailto:chris@bizer.de] 
> Sent: 08 December 2006 08:39
> To: T.Heath; semantic-web@w3.org
> Subject: Re: Auto-generated owl:SameAs links between the RDF 
> Book Mashup and the DBLP database
> 
> 
> Hi Tom,
> 
> > [off-list reply, though happy to take it on list ;)]
> 
> Very interesting question, which should for sure be taken to the list.
> 
> >Hey Chris,
> >
> >This is cool, and obviously the start of what's going to be a hugely
> >important aspect of the SW.
> > However, I have a reservation about the heuristic you're using to
> > generate the owl:sameAs links, which primarily comes down to the
> > assumption that Amazon and DBLP cover sufficiently similar domains.
> >
> > Once the WikiSym2006 proceedings [1] get added to DBLP, I'll exist
> > uniquely in that database as the only Tom Heath. "Tom Heath" also
> > exists uniquely as an author on Amazon [2], but this is not me.
> > According to the current heuristic, the bookmashup would say that
> > <http://kmi.open.ac.uk/people/tom/uri> owl:sameAs
> > <http://that-tom-heath-on-amazon/>, which really isn't the case.
> >
> > I have no idea how common this situation would be, but I think a more
> > sophisticated approach is needed if we're going to avoid littering the
> > Semantic Web with sameAs links that don't stand up.
> >
> > Interesting stuff. What dya reckon? :)
> 
> I think that as the Semantic Web is moving from toy examples to real-world
> data sources, auto-generated links will become very important to glue
> instances in separate data sources together and to realize the Semantic Web
> as a single inter-linked information space instead of having separate data
> islands. I also think that the ability to have typed links between data
> sources at the instance level is one of the most important factors
> distinguishing the Semantic Web from the current Web 2.0 information
> ecosystem.
> 
> But yes, you are right, our heuristic is way too simple (which I think is OK
> for a first prototype).
> 
> Your question thus raises two interesting problems: better heuristics and
> trust.
> 
> There is lots of interesting work in the database community on object
> identification and duplicate detection which could be used to implement
> better heuristics. There was also a workshop on ontology matching at ISWC,
> http://www.om2006.ontologymatching.org/ (which I didn't attend), and I
> guess these guys should also have some good solutions in their drawers. Yes?
> 
> I think for our Book Mashup/DBLP use case, a better heuristic could rely on
> the similarity of book and paper titles or on co-author relationships.
> Any further suggestions? Preferably in the form of PHP code ;-)
> 
> Radek, a colleague of mine, developed SemMF, a semantic matching framework
> which might also be useful in this context
> http://sites.wiwiss.fu-berlin.de/suhl/radek/semmf/, though I think the
> framework assumes that all data to be matched is in a single repository,
> which might be unrealistic when you talk about huge data sources like Google
> Base or Amazon. Thus, a new requirement for matching techniques: identify a
> corresponding instance when you only have a SPARQL endpoint and have to
> avoid database dumps.
> 
> The other aspect triggered by your question is trust. We cannot expect all
> information on the Semantic Web to be true, nor all information providers to
> use sophisticated algorithms to set links. Thus our client tools have to be
> capable of dealing with different kinds of junk (including my owl:sameAs
> links).
> 
> We did some work on policy-based information filtering
> (http://sites.wiwiss.fu-berlin.de/suhl/bizer/wiqa/browser/index.htm) which
> might be useful in this context. For instance, one could imagine a client
> using a policy like "Trust the Book Mashup about books and reviews, but
> forget about its owl:sameAs links".
> 
> > Tom.
> >
> > PS. On a related note, I'd be really interested to try hooking up
> > Revyu.com book reviews [3] to the BookMashup. It would take some syntactic
> > parsing tricks, but using your RDF/XML would enable me to just use RAP to
> > handle the data, rather than rolling my own Amazon parser. Nice.
> 
> Sure, do you have a SPARQL endpoint for Revyu.com? Then the Book Mashup
> could query your site when it generates a book description and integrate
> your reviews into the description.
> 
> > PPS. As an aside to this issue of disambiguation, even reliable background
> > knowledge sources with which to disambiguate names may be hard to find. The
> > books on Amazon by the other Tom Heath are about "Crosby, Seaforth and
> > Waterloo", parts of Liverpool. I happened to live in Liverpool at the time
> > the books were published, and there is plenty of info on the web linking
> > me to Liverpool, so even a human being casually browsing could have a
> > guess that I wrote those books.
> 
> As I guess "Crosby, Seaforth and Waterloo" doesn't appear in the title of
> your paper, maybe comparing book and paper titles could be an OK heuristic
> for our use case.
> 
> That's what I like about publishing lots of real-world data on the Semantic
> Web instead of the usual toy examples: the data lets you discover the real
> problems and ask the real questions that we have to solve in order to make
> the Semantic Web work.
> 
> Cheers
> 
> Chris
> 
> 
> > [1] <http://portal.acm.org/toc.cfm?id=1149453&type=proceeding&coll=Portal&dl=ACM&CFID=8309673&CFTOKEN=96281626>
> > [2] <http://www.amazon.co.uk/exec/obidos/search-handle-url/203-4998033-8521554?%5Fencoding=UTF8&search-type=ss&index=books-uk&field-author=Tom%20Heath>
> > [3] http://revyu.com/tags/book
> >
> >
> > -----Original Message-----
> > From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
> > On Behalf Of Chris Bizer
> > Sent: 05 December 2006 11:09
> > To: semantic-web@w3.org
> > Subject: Auto-generated owl:SameAs links between the RDF Book Mashup
> > and the DBLP database
> >
> >
> >
> >
> > Hi,
> >
> > a central strength of the Semantic Web is that it allows you to set
> > links between information about the same object within multiple data
> > sources.
> >
> > Our RDF book mashup [1] generates RDF descriptions about books and
> > their authors. A second publicly available bibliographic data source
> > is the DBLP database containing journal articles and conference
> > papers. The DBLP database is published as linked data by a D2R Server
> > at http://www4.wiwiss.fu-berlin.de/dblp/.
> >
> > In order to demonstrate links between different data sources, we have
> > added another feature to the RDF book mashup: The mashup now
> > automatically generates owl:sameAs links between book authors and
> > paper authors in the DBLP database. Using Tabulator, these links allow
> > you to navigate from the description of the author of a book to his
> > papers in the DBLP database.
> >
> > The links are generated by asking the SPARQL endpoint of the DBLP
> > database for URIs identifying book authors. If the query for a
> > foaf:Person with a specific name returns only one result, then, as both
> > domains are related, we assume it is likely enough that we have
> > hit the right person to set the owl:sameAs link.
> >
> > An example of such an auto-generated owl:sameAs link is found in the
> > data about Tim Berners-Lee:
> > http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee
> >
> > <foaf:Person rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee">
> >    <owl:sameAs rdf:resource="http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007"/>
> >    <foaf:name>Tim Berners-Lee</foaf:name>
> > </foaf:Person>
> >
> >
> > Cheers,
> >
> > Chris
> >
> > [1] http://sites.wiwiss.fu-berlin.de/suhl/bizer/bookmashup/index.html
> >
> >
> > --
> > Chris Bizer
> > Freie Universität Berlin
> > Phone: +49 30 838 54057
> > Mail: chris@bizer.de
> > Web: www.bizer.de
> >
> >
> >
> >
> > 
> 
> 

Received on Friday, 8 December 2006 16:21:23 UTC