- From: Chris Bizer <chris@bizer.de>
- Date: Fri, 8 Dec 2006 09:38:53 +0100
- To: "T.Heath" <T.Heath@open.ac.uk>, <semantic-web@w3.org>
Hi Tom,

> [off-list reply, though happy to take it on list ;)]

Very interesting question, which should for sure be taken to the list.

> Hey Chris,
>
> This is cool, and obviously the start of what's going to be a hugely
> important aspect of the SW.
>
> However, I have a reservation about the heuristic you're using to generate
> the owl:sameAs links, which primarily comes down to the assumption that
> Amazon and DBLP cover sufficiently similar domains.
>
> Once the WikiSym2006 proceedings [1] get added to DBLP, I'll exist uniquely
> in that database as the only Tom Heath. "Tom Heath" also exists uniquely as
> an author on Amazon [2], but this is not me. According to the current
> heuristic, the bookmashup would say that
> <http://kmi.open.ac.uk/people/tom/uri> owl:sameAs
> <http://that-tom-heath-on-amazon/>, which really isn't the case.
>
> I have no idea how common this situation would be, but I think a more
> sophisticated approach is needed if we're going to avoid littering the
> Semantic Web with sameAs links that don't stand up.
>
> Interesting stuff. What dya reckon? :)

I think that as the Semantic Web moves from toy examples to real-world data
sources, auto-generated links will become very important to glue instances in
separate data sources together and to realize the Semantic Web as a single
inter-linked information space instead of a set of separate data islands. I
also think that the ability to have typed links between data sources at the
instance level is one of the most important factors distinguishing the
Semantic Web from the current Web 2.0 information ecosystem.

But yes, you are right, our heuristic is way too simple (which I think is OK
for a first prototype). Your question thus raises two interesting problems:
better heuristics and trust.

There is lots of interesting work in the database community on object
identification and duplicate detection which could be leveraged to implement
better heuristics.
There was a workshop on ontology matching at ISWC
(http://www.om2006.ontologymatching.org/) which I didn't visit, and I guess
these people should also have some good solutions in their drawers. Yes?

I think for our Book Mashup/DBLP use case, a better heuristic could rely on
the similarity of book and paper titles or on co-author relationships. Any
further suggestions? Preferably in the form of PHP code ;-)

Radek, a colleague of mine, developed SemMF, a semantic matching framework
which might also be useful in this context
(http://sites.wiwiss.fu-berlin.de/suhl/radek/semmf/), though I think the
framework assumes that all the data to be matched is in a single repository,
which might be unrealistic when you talk about huge data sources like Google
Base or Amazon. Thus, a new requirement for matching techniques: identify a
corresponding instance when you only have a SPARQL endpoint and have to
avoid database dumps.

The other aspect raised by your question is trust. We cannot expect all
information on the Semantic Web to be true, nor all information providers to
use sophisticated algorithms to set links. Thus our client tools have to be
capable of dealing with different kinds of junk (including my owl:sameAs
links). We did some work on policy-based information filtering
(http://sites.wiwiss.fu-berlin.de/suhl/bizer/wiqa/browser/index.htm) which
might be useful in this context. For instance, one could imagine a client
using a policy like "Trust the Book Mashup about books and reviews, but
forget about its owl:sameAs links".

> Tom.
>
> PS. On a related note, I'd be really interested to try hooking up
> Revyu.com book reviews [3] to the BookMashup. It would take some syntactic
> parsing tricks, but using your RDF/XML would enable me to just use RAP to
> handle the data, rather than rolling my own Amazon parser. Nice.

Sure, do you have a SPARQL endpoint for Revyu.com?
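A title-similarity heuristic of the kind I have in mind could be sketched
roughly like this (Python rather than the requested PHP; the function names
and the 0.85 threshold are my own assumptions, not anything the Book Mashup
implements):

```python
import difflib

def normalize(title):
    """Lowercase and strip punctuation so formatting differences don't matter."""
    return "".join(ch for ch in title.lower()
                   if ch.isalnum() or ch.isspace()).split()

def title_similarity(a, b):
    """Similarity in [0, 1] between two normalized title token sequences."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def plausible_same_author(book_titles, paper_titles, threshold=0.85):
    """Accept a name match only if some book title closely resembles
    some paper title by the candidate author."""
    return any(title_similarity(b, p) >= threshold
               for b in book_titles for p in paper_titles)
```

Under such a policy, the other Tom Heath's "Crosby, Seaforth and Waterloo"
would score far below any threshold against your Wikipedia/wiki papers, so no
owl:sameAs link would be set.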
Then the Book Mashup could query your site when it generates a book
description and integrate your reviews into the description.

> PPS. As an aside to this issue of disambiguation, even reliable background
> knowledge sources with which to disambiguate names may be hard to find. The
> books on Amazon by the other Tom Heath are about "Crosby, Seaforth and
> Waterloo", parts of Liverpool. I happened to live in Liverpool at the time
> the books were published, and there is plenty of info on the web linking me
> to Liverpool, so even a human being casually browsing could have a guess
> that I wrote those books.

As I guess "Crosby, Seaforth and Waterloo" doesn't appear in the titles of
your papers, maybe comparing book and paper titles could be an OK heuristic
for our use case.

That's what I like about publishing lots of real-world data on the Semantic
Web instead of the usual toy examples: the data lets you discover the real
problems and ask the real questions that we have to solve in order to make
the Semantic Web work.

Cheers,

Chris

> [1]
> <http://portal.acm.org/toc.cfm?id=1149453&type=proceeding&coll=Portal&dl=ACM&CFID=8309673&CFTOKEN=96281626>
> [2]
> <http://www.amazon.co.uk/exec/obidos/search-handle-url/203-4998033-8521554?%5Fencoding=UTF8&search-type=ss&index=books-uk&field-author=Tom%20Heath>
> [3] http://revyu.com/tags/book
>
> -----Original Message-----
> From: semantic-web-request@w3.org
> [mailto:semantic-web-request@w3.org] On Behalf Of Chris Bizer
> Sent: 05 December 2006 11:09
> To: semantic-web@w3.org
> Subject: Auto-generated owl:SameAs links between the RDF Book
> Mashup and the DBLP database
>
> Hi,
>
> a central strength of the Semantic Web is that it allows you to set links
> between information about the same object within multiple data sources.
>
> Our RDF book mashup [1] generates RDF descriptions about books and their
> authors.
> A second publicly available bibliographic data source is the DBLP database,
> containing journal articles and conference papers. The DBLP database is
> published as linked data by a D2R Server at
> http://www4.wiwiss.fu-berlin.de/dblp/.
>
> In order to demonstrate links between different data sources, we have added
> another feature to the RDF book mashup: the mashup now automatically
> generates owl:sameAs links between book authors and paper authors in the
> DBLP database. Using Tabulator, these links allow you to navigate from the
> description of the author of a book to his papers in the DBLP database.
>
> The links are generated by asking the SPARQL endpoint of the DBLP database
> for URIs identifying book authors. If the query for a foaf:Person with a
> specific name returns only one result, and as both domains are related, we
> assume that it is likely enough that we have hit the right person to set
> the owl:sameAs link.
>
> An example of such an auto-generated owl:sameAs link is found in the data
> about Tim Berners-Lee:
> http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee
>
> <foaf:Person rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Tim+Berners-Lee">
>   <owl:sameAs rdf:resource="http://www4.wiwiss.fu-berlin.de/dblp/resource/person/100007"/>
>   <foaf:name>Tim Berners-Lee</foaf:name>
> </foaf:Person>
>
> Cheers,
>
> Chris
>
> [1] http://sites.wiwiss.fu-berlin.de/suhl/bizer/bookmashup/index.html
>
> --
> Chris Bizer
> Freie Universität Berlin
> Phone: +49 30 838 54057
> Mail: chris@bizer.de
> Web: www.bizer.de
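The link-generation step described in the quoted announcement could be
sketched roughly as follows (a Python sketch; only the foaf:name lookup and
the "exactly one result" rule come from the announcement, while the helper
names and query shape are my own assumptions):

```python
def author_query(name):
    """SPARQL query asking DBLP for persons with exactly this foaf:name."""
    return (
        'PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n'
        'SELECT ?person WHERE { ?person foaf:name "%s" . }' % name
    )

def same_as_link(author_uri, candidate_uris):
    """Emit an owl:sameAs triple only when the name match is unambiguous,
    i.e. the SPARQL query returned exactly one candidate URI."""
    if len(candidate_uris) == 1:
        return "<%s> owl:sameAs <%s> ." % (author_uri, candidate_uris[0])
    return None  # zero or several hits: too risky to set the link
```

As Tom's example shows, a unique name hit in each of two data sets is still
no guarantee that both URIs denote the same person, which is exactly where a
stronger heuristic (title similarity, co-authors) would slot in before
emitting the triple.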
Received on Friday, 8 December 2006 08:39:04 UTC