- From: Ruben Verborgh <ruben.verborgh@ugent.be>
- Date: Fri, 1 Jul 2011 07:49:15 +0200
- To: Chris Bizer <chris@bizer.de>
- Cc: "'public-lod'" <public-lod@w3.org>, "'Semantic Web'" <semantic-web@w3.org>, <semanticweb@yahoogroups.com>
Hi Chris,

Sounds like a challenge indeed :) Thanks for bringing this to my attention.

While we have a lot of experience with reasoning, we have never tried to go to the billions. I contacted Jos De Roo, the author of the EYE reasoner, to see what would be possible. I think we might at least be able to perform some interesting stuff.

Note however that performance is a separate issue from what I was saying before. No matter how well the LDIF Hadoop implementation performs (and I am curious to find out!), for me, it doesn't justify creating a whole new semantics. The important thing here is that the R2R patterns can be generated from regular RDFS and OWL constructs (because these have a well-defined meaning!), while the other way round is difficult, and impossible in general.

If your (or anyone else's) software needs a different representation, why not create it from RDF documents that use those Semantic Web foundations, instead of forcing people to write those instructions? Reuse is so important in our community, and while software will someday be able to bring a lot of data together, humans will always be responsible for getting things right at the very base.

Cheers,

Ruben

On 30 Jun 2011, at 22:34, Chris Bizer wrote:

> Hi Ruben,
>
>> Thanks for the fast and detailed reply, it's a very interesting discussion.
>>
>> Indeed, there are several ways for mapping and identity resolution.
>> But what strikes me is that people in the community seem to be insufficiently aware
>> of the possibilities and performance of current reasoners.
>
> Possibly. But luckily we are today in the position to just give it a try.
>
> So an idea with my Semantic Web Challenge hat on:
>
> Why not take the Billion Triples 2011 data set
> (http://challenge.semanticweb.org/), which consists of 2 billion triples that
> have recently been crawled from the Web, and try to find all data in the
> dataset about authors and their publications, map this data to a single
> target schema, and merge all duplicates?
>
> Our current LDIF in-memory implementation is not capable of doing this, as 2
> billion triples are too much data for it. But with the planned Hadoop-based
> implementation we are hoping to get into this range.
>
> It would be very interesting if somebody else tried to solve the task
> above using a reasoner-based approach; we could then compare the number
> of authors and publications identified as well as the duration of the data
> integration process.
>
> Anybody interested?
>
> Cheers,
>
> Chris
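
To make Ruben's point about deriving mappings from standard constructs concrete, here is a minimal sketch in Python with rdflib. It does not use actual R2R syntax, and the vocabulary names are hypothetical; it only illustrates that plain rdfs:subPropertyOf and owl:equivalentClass statements carry enough meaning to generate schema-translation rules mechanically, whereas recovering those well-defined semantics from arbitrary hand-written mapping rules would not be mechanical.

    # Sketch: derive SPARQL CONSTRUCT "mapping rules" from standard RDFS/OWL
    # statements. Not R2R syntax; vocabulary names below are hypothetical.
    from rdflib import Graph
    from rdflib.namespace import OWL, RDFS

    ALIGNMENT_TTL = """
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <http://example.org/schema#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .

    ex:authorName rdfs:subPropertyOf foaf:name .
    ex:Writer     owl:equivalentClass foaf:Person .
    """

    def construct_rules(graph: Graph) -> list:
        """Turn rdfs:subPropertyOf and owl:equivalentClass statements into
        SPARQL CONSTRUCT queries that rewrite source data into target terms."""
        rules = []
        for sub, sup in graph.subject_objects(RDFS.subPropertyOf):
            rules.append(
                f"CONSTRUCT {{ ?s {sup.n3()} ?o }} WHERE {{ ?s {sub.n3()} ?o }}"
            )
        # equivalentClass is symmetric; only one direction is shown for brevity.
        for left, right in graph.subject_objects(OWL.equivalentClass):
            rules.append(
                f"CONSTRUCT {{ ?s a {right.n3()} }} WHERE {{ ?s a {left.n3()} }}"
            )
        return rules

    if __name__ == "__main__":
        g = Graph()
        g.parse(data=ALIGNMENT_TTL, format="turtle")
        for rule in construct_rules(g):
            print(rule)

Running the sketch prints one CONSTRUCT query per alignment statement, which is the kind of output a tool-specific mapping format could be generated from, rather than asking people to author such rules by hand.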
Received on Friday, 1 July 2011 05:49:45 UTC