- From: Chris Bizer <chris@bizer.de>
- Date: Thu, 30 Jun 2011 22:34:13 +0200
- To: "'Ruben Verborgh'" <ruben.verborgh@ugent.be>
- Cc: "'public-lod'" <public-lod@w3.org>, "'Semantic Web'" <semantic-web@w3.org>, <semanticweb@yahoogroups.com>
Hi Ruben,

> Thanks for the fast and detailed reply, it's a very interesting discussion.
>
> Indeed, there are several ways for mapping and identity resolution.
> But what strikes me is that people in the community seem to be insufficiently aware
> of the possibilities and performance of current reasoners.

Possibly. But luckily we are today in a position to just give it a try.

So, an idea with my Semantic Web Challenge hat on: Why not take the Billion
Triples 2011 data set (http://challenge.semanticweb.org/), which consists of
2 billion triples recently crawled from the Web, try to find all data in the
dataset about authors and their publications, map this data to a single
target schema, and merge all duplicates?

Our current LDIF in-memory implementation is not capable of doing this, as
2 billion triples are too much data for it. But with the planned Hadoop-based
implementation we are hoping to get into this range.

It would be very interesting if somebody else would try to solve the task
above using a reasoner-based approach, so that we could then compare the
number of authors and publications identified as well as the duration of the
data integration process.

Anybody interested?

Cheers,

Chris

> As you can see, the data translation requires lots of structural
> transformations as well as complex property value transformations using
> various functions. All things that current logical formalisms are not very
> good at.

Oh yes, they are. All needed transformations in your paper can be performed
by at least two reasoners: cwm [1] and EYE [2], by using built-ins [3].
Included are regular expressions and datatype transforms. Frankly, every
transform in the R2R example can be expressed as an N3 rule (see the
illustrative sketch below the quoted message).

> If I as an application developer
> want to get a job done, what does it help me if I can exchange mappings
> between different tools that all don't get the job done?

Because different tools can contribute different results, and if you use a
common language and idiom, they can all work with the same data and metadata.

> more and more developers know SPARQL, which makes it easier for them to learn R2R.

The developers who know SPARQL are a proper subset of those who know plain
RDF, which is what I suggest using. And even if rules are necessary, N3 is
only a small extension of RDF.

> Benchmark we have the feeling that SPARQL engines are more suitable for
> this task than current reasoning engines, due to their performance problems
> as well as their problems dealing with inconsistent data.

The extremely solid performance [4] of EYE is too little known. It can achieve
in linear time things that other reasoners can never solve.

But my main point is semantics. Why make a new system with its own meanings
and interpretations, when there is so much to do with plain RDF and its widely
known vocabularies (RDFS, OWL)? Ironically, a tool that contributes to the
reconciliation of different RDF sources does not use common vocabularies to
express well-known relationships.

Cheers,

Ruben

[1] http://www.w3.org/2000/10/swap/doc/cwm.html
[2] http://eulersharp.sourceforge.net/
[3] http://www.w3.org/2000/10/swap/doc/CwmBuiltins
[4] http://eulersharp.sourceforge.net/2003/03swap/dtb-2010.txt

On 30 Jun 2011, at 10:51, Chris Bizer wrote:

> Hi Ruben,
>
> thank you for your detailed feedback.
>
> Of course it is always a question of taste how you prefer to express data
> translation rules, and I agree that simple mappings can also be expressed
> using standard OWL constructs.
>
> When designing the R2R mapping language, we first analyzed the real-world
> requirements that arise if you try to properly integrate data from existing
> Linked Data on the Web. We summarize our findings in Section 5 of the
> following paper:
> http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/BizerSchultz-COLD-R2R-Paper.pdf
> As you can see, the data translation requires lots of structural
> transformations as well as complex property value transformations using
> various functions. All things that current logical formalisms are not very
> good at.
>
> Other reasons why we chose to base the mapping language on SPARQL were that:
>
> 1. more and more developers know SPARQL, which makes it easier for them to
> learn R2R.
> 2. we want to be able to translate large amounts (billions of triples in the
> mid-term) of messy, inconsistent Web data, and from our experience with the
> BSBM Benchmark we have the feeling that SPARQL engines are more suitable for
> this task than current reasoning engines, due to their performance problems
> as well as their problems dealing with inconsistent data.
>
> I disagree with you that R2R mappings are not suitable for being exchanged
> on the Web. On the contrary, they were specifically designed to be published
> and discovered on the Web, and they allow partial mappings from different
> sources to be easily combined (see the paper above for details).
>
> I think your argument about the portability of mappings between different
> tools is currently only partially valid. If I as an application developer
> want to get a job done, what does it help me if I can exchange mappings
> between different tools that all don't get the job done?
>
> Also note that with LDIF we aim to provide identity resolution in addition
> to schema mapping. It is well known that identity resolution in practical
> settings requires rather complex matching heuristics (see the Silk papers
> for details about the different matchers that are usually employed), and
> identity resolution is again a topic where reasoning engines don't have
> much to offer.
>
> But again, there are different ways and tastes for expressing mapping rules
> and identity resolution heuristics. R2R and Silk LSL are our approaches to
> getting the job done. We are of course happy if other people provide working
> solutions for the task of integrating and cleansing messy data from the Web
> of Linked Data, and we are happy to compare our approach with theirs.
>
> Cheers,
>
> Chris
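For illustration, a minimal N3 sketch of the kind of rule referred to above:
a structural mapping plus a property value transformation. The ex: source
vocabulary is hypothetical; string:concatenation is one of the cwm/EYE
built-ins listed in [3].

  @prefix string: <http://www.w3.org/2000/10/swap/string#>.
  @prefix foaf:   <http://xmlns.com/foaf/0.1/>.
  @prefix dc:     <http://purl.org/dc/elements/1.1/>.
  @prefix ex:     <http://example.org/source#>.   # hypothetical source vocabulary

  # Structural mapping: rename a class and a property.
  { ?p a ex:Author. }    => { ?p a foaf:Person. }.
  { ?doc ex:title ?t. }  => { ?doc dc:title ?t. }.

  # Property value transformation: build a single foaf:name from two
  # source properties with the string:concatenation built-in.
  { ?p ex:firstName ?first.
    ?p ex:lastName ?last.
    ( ?first " " ?last ) string:concatenation ?full. }
  => { ?p foaf:name ?full. }.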
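Similarly, a deliberately naive sketch of identity resolution ("merge all
duplicates") as a rule: two author resources that share the same mailbox are
linked with owl:sameAs (foaf:mbox is inverse-functional in FOAF). The
weighted, multi-metric heuristics used by Silk LSL go well beyond this.

  @prefix owl:  <http://www.w3.org/2002/07/owl#>.
  @prefix foaf: <http://xmlns.com/foaf/0.1/>.
  @prefix log:  <http://www.w3.org/2000/10/swap/log#>.

  # Naive duplicate detection: shared mailbox => same author.
  { ?a foaf:mbox ?m.
    ?b foaf:mbox ?m.
    ?a log:notEqualTo ?b. }
  => { ?a owl:sameAs ?b. }.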
Received on Thursday, 30 June 2011 20:34:55 UTC