Re: ANN: LDIF - Linked Data Integration Framework V0.1 released.

Hi,
   I thought I could share some remarks on the topic. First of all, well 
done on the release of LDIF; it's an interesting piece of work and 
it's dearly needed. I have started to release a bit of my work too, although 
it's at a very early stage (https://github.com/correndo/mediation).

On 6/30/11 10:49 AM, Ruben Verborgh wrote:
> Hi Chris,
>
> Thanks for the fast and detailed reply, it's a very interesting discussion.
>
> Indeed, there are several ways for mapping and identity resolution.
> But what strikes me is that people in the community seem to be insufficiently aware of the possibilities and performance of current reasoners.
About identity resolution.
Silk is a nice framework for discovering identity equivalences, 
although I think that to make use of such equivalences a more 
distributed approach should be preferred. An approach where the links 
among entities are discovered (no matter with what tool) and *shared* 
would be more organic to an architecture of distributed data publishing.
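As a sketch of what I mean by sharing, discovered links could be published both as raw owl:sameAs statements and as a VoID linkset describing them (the prefixes, dataset URIs and the BRCA1 resources below are hypothetical, for illustration only):

:genes2smw a void:Linkset ;
     void:linkPredicate owl:sameAs ;
     void:subjectsTarget :genesDataset ;
     void:objectsTarget :smwDataset .

genes:BRCA1 owl:sameAs smwcat:BRCA1 .

Any tool could then discover and consume such a linkset, independently of the matcher that produced it.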

About reasoners.
I guess on this issue one could distinguish where a given reasoner is 
applied. Within Linked Data, where the amount of data is assumed to be 
huge, applying a reasoner is usually considered impractical. 
Reasoners just don't scale as well as one would like, although some triple 
stores are achieving good performance (OWLIM, 4sr and others).
>
>> As you can see the data translation requires lots of structural
>> transformations as well as complex property value transformations using
>> various functions. All things where current logical formalisms are not very
>> good at.
>
> Oh yes, they are. All needed transformations in your paper can be performed by at least two reasoners: cwm [1] and EYE [2] by using built-ins [3]. Included are regular expressions, datatype transforms…
> Frankly, every transform in the R2R example can be expressed as an N3 rule.
Logic formalisms can be applied to structural data transformation, 
although it sounds like overkill. I think the real issue here is 
finding the right tool for the right job. If we have heavyweight 
ontologies that differ conceptually from one another, then a reasoner is the 
right tool. But what if we're dealing with different data schemas 
that don't require complex reasoning?

There are, I think, two different levels that can be aligned by two 
different formalisms: RDF and OWL.
Aligning RDF graphs is something that has little to do with 
description logics; the semantics is inscribed in the structure, and 
structural alignments are therefore called for. A preliminary work I 
published in [1] was based on graph rewriting; it handles query 
rewriting and was conceived as a lightweight approach (schema alignment).

On the use of pattern literals: it amounts to using RDF to describe 
a string whose content's semantics is defined elsewhere. It just doesn't 
sound right; but then again, even using RDF and reification to describe at 
least the basic graph patterns [1] doesn't solve the problem of 
semantic elicitation. The interpretation of an alignment is still 
relative to a particular tool.
So, instead of writing literals like this:

mp:Gene
     r2r:sourcePattern "?SUBJ a genes:gene";
     r2r:targetPattern "?SUBJ a smwcat:Gene".

I would have written a chunk of RDF pattern graph like this:

mediation:lhs [ a rdf:Statement ; rdf:subject _:SUBJ ; rdf:predicate rdf:type ; rdf:object genes:gene ] .
mediation:rhs [ a rdf:Statement ; rdf:subject _:SUBJ ; rdf:predicate rdf:type ; rdf:object smwcat:Gene ] .
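For comparison, the same mapping expressed as an N3 rule (along the lines Ruben suggests) is quite compact; a minimal sketch, assuming the genes: and smwcat: prefixes are declared:

{ ?s a genes:gene . } => { ?s a smwcat:Gene . } .

The trade-off is that the rule's semantics is only fixed for N3-aware reasoners, whereas the reified pattern graph above stays within plain RDF.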


For aligning OWL ontologies there have been a number of proposals, EDOAL 
[2] and C-OWL [3] to name a few, not counting the already mentioned 
properties defined in OWL (owl:sameAs, owl:equivalentProperty, 
owl:equivalentClass). The question for any OWL alignment 
formalism is rather to find different profiles of complexity that fit 
different application cases.
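For the simple cases, the OWL properties mentioned above already suffice; a sketch (the genes:name and smwcat:label properties are hypothetical, chosen only to illustrate each construct):

genes:gene owl:equivalentClass smwcat:Gene .
genes:name owl:equivalentProperty smwcat:label .

Richer formalisms such as EDOAL [2] become necessary only when alignments involve value transformations or complex class expressions.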

[1] http://eprints.ecs.soton.ac.uk/18370/
[2] http://alignapi.gforge.inria.fr/edoal.html
[3] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.9326&rep=rep1&type=pdf


>
>> If I as a application developer
>> want to get a job done, what does it help me if I can exchange mappings
>> between different tools that all don't get the job done?
>
> Because different tools can contribute different results, and if you use a common language and idiom, they all can work with the same data and metadata.
>
>> more and more developers know SPARQL which makes it easier for them to learn R2R.
>
> The developers that know SPARQL is a proper subset of those that know plain RDF, which is what I suggest using. And even if rules are necessary, N3 is only a small extension of RDF.
>
>> Benchmark we have the feeling that SPARQL engines are more suitable for
>> this task then current reasoning engines due to their performance problems
>> as well as problems to deal with inconsistent data.
>
> The extremely solid performance [4] of EYE is too little known. It can achieve things in linear time that other reasoners can never solve.
>
> But my main point is semantics. Why make a new system with its own meanings and interpretations, when there is so much to do with plain RDF and its widely known vocabularies (RDFS, OWL)?
> Ironically, a tool which contributes to the reconciliation of different RDF sources, does not use common vocabularies to express well-known relationships.
>
> Cheers,
>
> Ruben
>
> [1] http://www.w3.org/2000/10/swap/doc/cwm.html
> [2] http://eulersharp.sourceforge.net/
> [3] http://www.w3.org/2000/10/swap/doc/CwmBuiltins
> [4] http://eulersharp.sourceforge.net/2003/03swap/dtb-2010.txt
>
> On 30 Jun 2011, at 10:51, Chris Bizer wrote:
>
>> Hi Ruben,
>>
>> thank you for your detailed feedback.
>>
>> Of course it is always a question of taste how you prefer to express data
>> translation rules and I agree that simple mappings can also be expressed
>> using standard OWL constructs.
>>
>> When designing the R2R mapping language, we first analyzed the real-world
>> requirements that arise if you try to properly integrate data from existing
>> Linked Data on the Web. We summarize our findings in Section 5 of the
>> following paper
>> http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/
>> BizerSchultz-COLD-R2R-Paper.pdf
>> As you can see the data translation requires lots of structural
>> transformations as well as complex property value transformations using
>> various functions. All things where current logical formalisms are not very
>> good at.
>>
>> Others reasons why we choose to base the mapping language on SPARQL where
>> that:
>>
>> 1. more and more developers know SPARQL which makes it easier for them to
>> learn R2R.
>> 2. we to be able to translate large amounts (billions of triples in the
>> mid-term) of messy inconsistent Web data and from our experience with the
>> BSBM Benchmark we have the feeling that SPARQL engines are more suitable for
>> this task then current reasoning engines due to their performance problems
>> as well as problems to deal with inconsistent data.
>>
>> I disagree with you that R2R mappings are not suitable for being exchanged
>> on the Web. In contrast they were especially designed for being published
>> and discovered on the Web and allow partial mappings from different sources
>> to be easily combined (see paper above for details about this).
>>
>> I think your argument about the portability of mappings between different
>> tools currently is only partially valid. If I as a application developer
>> want to get a job done, what does it help me if I can exchange mappings
>> between different tools that all don't get the job done?
>>
>> Also note, that we aim with LDIF to provide for identity resolution in
>> addition to schema mapping. It is well known that identity resolution in
>> practical setting requires rather complex matching heuristics (see Silk
>> papers for details about different matchers that are usually employed) and
>> identity resolution is again a topic where reasoning engines don't have too
>> much to offer.
>>
>> But again, there are different ways and tastes about how to express mapping
>> rules and identity resolution heuristics. R2R and Silk LSL are our
>> approaches to getting the job done and we are of course happy if other
>> people provide working solutions for the task of integrating and cleansing
>> messy data from the Web of Linked Data and are happy to compare our approach
>> with theirs.
>>
>> Cheers,
>>
>> Chris
>
>


-- 
******************************************
  Gianluca Correndo
  Research fellow IAM group
  Electronic and Computer Science
  University of Southampton
******************************************

Received on Friday, 1 July 2011 08:33:47 UTC