Re: ANN: LDIF - Linked Data Integration Framework V0.1 released. from Chris Bizer on 2011-06-30 (semantic-web@w3.org from June 2011)

From: Chris Bizer <chris@bizer.de>
Date: Thu, 30 Jun 2011 10:51:54 +0200
To: "'Ruben Verborgh'" <ruben.verborgh@ugent.be>
Cc: "'public-lod'" <public-lod@w3.org>, "'Semantic Web'" <semantic-web@w3.org>, <semanticweb@yahoogroups.com>
Message-ID: <005b01cc3702$f6993600$e3cba200$@de>
Hi Ruben,

thank you for your detailed feedback.

Of course it is always a question of taste how you prefer to express data
translation rules and I agree that simple mappings can also be expressed
using standard OWL constructs.

When designing the R2R mapping language, we first analyzed the real-world
requirements that arise if you try to properly integrate data from existing
Linked Data sources on the Web. We summarize our findings in Section 5 of
the following paper:
http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/BizerSchultz-COLD-R2R-Paper.pdf
As you can see, data translation requires lots of structural
transformations as well as complex property value transformations using
various functions. These are all things that current logical formalisms are
not very good at.

Other reasons why we chose to base the mapping language on SPARQL were
that:

1. more and more developers know SPARQL, which makes it easier for them to
learn R2R.
2. we need to be able to translate large amounts (billions of triples in the
mid-term) of messy, inconsistent Web data, and from our experience with the
BSBM Benchmark we have the feeling that SPARQL engines are more suitable for
this task than current reasoning engines, which have performance problems
as well as problems dealing with inconsistent data.

I disagree with you that R2R mappings are not suitable for being exchanged
on the Web. On the contrary, they were designed specifically to be published
and discovered on the Web and allow partial mappings from different sources
to be easily combined (see paper above for details about this).

I think your argument about the portability of mappings between different
tools is currently only partially valid. If I as an application developer
want to get a job done, how does it help me if I can exchange mappings
between different tools that all don't get the job done?

Also note that with LDIF we aim to provide identity resolution in
addition to schema mapping. It is well known that identity resolution in
practical settings requires rather complex matching heuristics (see the Silk
papers for details about the different matchers that are usually employed),
and identity resolution is again a topic where reasoning engines don't have
too much to offer.

But again, there are different ways and tastes about how to express mapping
rules and identity resolution heuristics. R2R and Silk LSL are our
approaches to getting the job done and we are of course happy if other
people provide working solutions for the task of integrating and cleansing
messy data from the Web of Linked Data and are happy to compare our approach
with theirs.

Cheers,

Chris


-----Original Message-----
From: Ruben Verborgh [mailto:ruben.verborgh@ugent.be] 
Sent: Thursday, 30 June 2011 10:04
To: Chris Bizer
Cc: 'public-lod'; 'Semantic Web'; semanticweb@yahoogroups.com
Subject: Re: ANN: LDIF - Linked Data Integration Framework V0.1 released.

Hi Chris,

I've taken a look at your work and it is certainly interesting.

However, I have a couple of questions regarding the approach you have
taken.
The example [1] shows that we need to create a specific mapping. But can we
call this "semantic"?
It is a configuration file which can only be understood by a specific tool.
It could as well have been XML or another format.
Why not choose to express the same things using existing, semantic
predicates, which can be understood by different tools and express actual
knowledge?
And why not rely on existing ontologies that express relations semantically,
and reuse portable knowledge?
Example:

mp:Gene
    r2r:sourcePattern "?SUBJ a genes:gene";
    r2r:targetPattern "?SUBJ a smwcat:Gene".

could be

genes:gene owl:sameAs smwcat:Gene.

Not only does this have universally accepted semantics, it is also portable
to different situations. For example:
_:specializedGene rdfs:subClassOf genes:gene.


Another thing is that I do not agree with the pattern literals.
If we take a look at such a pattern:

"?SUBJ a genes:gene",

we see there are a lot of implicit things here.
First, the prefix needs to be looked up using the r2r:prefixDefinitions
predicate. So a specific syntax (Turtle prefixes) is tied to a conceptual
model. I can imagine a lot of problems here. Clearly, r2r:prefixDefinitions
is some kind of functional property. But when are two prefixDefinitions the
same? Exact string comparison is not the answer.
But the bigger problem I'm having is with the variables. With the ?SUBJ
notation, you seem to add implicit support for universal quantification.
This last sentence clarifies the big issue: "implicit". Variables are
placeholders identified by a certain name in a certain scope, but the name
itself is unimportant.

Concretely, "?SUBJ a genes:gene" should mean the same as "?s a genes:gene".
Except that it doesn't.
Because now, "?SUBJ a smwcat:Gene" is no longer meaningful. (Similar to the
above, how to define equality?)
And okay, you can argue that the scope is not the string, but the RDF
document.
But what if I put the second statement in a different document? It's RDF,
right, or is this an application-specific configuration file?
And okay, we can say that the scope is the entity it belongs to. But then we
have a major problem:

mp:GeneID
  r2r:mappingRef mp:Gene;
  r2r:sourcePattern "?SUBJ genes:GeneId ?x";
  r2r:targetPattern "?SUBJ smwprop:KeggGeneId ?'x'^^xsd:string";

GeneID also uses the ?SUBJ variable, but also has a relationship with Gene.
This puts them in the same scope. But clearly, the ?SUBJ from Gene and the
?SUBJ from GeneID should be different. This is a serious problem, which
cannot be solved rigorously, so the semantics will remain messy, since
variables and scope are not formally defined.
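
The scoping question can be made concrete with a small Python sketch. It is
only an illustration, not part of R2R: the helper names (canonicalize,
canonicalize_rule) are hypothetical, and the regular expression covers only
plain ?name variables, not R2R modifiers such as ?'x'^^xsd:string. It
assumes the scope of a variable is a single mapping (source plus target
pattern), which is one possible answer to the question, not a confirmed one.

```python
import re

# Simplified variable syntax: "?" followed by an identifier.
VAR = re.compile(r"\?[A-Za-z_]\w*")


def canonicalize(pattern, names=None):
    """Rename variables to ?v0, ?v1, ... in order of first appearance.

    Passing the same `names` table to several calls puts those
    patterns in one shared variable scope.
    """
    names = {} if names is None else names

    def rename(match):
        return names.setdefault(match.group(0), f"?v{len(names)}")

    return VAR.sub(rename, pattern)


def canonicalize_rule(source, target):
    """Canonicalize one mapping, treating source and target as one scope."""
    names = {}
    return canonicalize(source, names), canonicalize(target, names)


# Two patterns that differ only in the variable name compare equal.
print(canonicalize("?SUBJ a genes:gene") == canonicalize("?s a genes:gene"))

# Renaming consistently inside one mapping keeps the rule the same ...
print(canonicalize_rule("?SUBJ a genes:gene", "?SUBJ a smwcat:Gene")
      == canonicalize_rule("?s a genes:gene", "?s a smwcat:Gene"))

# ... while renaming only one side breaks the link between the patterns.
print(canonicalize_rule("?s a genes:gene", "?SUBJ a smwcat:Gene"))
```

Under this assumption each mapping gets its own renaming table, so the ?SUBJ
of one mapping never collides with the ?SUBJ of another.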

You can invalidate my arguments by saying that this RDF document is only
meant for a specific purpose etc. But why use RDF then, which is all about
portable semantics? See my question at the top of this e-mail.


As a solution, I would propose a W3C team submission which deals with
quantification properly: Notation3 [2]. They really got quantification
right. Look how much more semantic (and thus portable!) things become:

mp:hasPathway
	a r2r:PropertyMapping;
	r2r:mappingRef    	mp:Gene;
	r2r:sourcePattern 	"?SUBJ genes:hasPathway ?x";
	r2r:targetPattern	"?SUBJ smwprop:IsInvolvedIn ?x . ?x smwprop:Involves ?SUBJ";

becomes

{
  ?s genes:hasPathway ?x.
}
=>
{
  ?s smwprop:IsInvolvedIn ?x.
  ?x smwprop:Involves ?s.
}.
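
To show what the N3 rule above produces, here is a minimal Python sketch
that applies it by hand to a list of triples. It is not an N3 reasoner: the
function name apply_pathway_rule and the ex: resources are hypothetical, and
triples are modelled as plain tuples of prefixed names.

```python
def apply_pathway_rule(triples):
    """Derive the rule's consequences for every genes:hasPathway triple."""
    derived = set()
    for s, p, x in triples:
        if p == "genes:hasPathway":
            # Head of the rule: both triples share the bindings of ?s and ?x.
            derived.add((s, "smwprop:IsInvolvedIn", x))
            derived.add((x, "smwprop:Involves", s))
    return derived


data = [
    ("ex:gene1", "genes:hasPathway", "ex:pathway7"),  # matches the rule body
    ("ex:gene1", "rdf:type", "genes:gene"),           # ignored by this rule
]

for triple in sorted(apply_pathway_rule(data)):
    print(triple)
```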

Note how the variables now have proper scope and meaning. But even
quantification isn't necessary here:

genes:hasPathway rdfs:subPropertyOf smwprop:IsInvolvedIn.
genes:hasPathway rdfs:subPropertyOf smwprop:Involves.

This exactly matches the definition of a subproperty [3]: "If a property P'
is a super-property of a property P, then all pairs of resources which are
related by P are also related by P'."
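
The quoted definition corresponds to RDFS entailment rule rdfs7, which can
be sketched in a few lines of Python. This is a single-step illustration
only (no transitive closure over rdfs:subPropertyOf); the function name
rdfs7 and the ex: resources are hypothetical.

```python
def rdfs7(triples, sub_property_of):
    """If P rdfs:subPropertyOf P', then (s, P, o) entails (s, P', o)."""
    derived = set()
    for s, p, o in triples:
        for super_p in sub_property_of.get(p, ()):
            # Subject and object keep their positions.
            derived.add((s, super_p, o))
    return derived


sub_property_of = {
    "genes:hasPathway": ["smwprop:IsInvolvedIn", "smwprop:Involves"],
}
data = [("ex:gene1", "genes:hasPathway", "ex:pathway7")]

for triple in sorted(rdfs7(data, sub_property_of)):
    print(triple)
```

Because rdfs7 preserves the subject/object order, this sketch derives
(ex:gene1, smwprop:Involves, ex:pathway7), whereas the N3 rule above puts
the pathway in subject position for smwprop:Involves.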

The major benefit of this is that everything can happen by general-purpose
Semantic Web reasoners, which rely on the *explicit* semantics present in
the document. The semantics are portable to different situations and
contexts.


I'm eager to learn about the reasons for adopting this custom vocabulary
and methodology instead of relying on standards and widely accepted
practices, about the added value of this approach, and about how your
approach is portable to other contexts.

[1] http://www.assembla.com/code/ldif/git/nodes/ldif/ldif-singlemachine/src/main/resources/ldif/local/example/test2/mappings/ALL-to-Wiki.r2r.ttl?rev=176428845b9594e28a2f0362916de23cc821502c
[2] http://www.w3.org/TeamSubmission/n3/
[3] http://www.w3.org/TR/rdf-schema/#def-subproperty

Sincerely,
-- 
Ruben Verborgh

Ghent University - IBBT
Faculty of Engineering and Architecture
Department of Electronics and Information Systems (ELIS)
Multimedia Lab
Gaston Crommenlaan 8 bus 201
B-9050 Ledeberg-Ghent
Belgium

t: +32 9 33 14959
f: +32 9 33 14896
t secr: +32 9 33 14911
e: ruben.verborgh@ugent.be

URL: http://multimedialab.elis.ugent.be

On 29 Jun 2011, at 15:23, Chris Bizer wrote:

> Hi all,
>  
> we are happy to announce the initial release of the LDIF – Linked Data
> Integration Framework today.
>  
> LDIF is a software component for building Linked Data applications which
> translates heterogeneous Linked Data from the Web into a clean, local
> target representation while keeping track of data provenance.
>  
> Applications that consume Linked Data from the Web are confronted with the
> following two challenges:
>  
> 1. data sources use a wide range of different RDF vocabularies to
> represent data about the same type of entity.
> 2. the same real-world entity, for instance a person or a place, is
> identified with different URIs within different data sources.
>  
> The usage of various vocabularies as well as the usage of URI aliases
> makes it very cumbersome for an application developer to write, for
> instance, SPARQL queries against Web data that originates from multiple
> sources.
>  
> A successful approach to ease using Web data in the application context is
> to translate heterogeneous data into a single local target vocabulary and
> to replace URI aliases with a single target URI on the client side before
> starting to ask SPARQL queries against the data.
>  
> Up to now, there have not been any integrated tools available that help
> application developers with these tasks.
>  
> With LDIF, we try to fill this gap and provide an initial alpha version of
> an open-source Linked Data Integration Framework that can be used by
> Linked Data applications to translate Web data and normalize URI aliases.
>  
> For Identity resolution, LDIF builds on the Silk Link Discovery Framework.
> For data translation, LDIF employs the R2R Mapping Framework. 
> 
> More information about LDIF and a concrete usage example is provided on
> the LDIF website at
>  
> http://www4.wiwiss.fu-berlin.de/bizer/ldif/
>  
> Lots of thanks to
>  
> Andreas Schultz (FUB)
> Andrea Matteini (MES)
> Robert Isele (FUB)
> Christian Becker (MES)
>  
> for their great work on the framework.
>  
> Best,
>  
> Chris
>  
>  
> Acknowledgments
>  
> The development of LDIF is supported in part by Vulcan Inc. as part of its
> Project Halo and by the EU FP7 project LOD2 - Creating Knowledge out of
> Interlinked Data (Grant No. 257943).
>  
> --
> Prof. Dr. Christian Bizer
> Web-based Systems Group
> Freie Universität Berlin
> +49 30 838 55509
> http://www.bizer.de
> chris@bizer.de
>
Received on Thursday, 30 June 2011 08:52:24 UTC