Re: ANN: LDIF - Linked Data Integration Framework V0.1 released. from Ruben Verborgh on 2011-06-30 (semantic-web@w3.org from June 2011)

From: Ruben Verborgh <ruben.verborgh@ugent.be>
Date: Thu, 30 Jun 2011 10:04:01 +0200
To: Chris Bizer <chris@bizer.de>
Cc: "'public-lod'" <public-lod@w3.org>, "'Semantic Web'" <semantic-web@w3.org>, <semanticweb@yahoogroups.com>
Message-Id: <3DC7B9DF-973C-4604-91AD-B354E48C1126@ugent.be>
Hi Chris,

I've taken a look at your work and it is certainly interesting.

However, I have a couple of questions regarding the approach you have taken.
The example [1] shows that we need to create a specific mapping. But can we call this "semantic"?
It is a configuration file which can only be understood by a specific tool. It could as well have been XML or another format.
Why not choose to express the same things using existing, semantic predicates, which can be understood by different tools and express actual knowledge?
And why not rely on existing ontologies that express relations semantically, and reuse portable knowledge?
Example:

mp:Gene
    r2r:sourcePattern "?SUBJ a genes:gene";
    r2r:targetPattern "?SUBJ a smwcat:Gene".

could be

genes:gene owl:sameAs smwcat:Gene.

Not only does this have universally accepted semantics, it is also portable to different situations. For example:
_:specializedGene rdfs:subClassOf genes:gene.
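
To make the portability point concrete, here is a minimal Python sketch of what a general-purpose reasoner could derive from those two statements. It is not any particular reasoner's implementation: triples are plain string tuples, only owl:sameAs substitution in the object position and the RDFS subclass rule are applied, and ex:specializedGene (standing in for the blank node) and ex:brca1 are hypothetical names I made up for illustration.

```python
SAME_AS = "owl:sameAs"
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def infer(triples):
    """Apply two generic rules until a fixpoint is reached:
    1. owl:sameAs substitution in the object position,
    2. (x type C), (C subClassOf D) => (x type D)."""
    facts = set(triples)
    while True:
        new = set()
        same = {(s, o) for s, p, o in facts if p == SAME_AS}
        same |= {(o, s) for s, o in same}  # sameAs is symmetric
        sub = {(s, o) for s, p, o in facts if p == SUBCLASS}
        for s, p, o in facts:
            if p != SAME_AS:
                new |= {(s, p, b) for a, b in same if o == a}
            if p == TYPE:
                new |= {(s, TYPE, d) for c, d in sub if o == c}
        if new <= facts:
            return facts
        facts |= new

data = [
    ("genes:gene", SAME_AS, "smwcat:Gene"),
    ("ex:specializedGene", SUBCLASS, "genes:gene"),  # hypothetical subclass
    ("ex:brca1", TYPE, "ex:specializedGene"),        # hypothetical individual
]
closure = infer(data)
```

With only generic rules and no mapping-specific code, `closure` ends up containing `("ex:brca1", "rdf:type", "smwcat:Gene")`.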


Another thing is that I do not agree with the pattern literals.
If we take a look at such a pattern:

"?SUBJ a genes:gene",

we see there are a lot of implicit things here.
First, the prefix needs to be looked up using the r2r:prefixDefinitions predicate. So a specific syntax (Turtle prefixes) is tied to a conceptual model. I can imagine a lot of problems here. Clearly, r2r:prefixDefinitions is some kind of functional property. But when are two prefixDefinitions the same? Exact string comparison is not the answer.
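
A small Python sketch of why string comparison fails for prefix definitions. The serialization used here ("prefix: <namespace>" pairs, optionally separated by " . ") is my assumption and may not match the real R2R syntax; the example.org namespaces and the helper names are hypothetical too.

```python
import re

# Assumed serialization: "prefix: <namespace>" pairs separated by whitespace
# and an optional " . "; the actual R2R syntax may differ.
PAIR = re.compile(r"([A-Za-z][\w-]*):\s*<([^>]*)>")

def parse_prefixes(text):
    """Turn a prefix-definition string into a {prefix: namespace} dict."""
    return dict(PAIR.findall(text))

def expand(qname, prefixes):
    """Expand a prefixed name such as genes:gene to a full IRI."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + local

a = "genes: <http://example.org/genes/> . smwcat: <http://example.org/cat/>"
b = "smwcat: <http://example.org/cat/>   genes: <http://example.org/genes/>"
c = "g: <http://example.org/genes/>"

same_string = (a == b)                                   # textual comparison
same_mapping = (parse_prefixes(a) == parse_prefixes(b))  # order-insensitive
same_iri = (expand("genes:gene", parse_prefixes(a))
            == expand("g:gene", parse_prefixes(c)))      # different prefix, same IRI
```

Here `a` and `b` differ as strings but define the same mapping, and `c` uses a different prefix name yet leads to the same IRI. So even comparing the parsed dictionaries is not enough; only the expanded IRIs carry the meaning.
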
But the bigger problem I'm having is with the variables. With the ?SUBJ notation, you seem to add implicit support for universal quantification. This last sentence clarifies the big issue: "implicit". Variables are placeholders identified by a certain name in a certain scope, but the name itself is unimportant.

Concretely, "?SUBJ a genes:gene" should mean the same as "?s a genes:gene". Except that it doesn't.
Because now, "?SUBJ a smwcat:Gene" is no longer meaningful. (Similar to the above, how to define equality?)
And okay, you can argue that the scope is not the string, but the RDF document.
But what if I put the second statement in a different document? It's RDF, right, or is this an application-specific configuration file?
And okay, we can say that the scope is the entity it belongs to. But then we have a major problem:

mp:GeneID
  r2r:mappingRef mp:Gene;
  r2r:sourcePattern "?SUBJ genes:GeneId ?x";
  r2r:targetPattern "?SUBJ smwprop:KeggGeneId ?'x'^^xsd:string";

GeneID also uses the ?SUBJ variable, but also has a relationship with Gene. This puts them in the same scope. But clearly, the ?SUBJ from Gene and the ?SUBJ from GeneID should be different. This is a serious problem, which cannot be solved rigorously, so the semantics will remain messy, since variables and scope are not formally defined.
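
To illustrate what "the name itself is unimportant" would require in practice, here is a minimal Python sketch that compares patterns up to consistent renaming of variables. It assumes the scope is exactly the set of patterns passed in together, which is precisely the thing the mapping vocabulary leaves undefined; the function name and the simple `?name` variable syntax are my own simplifications.

```python
import re

VAR = re.compile(r"\?(\w+)")

def canonical(*patterns):
    """Rename variables to ?v0, ?v1, ... in order of first appearance.
    All patterns passed in one call share a single scope."""
    names = {}
    def rename(match):
        return names.setdefault(match.group(1), f"?v{len(names)}")
    return tuple(VAR.sub(rename, p) for p in patterns)

# Source and target pattern treated as one scope: renaming both is harmless.
original = canonical("?SUBJ a genes:gene", "?SUBJ a smwcat:Gene")
renamed = canonical("?s a genes:gene", "?s a smwcat:Gene")

# Renaming only the source pattern silently breaks the link to the target.
broken = canonical("?s a genes:gene", "?SUBJ a smwcat:Gene")
```

`original` and `renamed` are equal, while `broken` is not, because `?SUBJ` in the target no longer refers to anything bound by the source. With string literals, nothing in the RDF model itself tells a tool which patterns belong in the same call to `canonical`.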

You can invalidate my arguments by saying that this RDF document is only meant for a specific purpose etc. But why use RDF then, which is all about portable semantics? See my question at the top of this e-mail.


As a solution, I would propose a W3C team submission which deals with quantification properly: Notation3 [2]. They really got quantification right. Look how much more semantic (and thus portable!) things become:

mp:hasPathway
	a r2r:PropertyMapping;
	r2r:mappingRef    	mp:Gene;
	r2r:sourcePattern 	"?SUBJ genes:hasPathway ?x";
	r2r:targetPattern	"?SUBJ smwprop:IsInvolvedIn ?x . ?x smwprop:Involves ?SUBJ";

becomes

{
  ?s genes:hasPathway ?x.
}
=>
{
  ?s smwprop:IsInvolvedIn ?x.
  ?x smwprop:Involves ?s.
}.
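
As a rough illustration of how a generic engine could apply such a rule, here is a minimal Python sketch for single-premise rules. It is not an N3 parser or a real reasoner: the rule is written directly as tuples, and ex:geneA and ex:pathway1 are hypothetical resources.

```python
def is_var(term):
    return term.startswith("?")

def match(pattern, triple):
    """Return variable bindings that make `pattern` equal `triple`, or None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if is_var(p):
            if bindings.setdefault(p, t) != t:
                return None  # same variable bound to two different terms
        elif p != t:
            return None
    return bindings

def apply_rule(premise, conclusions, facts):
    """For every fact matching `premise`, instantiate all `conclusions`."""
    derived = set()
    for fact in facts:
        bindings = match(premise, fact)
        if bindings is not None:
            for c in conclusions:
                derived.add(tuple(bindings.get(term, term) for term in c))
    return derived

facts = {("ex:geneA", "genes:hasPathway", "ex:pathway1")}  # hypothetical data

derived = apply_rule(
    ("?s", "genes:hasPathway", "?x"),
    [("?s", "smwprop:IsInvolvedIn", "?x"), ("?x", "smwprop:Involves", "?s")],
    facts,
)

# The same rule with different variable names yields the same result.
derived_renamed = apply_rule(
    ("?a", "genes:hasPathway", "?b"),
    [("?a", "smwprop:IsInvolvedIn", "?b"), ("?b", "smwprop:Involves", "?a")],
    facts,
)
```

The bindings live only for the duration of one rule application, so the variable names can be changed freely without affecting the outcome.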

Note how the variables now have proper scope and meaning. But even quantification isn't necessary here:

genes:hasPathway rdfs:subPropertyOf smwprop:IsInvolvedIn.
genes:hasPathway rdfs:subPropertyOf smwprop:Involves.

This exactly matches the definition of a subproperty [3]: "If a property P' is a super-property of a property P, then all pairs of resources which are related by P are also related by P'."
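
For illustration, the subproperty definition quoted above can be sketched in a few lines of Python. This is a single pass of the rule over string tuples, without transitivity of rdfs:subPropertyOf, and ex:geneA and ex:pathway1 are hypothetical resources.

```python
SUBPROP = "rdfs:subPropertyOf"

def apply_subproperty(facts):
    """One pass of: (s P o), (P subPropertyOf Q) => (s Q o)."""
    sub = {(s, o) for s, p, o in facts if p == SUBPROP}
    return set(facts) | {
        (s, q, o) for s, p, o in facts for p2, q in sub if p == p2
    }

facts = {
    ("genes:hasPathway", SUBPROP, "smwprop:IsInvolvedIn"),
    ("genes:hasPathway", SUBPROP, "smwprop:Involves"),
    ("ex:geneA", "genes:hasPathway", "ex:pathway1"),  # hypothetical data
}
result = apply_subproperty(facts)
```

One caveat worth checking: the rule keeps subject and object in place, so the second statement yields `ex:geneA smwprop:Involves ex:pathway1`, whereas the N3 rule above emits `?x smwprop:Involves ?s`. If the inverted direction is what is wanted, something like owl:inverseOf would be needed instead.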

The major benefit of this is that everything can happen by general-purpose Semantic Web reasoners, which rely on the *explicit* semantics present in the document. The semantics are portable to different situations and contexts.


I'm eager to learn the reasons for adopting this custom vocabulary and methodology, what value this approach adds over relying on standards and widely accepted practices, and how your approach is portable to other contexts.

[1] http://www.assembla.com/code/ldif/git/nodes/ldif/ldif-singlemachine/src/main/resources/ldif/local/example/test2/mappings/ALL-to-Wiki.r2r.ttl?rev=176428845b9594e28a2f0362916de23cc821502c
[2] http://www.w3.org/TeamSubmission/n3/
[3] http://www.w3.org/TR/rdf-schema/#def-subproperty

Sincerely,
-- 
Ruben Verborgh

Ghent University - IBBT
Faculty of Engineering and Architecture
Department of Electronics and Information Systems (ELIS)
Multimedia Lab
Gaston Crommenlaan 8 bus 201
B-9050 Ledeberg-Ghent
Belgium

t: +32 9 33 14959
f: +32 9 33 14896
t secr: +32 9 33 14911
e: ruben.verborgh@ugent.be

URL: http://multimedialab.elis.ugent.be

On 29 Jun 2011, at 15:23, Chris Bizer wrote:

> Hi all,
>  
> we are happy to announce the initial release of the LDIF – Linked Data Integration Framework today.
>  
> LDIF is a software component for building Linked Data applications which translates heterogeneous Linked Data from the Web into
> a clean, local target representation while keeping track of data provenance.
>  
> Applications that consume Linked Data from the Web are confronted with the following two challenges:
>  
> 1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
> 2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.
>  
> The usage of various vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write for instance SPARQL queries against Web data that originates from multiple sources.
>  
> A successful approach to ease using Web data in the application context is to translate heterogeneous data into a single local target vocabulary and to replace URI aliases with a single target URI on the client side before starting to ask SPARQL queries against the data.
>  
> Up till now, there have not been any integrated tools available that help application developers with these tasks.
>  
> With LDIF, we try to fill this gap and provide an initial alpha version of an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URI aliases.
>  
> For identity resolution, LDIF builds on the Silk Link Discovery Framework.
> For data translation, LDIF employs the R2R Mapping Framework. 
> 
> More information about LDIF and a concrete usage example is provided on the LDIF website at
>  
> http://www4.wiwiss.fu-berlin.de/bizer/ldif/
>  
> Lots of thanks to
>  
> Andreas Schultz (FUB)
> Andrea Matteini (MES)
> Robert Isele (FUB)
> Christian Becker (MES)
>  
> for their great work on the framework.
>  
> Best,
>  
> Chris
>  
>  
> Acknowledgments
>  
> The development of LDIF is supported in part by Vulcan Inc. as part of its Project Halo and by the EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943).
>  
> --
> Prof. Dr. Christian Bizer
> Web-based Systems Group
> Freie Universität Berlin
> +49 30 838 55509
> http://www.bizer.de
> chris@bizer.de
>
Received on Thursday, 30 June 2011 08:04:36 UTC