
Re: ANN: LDIF - Linked Data Integration Framework V0.1 released.

From: Chris Bizer <chris@bizer.de>
Date: Thu, 30 Jun 2011 22:34:13 +0200
To: "'Ruben Verborgh'" <ruben.verborgh@ugent.be>
Cc: "'public-lod'" <public-lod@w3.org>, "'Semantic Web'" <semantic-web@w3.org>, <semanticweb@yahoogroups.com>
Message-ID: <012801cc3765$133a6380$39af2a80$@de>
Hi Ruben,

> Thanks for the fast and detailed reply, it's a very interesting
> discussion.
>
> Indeed, there are several ways for mapping and identity resolution.
> But what strikes me is that people in the community seem to be
> insufficiently aware of the possibilities and performance of current
> reasoners.

Possibly. But luckily we are today in the position to just give it a try.

So an idea with my Semantic Web Challenge hat on:

Why not take the Billion Triples 2011 data set
(http://challenge.semanticweb.org/), which consists of 2 billion triples
recently crawled from the Web, and try to find all data in the dataset
about authors and their publications, map this data to a single target
schema, and merge all duplicates?
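Just to make the task concrete: a single mapping step of this kind could, in the SPARQL-based style of R2R, look roughly like the following CONSTRUCT query. This is only an illustrative sketch; the source vocabulary (swrc:) and target vocabularies (foaf:, dct:) are example choices, not part of the challenge definition.

```sparql
# Sketch: map publication data from one source vocabulary (here SWRC)
# into a single target schema (here FOAF + Dublin Core Terms).
PREFIX swrc: <http://swrc.ontoware.org/ontology#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dct:  <http://purl.org/dc/terms/>

CONSTRUCT {
  ?pub a dct:BibliographicResource ;
       dct:title   ?title ;
       dct:creator ?author .
  ?author a foaf:Person ;
          foaf:name ?name .
}
WHERE {
  ?pub a swrc:Publication ;
       swrc:title  ?title ;
       swrc:author ?author .
  ?author swrc:name ?name .
}
```

In practice one such query would be needed per source vocabulary found in the crawl, plus a separate identity-resolution step to merge duplicate authors and publications.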

Our current LDIF in-memory implementation is not capable of doing this, as 2
billion triples are too much data for it. But with the planned Hadoop-based
implementation we hope to get into this range.

It would be very interesting if somebody else would try to solve the task
above using a reasoner-based approach; we could then compare the number of
authors and publications identified, as well as the duration of the data
integration process.

Anybody interested?

Cheers,

Chris


> As you can see the data translation requires lots of structural
> transformations as well as complex property value transformations using
> various functions. All things that current logical formalisms are not
> very good at.

Oh yes, they are. All the transformations needed in your paper can be
performed by at least two reasoners: cwm [1] and EYE [2], using built-ins
[3]. These include regular expressions and datatype transforms.
Frankly, every transform in the R2R example can be expressed as an N3 rule.
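For instance, a typical value transformation (joining a first and last name into a single foaf:name) can be written as an N3 rule using the string built-ins, runnable with cwm or EYE. The source vocabulary (src:) here is hypothetical, chosen only for illustration:

```n3
@prefix src:    <http://source.example.org/schema#> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix string: <http://www.w3.org/2000/10/swap/string#> .

# If a resource has a first and a last name, derive a single foaf:name
# by concatenating them with a space, via the string:concatenation built-in.
{ ?p src:firstName ?f .
  ?p src:lastName  ?l .
  (?f " " ?l) string:concatenation ?full . }
=>
{ ?p foaf:name ?full . } .
```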

> If I as an application developer
> want to get a job done, what does it help me if I can exchange mappings
> between different tools that all don't get the job done?

Because different tools can contribute different results, and if you use a
common language and idiom, they can all work with the same data and
metadata.

> more and more developers know SPARQL which makes it easier for them to
> learn R2R.

The set of developers who know SPARQL is a proper subset of those who know
plain RDF, which is what I suggest using. And even if rules are necessary,
N3 is only a small extension of RDF.

> Benchmark we have the feeling that SPARQL engines are more suitable for
> this task than current reasoning engines due to their performance problems
> as well as problems to deal with inconsistent data. 

The extremely solid performance [4] of EYE is too little known. It solves
problems in linear time that other reasoners cannot solve at all.

But my main point is semantics. Why create a new system with its own
meanings and interpretations, when there is so much that can be done with
plain RDF and its widely known vocabularies (RDFS, OWL)?
Ironically, a tool that aims to reconcile different RDF sources does not
itself use common vocabularies to express well-known relationships.
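To illustrate: the simplest correspondences between schemas need no custom mapping language at all, just the standard vocabularies (the src: IRIs below are again hypothetical):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix src:  <http://source.example.org/schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Standard-vocabulary mapping statements, understood by any
# RDFS/OWL-aware reasoner without a dedicated mapping engine.
src:fullName owl:equivalentProperty foaf:name .
src:Author   rdfs:subClassOf        foaf:Person .
```

Any RDFS/OWL-aware reasoner can then translate instance data across these schemas directly, and such mapping statements are themselves plain RDF that can be published and reused on the Web.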

Cheers,

Ruben

[1] http://www.w3.org/2000/10/swap/doc/cwm.html
[2] http://eulersharp.sourceforge.net/
[3] http://www.w3.org/2000/10/swap/doc/CwmBuiltins
[4] http://eulersharp.sourceforge.net/2003/03swap/dtb-2010.txt

On 30 Jun 2011, at 10:51, Chris Bizer wrote:

> Hi Ruben,
> 
> thank you for your detailed feedback.
> 
> Of course it is always a question of taste how you prefer to express data
> translation rules, and I agree that simple mappings can also be expressed
> using standard OWL constructs.
> 
> When designing the R2R mapping language, we first analyzed the real-world
> requirements that arise if you try to properly integrate data from existing
> Linked Data on the Web. We summarize our findings in Section 5 of the
> following paper:
> http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/BizerSchultz-COLD-R2R-Paper.pdf
> As you can see, the data translation requires lots of structural
> transformations as well as complex property value transformations using
> various functions. All things that current logical formalisms are not
> very good at.
> 
> Other reasons why we chose to base the mapping language on SPARQL were:
> 
> 1. more and more developers know SPARQL, which makes it easier for them to
> learn R2R.
> 2. we want to be able to translate large amounts of messy, inconsistent
> Web data (billions of triples in the mid-term), and from our experience
> with the BSBM Benchmark we have the feeling that SPARQL engines are more
> suitable for this task than current reasoning engines, which have
> performance problems as well as problems dealing with inconsistent data.
> 
> I disagree with you that R2R mappings are not suitable for being exchanged
> on the Web. On the contrary, they were especially designed for being
> published and discovered on the Web, and allow partial mappings from
> different sources to be easily combined (see the paper above for details).
> 
> I think your argument about the portability of mappings between different
> tools is currently only partially valid. If I as an application developer
> want to get a job done, what does it help me if I can exchange mappings
> between different tools that all don't get the job done?
> 
> Also note that with LDIF we aim to provide identity resolution in addition
> to schema mapping. It is well known that identity resolution in practical
> settings requires rather complex matching heuristics (see the Silk papers
> for details about the different matchers that are usually employed), and
> identity resolution is again a topic where reasoning engines don't have
> too much to offer.
> 
> But again, there are different ways and tastes about how to express
> mapping rules and identity resolution heuristics. R2R and Silk LSL are our
> approaches to getting the job done. We are of course happy if other people
> provide working solutions for the task of integrating and cleansing messy
> data from the Web of Linked Data, and happy to compare our approach with
> theirs.
> 
> Cheers,
> 
> Chris
Received on Thursday, 30 June 2011 20:34:43 UTC
