Re: Schema Mappings (was Re: AW: ANN: LOD Cloud - Statistics and compliance with best practices)

Hi Leigh and Enrico,

> Hi,
>
> On 22 October 2010 09:35, Chris Bizer <chris@bizer.de> wrote:
>>> Anja has pointed to a wealth of openly
>>> available numbers (no pun intended) that have not been discussed at
>>> all. For
>>> example, only 7.5% of the data sources provide a mapping of "proprietary
>>> vocabulary terms" to "other vocabulary terms". For anyone building
>>> applications to work with LOD, this is a real problem.
>>
>> Yes, this is also the figure that scared me most.
>
> This might be low for a good reason: people may be creating
> proprietary terms because they don't feel well served by existing
> vocabularies and hence defining mappings (or even just reusing terms)
> may be difficult or even impossible.

Yes, this is true in many cases, at any given point in time.

But altogether I think it is important to see web-scale data integration as
an evolutionary process in which different factors play together over time.

In my opinion these factors are:

1. An increasing number of people are starting to use existing vocabularies,
which already solves the integration problem in some areas simply through
agreement on these vocabularies.
2. More and more instance data is becoming available on the Web, which makes
it easier to mine schema mappings using statistical methods.
3. Different groups in various areas want to contribute to solving the
integration problem and thus invest effort in manually aligning vocabularies
(for instance, between different standards used in the library community, or
for people- and provenance-related vocabularies within the W3C Social Web
and Provenance XGs).
4. The Web allows you to share mappings by publishing them as RDF (see the
sketch below). Thus many different people and groups may provide small
contributions (= hints) that help to solve the problem in the long run.
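
To make factor 4 concrete, here is a minimal sketch (in Python with rdflib;
all example.org namespaces are hypothetical placeholders) of what publishing
such a mapping as RDF could look like:

  from rdflib import Graph, Namespace
  from rdflib.namespace import OWL

  # Hypothetical namespaces, just for illustration.
  MYV  = Namespace("http://example.org/myvocab#")   # a proprietary vocabulary
  FOAF = Namespace("http://xmlns.com/foaf/0.1/")

  g = Graph()
  g.bind("owl", OWL)
  g.bind("myv", MYV)
  g.bind("foaf", FOAF)

  # A one-triple correspondence: myv:fullName means the same as foaf:name.
  g.add((MYV.fullName, OWL.equivalentProperty, FOAF.name))

  # "Publishing" then simply means serving this serialization from a
  # stable URI on the Web, so that others can discover and reuse it.
  print(g.serialize(format="turtle"))

Richer mapping languages go beyond such simple equivalences, but even tiny
hints like this one add up.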

My thinking on the topic was strongly influenced by the pay-as-you-go data
integration ideas developed by Alon Halevy and other people in the
dataspaces community. In my opinion, a cool paper on the topic is:

Madhavan, J.; Cohen, S.; Dong, X.; Halevy, A.; Jeffery, S.; Ko, D.; Yu, C.:
Web-Scale Data Integration: You Can Afford to Pay as You Go. CIDR (2007).
http://research.yahoo.com/files/paygo.pdf

It describes a system that applies schema clustering to mine mappings from
Google Base and web-table data, and presents ideas on how to deal with the
resulting uncertainty using ranking algorithms.

Other interesting papers in the area are:

Das Sarma, A.; Dong, X.; Halevy, A.: Bootstrapping Pay-as-you-go Data
Integration Systems. SIGMOD (2008).

Vaz Salles, M.A.; Dittrich, J.; Karakashian, S.K.; Girard, O.R.; Blunschi,
L.: iTrails: Pay-as-you-go Information Integration in Dataspaces. VLDB
(2007), pp. 663-674.

Franklin, M.J.; Halevy, A.Y.; Maier, D.: From Databases to Dataspaces: A New
Abstraction for Information Management. SIGMOD Record 34(4), pp. 27-33
(2005).

Hedeler, C., et al.: Dimensions of Dataspaces. Proceedings of the 26th
British National Conference on Databases, pp. 55-66 (2009).

These guys all assume that mappings are added to a dataspace by
administrators or are mined using a single, specific method.

What I think is interesting in the Web of Linked Data setting is that
mappings can be created and published by different parties into a single
global dataspace. This means that the effort needed to create the mappings
can be divided among different parties, so pay-as-you-go might evolve into
somebody-pays-as-you-go :-)
Of course, it also means that the quality of mappings becomes increasingly
uncertain, and that information consumers need to assess the quality of
mappings and decide which ones they want to use (a toy sketch of such an
assessment follows below).
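
Just to illustrate the kind of assessment I mean, here is a toy heuristic
(in Python; all names and numbers are made up) that ranks alternative
mappings by publisher and by community ratings:

  # Hypothetical set of publishers the consumer already trusts.
  TRUSTED_PUBLISHERS = {"http://example.org/library-community"}

  def score(mapping):
      s = mapping["avg_rating"]                       # community ratings, 0..1
      if mapping["publisher"] in TRUSTED_PUBLISHERS:  # provenance bonus
          s += 0.5
      return s

  candidates = [
      {"uri": "http://example.org/m1",
       "publisher": "http://example.org/library-community", "avg_rating": 0.6},
      {"uri": "http://example.org/m2",
       "publisher": "http://example.org/unknown", "avg_rating": 0.9},
  ]
  best = max(candidates, key=score)
  print("chosen mapping:", best["uri"])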

We are currently exploring this problem space and will present a paper about
publishing and discovering mappings on the Web of Linked Data at the COLD
workshop at ISWC 2010.

http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/BizerSchultz-COLD-R2R-Paper.pdf

The central ideas of the paper are:
1. Mappings are identified with URIs so that they can be linked from
vocabulary definitions or voiD dataset descriptions, and so that client
applications as well as Web of Data search engines can discover them.
2. A client application that discovers data represented using terms unknown
to it may search the Web for mappings, apply a quality evaluation heuristic
to decide which of the alternative mappings to use, and then apply the
chosen mappings to translate the data into its local schema (see the sketch
below).
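
As a rough sketch of idea 2 (again Python with rdflib, and with a plain
SPARQL CONSTRUCT rewrite standing in for a full R2R mapping, which is
richer than this):

  from rdflib import Graph

  # Incoming data using a term unknown to the application (hypothetical).
  DATA = """
  @prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
  <http://example.org/alice> vcard:fn "Alice" .
  """

  # A discovered mapping, simplified here to a SPARQL CONSTRUCT query that
  # translates vcard:fn into the application's local term foaf:name.
  MAPPING = """
  PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
  PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
  CONSTRUCT { ?s foaf:name ?n . }
  WHERE     { ?s vcard:fn  ?n . }
  """

  incoming = Graph().parse(data=DATA, format="turtle")

  local = Graph()
  for triple in incoming.query(MAPPING):  # CONSTRUCT results are triples
      local.add(triple)

  print(local.serialize(format="turtle"))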

> This also strikes me as an opportunity: someone could usefully build a
> service (perhaps built on facilities in Sindice) that aggregated
> schema information and provides tools for expressing simple mappings
> and equivalencies. It could fill a dual role: recommend more
> common/preferred terms, whilst simultaneously providing
> machine-readable equivalencies.

Absolutely, there might even be opportunities to set up businesses in this
area.

> I know that Uberblic provides some mapping tools in this area,
> allowing for the creation of a more normalized view across the web,
> but not sure how much of that is resurfaced.

I think that Uberblic republishes the normalized data.
Georgi: Correct me if I'm wrong!

From Enrico:
> I happen to agree with Martin here.
> My concern is that the naïveté of most of the research in LOD
> creates the illusion that data integration is an easily solvable 
> problem -- while it is well known that it is the most important 
> open problem in the database community (30+ years of research) 
> where there is a huge amount of money, research, and resources 
> invested in it. This will eventually backfire on us - the whole
> community including me - since people will not trust us anymore.

I agree with you that data integration is - besides data quality
assessment - the biggest challenge that the Web of Linked Data currently
faces. The problem is clearly not solved, but I'm not as pessimistic about
it as you are. Hey, isn't it the idea of science to pick up hard challenges
and try to make progress on them?

As I said above, the two new aspects that the Web of Data adds to the
problem/solution space are:
1. With lots of instance data available on the Web, it becomes easier to
use statistical methods to mine correspondences (see the sketch after this
list).
2. The Web adds a social dimension to the integration problem, meaning that
many different interested parties can invest effort and help solve the
problem (by defining or mining correspondences, publishing them on the Web,
rating the quality of correspondences, and so on).
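
A crude illustration of aspect 1 (plain Python over toy data; property
names and values are made up): even simple value-overlap statistics over
instance data already suggest candidate correspondences.

  # Toy property alignment via value overlap (Jaccard similarity).
  def jaccard(a, b):
      return len(a & b) / len(a | b) if a | b else 0.0

  # Made-up instance data: property -> set of observed values.
  source_a = {"myv:fullName": {"Alice", "Bob", "Carol"},
              "myv:page":     {"http://a.example/", "http://b.example/"}}
  source_b = {"foaf:name":    {"Alice", "Bob", "Dave"},
              "foaf:mbox":    {"mailto:alice@example.org"}}

  for p_a, vals_a in source_a.items():
      for p_b, vals_b in source_b.items():
          s = jaccard(vals_a, vals_b)
          if s > 0.3:  # arbitrary threshold
              print("candidate:", p_a, "<->", p_b, "score", round(s, 2))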

Sure, on the theoretical level, the problem is very hard. But in many
practical cases, a lot can already be achieved with a rather small amount
of effort. For instance, there are about five vocabularies in the LOD cloud
that are used to represent basic information about people. Once somebody
defines and publishes mappings between them, things will already work much
more smoothly (a toy example follows below).
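
Here is that toy example (treat the exact term URIs as assumptions on my
part): once such a mapping is published, an application can normalize
incoming person data with a simple predicate-rewriting step:

  # Hypothetical published correspondences between two people vocabularies.
  PEOPLE_MAPPING = {
      "http://www.w3.org/2006/vcard/ns#fn":  "http://xmlns.com/foaf/0.1/name",
      "http://www.w3.org/2006/vcard/ns#url": "http://xmlns.com/foaf/0.1/homepage",
  }

  def normalize(triples):
      """Rewrite predicates into the application's preferred (FOAF) terms."""
      for s, p, o in triples:
          yield (s, PEOPLE_MAPPING.get(p, p), o)

  data = [("http://example.org/alice",
           "http://www.w3.org/2006/vcard/ns#fn", "Alice")]
  print(list(normalize(data)))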

I'm also happy to see that not everybody is as pessimistic about the
problem as you are. For instance, there was an interesting workshop
organized by Michael Stonebraker in Washington last year, where senior
database people got together and tried to develop a perspective on how to
tackle data integration at global scale. See:

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=nap12916&part=nap12916.app1
Workshop proceedings:
http://www.nap.edu/openbook.php?record_id=12916&page=R1

Have a nice weekend,

Chris

Received on Saturday, 23 October 2010 10:03:50 UTC