
Re: How To Do Deal with the Subjective Issue of Data Quality?

From: Christian Fuerber <c.fuerber@unibw.de>
Date: Mon, 11 Apr 2011 20:17:54 +0200
To: "'Marco Fossati'" <fossati@fbk.eu>
Cc: <kidehen@openlinksw.com>, <public-lod@w3.org>, <semantic-web@w3.org>
Message-ID: <004301cbf874$c6cadef0$54609cd0$@unibw.de>

Hi Marco,

please note that the data quality management (DQM) ontology is not the same
thing as the data quality constraints library! We hope to publish the first
stable version of the DQM ontology soon, so that you can test whether it
works for you.

In the meantime, I see two options:

1. Check and cleanse the data right after extraction and before storing it
in the target data store:
This is the typical scenario for more traditional data quality tools used
in data warehouses, such as Informatica Data Quality or Talend Data Quality.
However, I do not know whether these tools also offer transformation modules
for RDF. A good alternative for data cleansing is Google Refine [1] with its
RDF-Extension [2].
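
As a rough illustration of option 1, a check-and-cleanse step between
extraction and loading could look like the Python sketch below. It is not
how Informatica, Talend, or Google Refine work internally; the column names,
the sample rows, and the list of allowed country codes are all made-up
assumptions.

```python
import csv
import io

# Hypothetical extract; in practice this would be the publisher's CSV file.
RAW = """name,country,postal_code
 Hotel Alpina ,IT,38122
Rifugio Bianco,Italy,
Albergo Sole,IT,38068
"""

REQUIRED = ("name", "postal_code")      # assumed requirement: must not be empty
ALLOWED_COUNTRIES = {"IT", "AT", "DE"}  # assumed list of legal values


def check_and_cleanse(text):
    """Return (clean_rows, problems) for a CSV document given as a string."""
    clean_rows, problems = [], []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Cleansing: strip surrounding whitespace from every value.
        row = {key: (value or "").strip() for key, value in row.items()}
        row_problems = []
        # Check 1: required fields must carry a value.
        for field in REQUIRED:
            if not row[field]:
                row_problems.append((line_no, field, "missing value"))
        # Check 2: the country must be one of the legal values.
        if row["country"] not in ALLOWED_COUNTRIES:
            row_problems.append((line_no, "country", "illegal value"))
        if row_problems:
            problems.extend(row_problems)
        else:
            clean_rows.append(row)
    return clean_rows, problems
```

Only the rows in `clean_rows` would then be converted to RDF and loaded;
the entries in `problems` could be reported back to the data publisher.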

2. Load the data into the triplestore as usual and do the quality checks and
data cleansing there:
With this option, you could use our DQ constraints library [3-5] to spot DQ
problems. The advantage is that you can store your DQ requirements as SPIN
[8] constraints, so you only have to define them once. Once you have
identified the problems, you can remove them manually via SPARQL/Update
queries or directly, e.g. via the GUI in TopBraid Composer. Soon you will be
able to store data cleansing rules with the DQM ontology and use them via
SPARQL/Update queries to automate data cleansing [6-7].
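
To make option 2 more concrete, here is a minimal Python sketch of the idea:
state a requirement once, spot the violations in the stored data, then
remove the offending resources. It does not use SPIN or the DQ constraints
library. The in-memory set of triples merely stands in for the triplestore,
and the ex: names and the requirement itself are invented for illustration.
In a real setup the check would be a SPARQL query and the removal a
SPARQL/Update request.

```python
# Stand-in for the triplestore: a set of (subject, predicate, object) triples.
# All ex: names are hypothetical.
store = {
    ("ex:hotel1", "rdf:type", "ex:Hotel"),
    ("ex:hotel1", "ex:postalCode", "38122"),
    ("ex:hotel2", "rdf:type", "ex:Hotel"),      # has no postal code
    ("ex:hotel2", "ex:name", "Rifugio Bianco"),
}

# Requirement stated once, as data: every ex:Hotel needs an ex:postalCode.
REQUIREMENTS = [("ex:Hotel", "ex:postalCode")]


def find_violations(triples, requirements):
    """Return sorted (subject, missing_property) pairs."""
    violations = set()
    for cls, prop in requirements:
        instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
        with_prop = {s for s, p, o in triples if p == prop}
        violations |= {(s, prop) for s in instances - with_prop}
    return sorted(violations)


def remove_subjects(triples, subjects):
    """Return a copy of the triples without statements about the given subjects."""
    return {t for t in triples if t[0] not in subjects}
```

Because the requirement is plain data rather than logic buried in the
checking function, a new requirement is one more entry in `REQUIREMENTS`.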

IMO, a holistic data quality management approach should consider both
alternatives: data cleansing during data acquisition and data quality
monitoring of stored data (unless you can make sure that you only load
high-quality data into your triplestore). The data quality management
ontology has the advantage that data quality requirements and data cleansing
rules are explicitly represented in RDF, so that SPARQL queries can do the
job automatically.

Regarding complexity: I think that we can reduce complexity with the DQM
ontology compared to approaches that hide data quality requirements and data
cleansing rules in code.


[1] Google Refine, http://code.google.com/p/google-refine/ 
[2] RDF-Extension for Google Refine,
[3] DQ-Constraints Library,
[4] DQ-Constraints-Library Documentation,
[5] DQ-Constraints-Library Primer,
[6] LWDM-Paper Towards a Vocabulary for Data Quality Management in Semantic
Web Architectures,
[7] Presentation of paper in [6], http://slidesha.re/hul4GV  
[8] SPIN, http://spinrdf.org  

From: Marco Fossati [mailto:fossati@fbk.eu]
Sent: Monday, 11 April 2011 12:52
To: Christian Fuerber
Cc: kidehen@openlinksw.com; public-lod@w3.org; semantic-web@w3.org
Subject: Re: How To Do Deal with the Subjective Issue of Data Quality?

Hi Christian and everyone,

When working with strongly heterogeneous data, quality is fundamental: it
lets us exploit the full potential of the data.
As I am working with data coming from very local and domain-specific
realities (e.g. the tourism portal of a small geographical area), I would
like to stress real world usage and applications. With such a focus in mind,
in my opinion there are two ways to leverage quality:
1. Persuading the data publishers we are dealing with to expose better
quality data, because they can benefit from it; 
2. Restructuring the data ourselves, by working out the rules that would fix
the problems coming from the data sources (i.e. the problems created by the
data publishers).
At the moment, I am discarding the first point, as it seems to demand much
more effort than the second. In practice, telling someone who is not
familiar with Semantic Web technologies that they should expose their data
in RDF because they can earn some money in the short term is quite a complex
task.
Therefore, the creation of a data quality management ontology is very
interesting, even if I fear that it could add complexity to an already
complex issue.
In conclusion, the main question is: how could we write data quality
constraints (via an implementation of
http://semwebquality.org/documentation/primer/20101124/index.html for
example) for transforming data generally coming from non-RDF formats (CSV,
XML, microformats-annotated web pages)?


FBK Web of Data unit

On 4/7/11 10:34 PM, Christian Fuerber wrote: 
Hi Kingsley,
IMO, data quality is the degree to which data fulfills quality requirements.
As you said, quality requirements are subjective and can therefore be very
heterogeneous and contradictory, even in closed settings. In my eyes, the
most effective way to get a handle on data quality is to explicitly
represent, manage, and share quality requirements. This way, we can agree
and disagree about them while always being able to view each other's quality
assumptions. This is particularly important when making statements about
the quality of a data source or ontology.

Therefore, we have started to create a data quality management ontology
that is meant to facilitate the representation and publication of quality
requirements in RDF [1]. An overview presentation of what we could do with
such an ontology is available at [2]. As soon as we have a stable version,
we plan to publish it at http://semwebquality.org.

However, no matter how hard we try to establish a high level of data
quality, I believe that it is almost impossible to achieve 100%, especially
because of the heterogeneous requirements. But we should try to get close to
it and maintain a high level.

Please let me know what you think about our approach.

[1] http://www.heppnetz.de/files/dataquality-vocab-lwdm2011.pdf 


Dipl.-Kfm. Christian Fürber
Professur für Allgemeine BWL, insbesondere E-Business
e-business & web science research group
Universität der Bundeswehr München
e-mail: c.fuerber@unibw.de
www:   http://www.unibw.de/ebusiness/
skype: c.fuerber
twitter: cfuerber
Received on Monday, 11 April 2011 18:23:11 UTC
