Re: How To Do Deal with the Subjective Issue of Data Quality?

Kingsley and all,

Sorry for not having replied sooner.

I'm very interested in data quality; you may remember the older thread on
the UK gov dev Google group.

The first question is how to define quality, and how to disambiguate what
you call the subjective aspects of it from the objective ones, along some
'dimension' (dimensions of quality).

Below is a set of slides I used in teaching; I may have more supporting
materials in my archive that contain some pointers on how to handle some
aspects of quality management activities.


Apologies if references are missing from the slideset; the slides are
supposed to be read in conjunction with other materials which included
references to sources (ah, provenance, provenance...). Some of the diagrams
are somebody else's that I picked up from books etc.


http://www.slideshare.net/PaolaDIM/what-is-quality-paola-di-maio


Let me know if there are any ideas in there you would like me to expand on
or explain, although it should be self-explanatory.

cheers

PDM






On Mon, Apr 11, 2011 at 7:17 PM, Christian Fuerber <c.fuerber@unibw.de> wrote:

> Hi Marco,
>
> please note that the data quality management (DQM) ontology is something
> different from the data quality constraints library! We hope to publish the
> first stable version of the DQM ontology soon, so you can try whether it
> will work for you.
>
> In the meantime, you have, in my eyes, two options:
>
> 1. Check and cleanse data right after its extraction and before its storage
> in the target data store:
> This option would be a typical scenario for more traditional data quality
> tools used in data warehouses, such as Informatica Data Quality or Talend
> Data Quality. But I do not know whether these tools also offer
> transformation modules for RDF. A good alternative for data cleansing is
> Google Refine [1] with its RDF extension [2].
>
> 2. Load the data as usual into the triplestore and do quality checks and
> data cleansing there:
> In this option, you could use our DQ constraints library [3-5] to spot DQ
> problems, with the advantage that you can store your DQ requirements as
> SPIN [8] constraints, so that you only have to define them once. Once you
> have identified the problems, you can remove them manually via SPARQL/Update
> queries or directly, e.g. via the GUI in TopBraid Composer. Soon you will be
> able to store data cleansing rules with the DQM ontology and use them via
> SPARQL/Update queries to automate data cleansing [6-7].
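>
> Just to make option 2 concrete, here is a rough sketch of the same idea in
> plain SPARQL via Python/rdflib (the file name, properties, and the "empty
> price" rule are invented for the example; the DQ constraints library itself
> expresses such rules as SPIN constraints instead):
>
>   # Rough sketch: spot a data quality problem with SPARQL, then cleanse it
>   # with SPARQL Update. Everything below uses invented example data.
>   from rdflib import Graph
>
>   g = Graph()
>   g.parse("products.ttl", format="turtle")   # hypothetical input data
>
>   # Spot: products whose price literal is empty (a made-up DQ requirement)
>   problems = g.query("""
>       PREFIX ex: <http://example.org/>
>       SELECT ?product WHERE {
>           ?product ex:price ?price .
>           FILTER (str(?price) = "")
>       }""")
>   for row in problems:
>       print("Empty price on", row.product)
>
>   # Cleanse: remove the offending triples via SPARQL Update
>   g.update("""
>       PREFIX ex: <http://example.org/>
>       DELETE { ?product ex:price ?price }
>       WHERE  { ?product ex:price ?price . FILTER (str(?price) = "") }""")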
>
> IMO, a holistic data quality management approach should consider both
> alternatives: data cleansing during data acquisition and data quality
> monitoring of stored data (unless you can make sure that you only load high
> quality data into your triplestore). The data quality management ontology
> has the advantage that data quality requirements and data cleansing rules
> are explicitly represented in RDF, so that SPARQL queries can do the job in
> an automated way.
>
> Regarding complexity: I think that we can reduce complexity with the DQM
> ontology compared to approaches that hide data quality requirements and
> data
> cleansing rules in code.
>
> Cheers,
> Christian
>
> [1] Google Refine, http://code.google.com/p/google-refine/
> [2] RDF-Extension for Google Refine,
> http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/
> [3] DQ-Constraints Library,
> http://semwebquality.org/ontologies/dq-constraints.rdf
> [4] DQ-Constraints-Library Documentation,
> http://semwebquality.org/ontologies/dq-constraints.html
> [5] DQ-Constraints-Library Primer,
> http://semwebquality.org/documentation/primer/20101124/index.html
> [6] LWDM-Paper Towards a Vocabulary for Data Quality Management in Semantic
> Web Architectures,
> http://www.heppnetz.de/files/dataquality-vocab-lwdm2011.pdf
> [7] Presentation of paper in [6], http://slidesha.re/hul4GV
> [8] SPIN, http://spinrdf.org
>
> From: Marco Fossati [mailto:fossati@fbk.eu]
> Sent: Monday, 11 April 2011 12:52
> To: Christian Fuerber
> Cc: kidehen@openlinksw.com; public-lod@w3.org; semantic-web@w3.org
> Subject: Re: How To Do Deal with the Subjective Issue of Data Quality?
>
> Hi Christian and everyone,
>
> When working with strongly heterogeneous data, quality is fundamental: it
> lets us exploit the full potential of the data.
> As I am working with data coming from very local and domain-specific
> realities (e.g. the tourism portal of a small geographical area), I would
> like to stress real-world usage and applications. With such a focus in
> mind, in my opinion there are two ways to leverage quality:
> 1. Persuading the data publishers we are dealing with to expose better
> quality data, because they can benefit from it;
> 2. Performing a data restructuring on our own, by trying to work out the
> rules that should fix the problems coming from the data sources (i.e. the
> problems created by the data publishers).
> At the moment, I am discarding the first point, as it seems to demand much
> more effort than the second one. In practice, telling someone who is not
> initiated into Semantic Web technologies that they have to expose their
> data in RDF because they can earn some money in the short term is quite a
> complex task.
> Therefore, the creation of a data quality management ontology is very
> interesting, even if I fear that it could add complexity to an already
> complex issue.
> In conclusion, the main question is: how could we write data quality
> constraints (via an implementation of
> http://semwebquality.org/documentation/primer/20101124/index.html, for
> example) to transform data that generally comes from non-RDF formats (CSV,
> XML, microformats-annotated web pages)?
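>
> To make the question concrete, the kind of transformation I have in mind
> looks roughly like the sketch below (Python with rdflib; file, column, and
> property names are invented). The quality rule is hard-coded here, and that
> is exactly what I would like to express as a declarative constraint instead:
>
>   # Sketch: convert CSV rows to RDF, applying one ad-hoc quality rule
>   # (drop rows without a name). All names below are illustrative only.
>   import csv
>   from rdflib import Graph, Literal, Namespace
>
>   EX = Namespace("http://example.org/tourism/")   # made-up namespace
>   g = Graph()
>
>   with open("hotels.csv", newline="") as f:       # hypothetical source file
>       for row in csv.DictReader(f):
>           if not row.get("name", "").strip():     # ad-hoc quality rule
>               continue                            # skip incomplete rows
>           hotel = EX["hotel/" + row["id"]]
>           g.add((hotel, EX.name, Literal(row["name"])))
>           g.add((hotel, EX.stars, Literal(row["stars"])))
>
>   g.serialize(destination="hotels.ttl", format="turtle")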
>
> Cheers,
>
> Marco
> FBK Web of Data unit
> http://fbk.eu
> http://wed.fbk.eu/
>
> On 4/7/11 10:34 PM, Christian Fuerber wrote:
> Hi Kingsley,
> IMO, data quality is the degree to which data fulfills quality
> requirements.
> As you said, quality requirements are subjective and, therefore, can be
> very heterogeneous and contradictory, even in closed settings. In my eyes,
> the most effective way to get a handle on data quality is to explicitly
> represent, manage, and share quality requirements. This way, we can agree
> and disagree about them while we can always view each other's quality
> assumptions. This is particularly important when making statements about
> the quality of a data source or ontology.
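>
> As a toy illustration of what an explicitly represented quality requirement
> could look like (the vocabulary below is invented for the example and is
> not the DQM ontology itself):
>
>   # Toy example: one quality requirement stated as plain RDF triples,
>   # using an invented vocabulary so it can be shared and discussed as data.
>   from rdflib import Graph, Literal, Namespace
>   from rdflib.namespace import RDF
>
>   DQ = Namespace("http://example.org/dq#")   # made-up vocabulary
>   EX = Namespace("http://example.org/")
>   g = Graph()
>
>   req = EX.PriceMustBePositive
>   g.add((req, RDF.type, DQ.QualityRequirement))
>   g.add((req, DQ.constrainedProperty, EX.price))
>   g.add((req, DQ.condition, Literal("value > 0")))
>   g.add((req, DQ.statedBy, EX.stakeholderA))   # whose requirement it is
>
>   print(g.serialize(format="turtle"))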
>
> Therefore, we have started to create a data quality management ontology
> which shall facilitate the representation and publication of quality
> requirements in RDF [1]. An overview presentation of what we could do with
> such an ontology is available at [2]. As soon as we have a stable version,
> we plan to publish it at http://semwebquality.org for the public.
>
> However, no matter how hard we try to establish a high level of data
> quality, I believe that it is almost impossible to achieve 100%, especially
> due to the heterogeneous requirements. But we should try to approximate and
> maintain a high level.
>
> Please, let me know what you think about our approach.
>
> [1] http://www.heppnetz.de/files/dataquality-vocab-lwdm2011.pdf
> [2]
> http://www.slideshare.net/cfuerber/towards-a-vocabulary-for-data-quality-management-in-semantic-web-architectures
>
> Cheers,
> Christian
>
> ------------------------------------------
> Dipl.-Kfm. Christian Fürber
> Professur für Allgemeine BWL, insbesondere E-Business
> E-Business & Web Science Research Group
> Universität der Bundeswehr München
>
> e-mail: c.fuerber@unibw.de
> www:   http://www.unibw.de/ebusiness/
> skype: c.fuerber
> twitter: cfuerber
>

Received on Monday, 11 April 2011 18:57:24 UTC