- From: Christian Fuerber <c.fuerber@unibw.de>
- Date: Mon, 11 Apr 2011 20:17:54 +0200
- To: "'Marco Fossati'" <fossati@fbk.eu>
- Cc: <kidehen@openlinksw.com>, <public-lod@w3.org>, <semantic-web@w3.org>
Hi Marco,

please note that the data quality management (DQM) ontology is something different from the data quality constraints library! We hope to publish the first stable version of the DQM ontology soon, so you can check whether it works for you. In the meantime, you have in my eyes two options:

1. Check and cleanse the data right after its extraction and before it is stored in the target data store: This would be a typical scenario for traditional data quality tools used in data warehouses, such as Informatica Data Quality or Talend Data Quality, but I do not know whether these tools also offer transformation modules for RDF. A good alternative for data cleansing is Google Refine [1] with its RDF extension [2].

2. Load the data as usual into the triplestore and do the quality checks and data cleansing there: In this option, you could use our DQ constraints library [3-5] to spot DQ problems, with the advantage that you can store your DQ requirements as SPIN [8] constraints, so that you only have to define them once. Once you have identified the problems, you can remove them manually via SPARQL/Update queries or directly, e.g. via the GUI in TopBraid Composer (see the sketch below). Soon you will also be able to store data cleansing rules with the DQM ontology and use them via SPARQL/Update queries to automate data cleansing [6-7].
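To make option 2 a bit more concrete, here is a minimal sketch of both steps. It is not taken from the DQ constraints library; the ex:Hotel class, the ex:starRating property, and the value range are made up for illustration. The quality requirement is attached to a class as a SPIN constraint (an ASK query that signals a violation whenever it returns true for an instance bound to ?this), and a SPARQL/Update query with the same condition later removes the offending triples:

  # Hypothetical quality requirement: a hotel's star rating must lie
  # between 0 and 5. SPIN binds ?this to each instance of ex:Hotel.
  @prefix spin: <http://spinrdf.org/spin#> .
  @prefix sp:   <http://spinrdf.org/sp#> .
  @prefix ex:   <http://example.org/tourism#> .

  ex:Hotel
    spin:constraint [
      a sp:Ask ;
      sp:text """ASK WHERE {
        ?this ex:starRating ?rating .
        FILTER (?rating < 0 || ?rating > 5)
      }"""
    ] .

  # After reviewing the reported violations, the same condition can
  # drive a SPARQL/Update query that deletes the offending triples.
  PREFIX ex: <http://example.org/tourism#>
  DELETE { ?hotel ex:starRating ?rating }
  WHERE  {
    ?hotel a ex:Hotel ;
           ex:starRating ?rating .
    FILTER (?rating < 0 || ?rating > 5)
  }

The query text inside sp:text would of course have to use the namespace prefixes of your own ontology.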
IMO, a holistic data quality management approach should consider both alternatives: data cleansing during data acquisition and data quality monitoring of the stored data (unless you can make sure that you only load high-quality data into your triplestore). The data quality management ontology has the advantage that data quality requirements and data cleansing rules are explicitly represented in RDF, so that SPARQL queries can do the job in an automated way.

Regarding complexity: I think we can actually reduce complexity with the DQM ontology compared to approaches that hide data quality requirements and data cleansing rules in code.

Cheers,
Christian

[1] Google Refine, http://code.google.com/p/google-refine/
[2] RDF extension for Google Refine, http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/
[3] DQ constraints library, http://semwebquality.org/ontologies/dq-constraints.rdf
[4] DQ constraints library documentation, http://semwebquality.org/ontologies/dq-constraints.html
[5] DQ constraints library primer, http://semwebquality.org/documentation/primer/20101124/index.html
[6] LWDM paper "Towards a Vocabulary for Data Quality Management in Semantic Web Architectures", http://www.heppnetz.de/files/dataquality-vocab-lwdm2011.pdf
[7] Presentation of the paper in [6], http://slidesha.re/hul4GV
[8] SPIN, http://spinrdf.org

From: Marco Fossati [mailto:fossati@fbk.eu]
Sent: Monday, 11 April 2011 12:52
To: Christian Fuerber
Cc: kidehen@openlinksw.com; public-lod@w3.org; semantic-web@w3.org
Subject: Re: How To Do Deal with the Subjective Issue of Data Quality?

Hi Christian and everyone,

When working with strongly heterogeneous data, quality is fundamental: it lets us exploit the whole potential of the data. As I am working with data coming from very local and domain-specific realities (e.g. the tourism portal of a small geographical area), I would like to stress real-world usage and applications. With such a focus in mind, in my opinion there are two ways to leverage quality:

1. Persuading the data publishers we are dealing with to expose better-quality data, because they can benefit from it;

2. Restructuring the data on our own, by trying to find the rules that fix the problems coming from the data sources (i.e. the problems created by the data publishers).

At the moment, I am discarding the first point, as it seems to demand much more effort than the second one. In practice, telling someone who is not familiar with Semantic Web technologies that they have to expose their data in RDF because they can earn some money in the short term is quite a complex task. Therefore, the creation of a data quality management ontology is very interesting, even if I fear that it could add complexity to an already complex issue.

In conclusion, the main question is: how could we write data quality constraints (via an implementation of http://semwebquality.org/documentation/primer/20101124/index.html, for example) for transforming data that generally comes from non-RDF formats (CSV, XML, microformat-annotated web pages)?

Cheers,
Marco

FBK Web of Data unit
http://fbk.eu
http://wed.fbk.eu/

On 4/7/11 10:34 PM, Christian Fuerber wrote:

Hi Kingsley,

IMO, data quality is the degree to which data fulfills quality requirements. As you said, quality requirements are subjective and can therefore be very heterogeneous and contradictory, even in closed settings. In my eyes, the most effective way to get a handle on data quality is to explicitly represent, manage, and share quality requirements. This way, we can agree and disagree about them while always being able to view each other's quality assumptions. This is particularly important when making statements about the quality of a data source or ontology.

Therefore, we have started to create a data quality management ontology which shall facilitate the representation and publication of quality requirements in RDF [1]. An overview presentation of what we could do with such an ontology is available at [2]. As soon as we have a stable version, we plan to publish it at http://semwebquality.org for the public.

However, no matter how hard we try to establish a high level of data quality, I believe that it is almost impossible to achieve 100%, especially due to the heterogeneous requirements. But we should try to approximate and maintain a high level.

Please let me know what you think about our approach.

[1] http://www.heppnetz.de/files/dataquality-vocab-lwdm2011.pdf
[2] http://www.slideshare.net/cfuerber/towards-a-vocabulary-for-data-quality-management-in-semantic-web-architectures

Cheers,
Christian

------------------------------------------
Dipl.-Kfm. Christian Fürber
Professur für Allgemeine BWL, insbesondere E-Business
e-business & web science research group
Universität der Bundeswehr München
e-mail: c.fuerber@unibw.de
www: http://www.unibw.de/ebusiness/
skype: c.fuerber
twitter: cfuerber
Received on Monday, 11 April 2011 18:23:11 UTC