Re: How To Do Deal with the Subjective Issue of Data Quality? from Kingsley Idehen on 2011-04-11 (public-lod@w3.org from April 2011)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Mon, 11 Apr 2011 07:39:43 -0400
To: Marco Fossati <fossati@fbk.eu>
CC: public-lod@w3.org, semantic-web@w3.org
Message-ID: <4DA2E87F.3090603@openlinksw.com>
On 4/11/11 6:51 AM, Marco Fossati wrote:
> Hi Christian and everyone,
>
> When working with strongly heterogeneous data, quality is fundamental:
> it lets us exploit the full potential of the data.
> As I am working with data coming from very local and domain-specific 
> realities (e.g. the tourism portal of a small geographical area), I 
> would like to stress real world usage and applications. With such a 
> focus in mind, in my opinion there are two ways to leverage quality:
>
>    1. Persuading the data publishers we are dealing with to expose
>       better quality data, because they can benefit from it;
>    2. Restructuring the data ourselves, by trying to find
>       out the rules that should fix the problems coming from data
>       sources (i.e. the problems created by the data publishers).
>
> At the moment, I am discarding the first point, as it seems to demand
> much more effort than the second one. In practice, telling someone who
> is not familiar with Semantic Web technologies that they have to expose
> their data in RDF because they can earn some money in the short term is
> quite a complex task.

If you put "RDF" in the conversation you only increase the mental mirage 
quotient of your quest re. monetary value of published data. It's much 
easier if you talk about Linked Data as hyperlinks between disparate 
data items across the Web that deliver perpetual enrichment via network 
effects of the InterWeb. There are many simple lookup examples that can
simplify the value prop. of Linked Data, esp. since you have
"Wikipedia as a Database" as an easy-to-demonstrate option via DBpedia.

> Therefore, the creation of a data quality management ontology is very 
> interesting, even if I fear that it could add complexity to an already 
> complex issue.

If we connect rather than disconnect with our target audiences by 
understanding their terminology first, we'll be more inclined to make
links between our terminology and that of the value prop. target. This
is a zillion times better than reciting mantras that boil down to: our 
terminology or nothing.

> In conclusion, the main question is: how could we write data quality 
> constraints (via an implementation of 
> http://semwebquality.org/documentation/primer/20101124/index.html for 
> example) for transforming data generally coming from non-RDF formats 
> (CSV, XML, microformats-annotated web pages)?

Like most things amongst cognitive entities, we ultimately have to put 
conversations about data into the data itself. Beyond that, it just 
boils down to the subjective needs of the data beholder or consumer.
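
As a rough illustration of the question above, here is a minimal Python sketch of quality requirements written down as data and checked against CSV input before any transformation. Everything in it is an assumption made for illustration: the column names, the two rules, the sample rows, and the helper names `is_float` and `find_violations`. It does not use the semwebquality.org vocabulary; it only shows the general idea of keeping the requirements explicit so that each consumer can swap in their own.

```python
import csv
import io


def is_float(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# Hypothetical quality requirements, declared as data rather than buried in code.
# Column names and rules are illustrative assumptions, not taken from a real source.
REQUIREMENTS = [
    {"id": "name-required", "column": "name",
     "check": lambda v: v.strip() != ""},
    {"id": "lat-in-range", "column": "lat",
     "check": lambda v: is_float(v) and -90 <= float(v) <= 90},
]


def find_violations(csv_text, requirements=REQUIREMENTS):
    """Return one record per (row, requirement) pair that fails its check."""
    violations = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for req in requirements:
            value = row.get(req["column"]) or ""
            if not req["check"](value):
                violations.append({
                    "row": row_number,
                    "requirement": req["id"],
                    "value": value,
                })
    return violations


# Made-up sample: row 2 has no name, row 3 has an impossible latitude.
SAMPLE = "name,lat\nHotel Alpino,46.07\n,46.10\nRifugio Nord,460.7\n"

if __name__ == "__main__":
    for violation in find_violations(SAMPLE):
        print(violation)
```

The violation records are plain dictionaries, so they could be published next to the transformed data instead of being thrown away, which keeps the quality judgement visible to whoever consumes the data.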


Kingsley
>
> Cheers,
>
> Marco
> FBK Web of Data unit
> http://fbk.eu
> http://wed.fbk.eu/
>
> On 4/7/11 10:34 PM, Christian Fuerber wrote:
>> Hi Kingsley,
>> IMO, data quality is the degree to which data fulfills quality requirements.
>> As you said, quality requirements are subjective  and, therefore, can be
>> very heterogeneous and contradictory, even in closed settings. In my eyes,
>> the most effective way to get a hand on data quality is to explicitly
>> represent, manage, and share quality requirements. This way, we can agree
>> and disagree about them while we can always view each other's quality
>> assumptions. This is particularly important, when making statements about
>> the quality of a data source or ontology.
>>
>> Therefore, we have started to create a data quality management ontology
>> which shall facilitate the representation and publication of quality
>> requirements in RDF [1]. An overview presentation of what we could do with such
>> an ontology is available at [2]. As soon as we have a stable version, we
>> plan to publish it at http://semwebquality.org for the public.
>>
>> However, no matter how hard we are trying to establish a high level of data
>> quality, I believe that it is almost impossible to achieve 100 %, especially
>> due to the heterogeneous requirements. But we should try to approximate and
>> keep up a high level.
>>
>> Please, let me know what you think about our approach.
>>
>> [1] http://www.heppnetz.de/files/dataquality-vocab-lwdm2011.pdf
>> [2] http://www.slideshare.net/cfuerber/towards-a-vocabulary-for-data-quality-management-in-semantic-web-architectures
>>
>> Cheers,
>> Christian
>>
>> ------------------------------------------
>> Dipl.-Kfm. Christian Fürber
>> Professur für Allgemeine BWL, insbesondere E-Business
>> e-business & web science research group
>> Universität der Bundeswehr München
>>
>> e-mail: c.fuerber@unibw.de
>> www: http://www.unibw.de/ebusiness/
>> skype: c.fuerber
>> twitter: cfuerber
>>
>>
>>


-- 

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Monday, 11 April 2011 11:42:31 UTC