
Re: 15 Ways to Think About Data Quality (Just for a Start)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Tue, 12 Apr 2011 08:58:06 -0400
Message-ID: <4DA44C5E.8060608@openlinksw.com>
To: glenn mcdonald <gmcdonald@furia.com>
CC: "public-lod@w3.org" <public-lod@w3.org>

On 4/8/11 9:10 PM, glenn mcdonald wrote:
> I don't think data quality is an amorphous, aesthetic, hopelessly 
> subjective topic. Data "beauty" might be subjective, and the same data 
> may have different applicability to different tasks, but there are a 
> lot of obvious and straightforward ways of thinking about the quality 
> of a dataset independent of the particular preferences of individual 
> beholders. Here are just some of them:

Glenn,

I (and others) have no issue with data quality; we just understand
(first hand) that when you have masses of data from disparate sources,
you discuss and iterate your way to subjective sanity via constructive
feedback loops. Summarily conflating source data quality with data
access and presentation oriented tools is simply wrong. We all care
about data quality, but nothing in the world nullifies the fact that
"quality" is subjective. Is Excel rendered useless because a list of
countries with obvious errors was presented in the spreadsheet? To an
audience of Spreadsheet developers (programmers making a Spreadsheet
product) that's irrelevant; to the accounts or marketing department of
Spreadsheet product customers (actual users doing their jobs) that's
important, but it has nothing to do with the Spreadsheet product itself.
The same analogy applies to any DBMS product. You have to separate the
parts: that is the message I keep on trying to relay to you, i.e., stop
conflating matters in an unnecessarily disruptive way.

Back to data quality discussion:

Subjectively low quality data can lead to subjectively higher quality 
data. Without data all you have is an empty space. Using any form of 
"all or nothing" proposition in a subjective realm is fatally flawed.

How would you address data quality issues in situations where data
producers, data shape, data consumers, and data presentation tools are
all loosely coupled? Bearing in mind your issues with DBpedia and other
datasets from the LOD cloud, are contributions of quality data from you
out of the question re. a virtuous cycle that's oriented towards
subjectively improved quality?

I've already made it clear to you that DBpedia contributions are
welcome; they trump griping any day, and you would actually be quite
surprised as to what kind of discourse clarity said contributions would
unveil. Thus, why don't you call my bluff by producing and sharing a
"data quality" linkset for the LOD cloud?

Note that the FAO, SUMO, Yago, UMBEL, and OpenCyc communities have all contributed
data to the LOD cloud that enable application of their context lenses to 
linked open data spaces like DBpedia. I spend a lot of time behind the 
scenes working with a variety of people on the very subject of data 
quality, linkset partitioning via named graphs, and conditional 
application of inference contexts via the combination of rules and 
reasoners. Unfortunately, you are so bent on obliterating the start of 
conversations that you don't even recognize different routes to the same 
destination.

As for reconciling a common Referent for multiple Identifiers in a
Linked Data space comprised of 21 Billion+ triples, let's take a look at
the subject: Michael Jackson.

1. 
http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson 
-- basic description of 'Michael Jackson' from DBpedia

2. 
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson 
-- list of source named graphs in the host DBMS

3. 
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=2 
-- list of named graphs with triples that reference this subject

4. 
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=3 
-- explicit owl:sameAs relations across the entire DBMS (clicking on 
each Identifier will unveil the description graph for the Referent of 
said Identifier)

5. 
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=4 
-- use of an InverseFunctionalProperty based rule to generate a fuzzy 
list of Identifiers that potentially share the same Referent (click on 
each link as per prior step)

6. 
http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&sas=yes 
-- inference context enhanced description of 'Michael Jackson' (this is
a union expansion of all properties across all Identifiers in an
owl:sameAs relation with the DBpedia Entity, hence the use of paging to
handle result set size.)

7. 
http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&sas=yes&p=6&lp=7&op=4&prev=&gp=6  
-- page 5 of 8 re. enhanced description of 'Michael Jackson'.
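
The links in steps 1-6 all follow the same pattern: the entity URI is percent-encoded and passed as a query parameter. A minimal sketch of that pattern follows. The helper names `describe_url` and `usage_url` are made up for illustration, and the parameter names (`uri`, `g`, `tp`, `sas`) are simply read off the links above rather than taken from any documented API.

```python
from urllib.parse import quote

# Host taken from the links in the list above.
BASE = "http://lod.openlinksw.com"

def describe_url(uri, sas=False):
    """Build a /describe/ link for an entity URI (hypothetical helper).
    sas=True appends the sas=yes flag seen in step 6."""
    url = f"{BASE}/describe/?uri={quote(uri, safe='')}"
    return url + "&sas=yes" if sas else url

def usage_url(uri, tp=None):
    """Build a usage.vsp link (hypothetical helper); tp selects the tab
    shown in steps 3-5, and is omitted for step 2."""
    url = f"{BASE}/fct/rdfdesc/usage.vsp?g={quote(uri, safe='')}"
    return url if tp is None else f"{url}&tp={tp}"

MJ = "http://dbpedia.org/resource/Michael_Jackson"
print(describe_url(MJ))            # step 1
print(usage_url(MJ, tp=3))         # step 4
print(describe_url(MJ, sas=True))  # step 6
```

Passing `safe=''` to `quote` matters here, since the default leaves `/` unencoded and the links above encode it as `%2F`.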

Steps 1-7 can provide many insights about data that aid subjective
quality fixes via simple protocols such as the consumer notifying the
publisher. In the very worst of cases (agreeing to disagree) the consumer
makes a linkset and passes it on to the producer, and the producer
reciprocates by uploading the linkset to a named graph and publishing a
named rule, such that when the consumer next visits they are able to apply
their subjective "context lenses" to the data via inference rules. All
of this happens without imposing 'world views' on any other consumers of
the data, whose needs may vary, subjectively.
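
The idea of keeping a contributed linkset in its own named graph, and applying it only when a consumer asks for it, can be sketched in a few lines of plain Python. This is a toy in-memory model under stated assumptions, not how any particular DBMS implements it: the graph names, the `ex:` identifiers, and the functions `same_as_closure` and `describe` are all invented for illustration.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"

# Named graphs: graph name -> set of (subject, predicate, object) triples.
# Every identifier and value here is made up for illustration.
graphs = {
    "urn:graph:publisher": {
        ("ex:MJ_a", "ex:name", "Michael Jackson"),
        ("ex:MJ_b", "ex:birthYear", "1958"),
    },
    # Linkset contributed by a consumer, kept apart in its own graph.
    "urn:graph:consumer-linkset": {
        ("ex:MJ_a", SAME_AS, "ex:MJ_b"),
    },
}

def same_as_closure(start, linkset):
    """Return every identifier reachable from `start` via owl:sameAs,
    treating the relation as symmetric and transitive."""
    neighbours = defaultdict(set)
    for s, p, o in linkset:
        if p == SAME_AS:
            neighbours[s].add(o)
            neighbours[o].add(s)
    seen, todo = {start}, [start]
    while todo:
        for nxt in neighbours[todo.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def describe(subject, data_graph, lens_graph=None):
    """Collect (predicate, object) pairs for `subject`. The sameAs
    expansion happens only if the caller names a linkset graph."""
    ids = {subject}
    if lens_graph is not None:
        ids = same_as_closure(subject, graphs[lens_graph])
    return {(p, o) for s, p, o in graphs[data_graph] if s in ids}

plain = describe("ex:MJ_a", "urn:graph:publisher")
lensed = describe("ex:MJ_a", "urn:graph:publisher",
                  "urn:graph:consumer-linkset")
```

Here `plain` holds only the name, while `lensed` also picks up the birth year through the linkset. A consumer who never names the linkset graph sees the publisher's data unchanged.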

The process I outline above is something we do regularly re. the 
datasets hosted in the public instances we oversee. It's why we actually
have a number of demo rules etc.
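
For illustration only, a rule of the kind mentioned in step 5 (an InverseFunctionalProperty used to suggest Identifiers that may share a Referent) could be sketched as below. This is not one of the demo rules referred to above. The function name `ifp_candidates` and the sample triples are invented, and the result is a list of candidates to review, not asserted owl:sameAs links.

```python
from collections import defaultdict

def ifp_candidates(triples, ifp_predicates):
    """Group subjects that share a value for any predicate treated as
    inverse functional; each group is a candidate co-reference set."""
    by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifp_predicates:
            by_value[(p, o)].add(s)
    # Only values shared by more than one subject are of interest.
    return [subjects for subjects in by_value.values() if len(subjects) > 1]

# Made-up sample data; foaf:homepage is treated as inverse functional.
triples = [
    ("ex:a", "foaf:homepage", "http://example.org/mj"),
    ("ex:b", "foaf:homepage", "http://example.org/mj"),
    ("ex:c", "foaf:homepage", "http://example.org/other"),
    ("ex:a", "ex:label", "MJ"),
]
groups = ifp_candidates(triples, {"foaf:homepage"})
```

With this data `groups` contains a single set holding `ex:a` and `ex:b`. The list is "fuzzy" in the sense used above, because a shared value in dirty data does not guarantee a shared Referent.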

Accepting the complexity of subjectivity when audience diversity is
integral to a system != ignoring or dismissing the value of data
quality. I just also happen to have hands-on experience dealing with this
problem and its inherent subjectivity.

To conclude, your quality factors aren't invalid; the real challenge and
question for you is this: how do you cater for this at InterWeb scale,
bearing in mind audience heterogeneity?

-- 

Regards,

Kingsley Idehen	
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Tuesday, 12 April 2011 12:58:29 UTC
