Re: 15 Ways to Think About Data Quality (Just for a Start) from Kingsley Idehen on 2011-04-12 (public-lod@w3.org from April 2011)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Tue, 12 Apr 2011 08:58:06 -0400
To: glenn mcdonald <gmcdonald@furia.com>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <4DA44C5E.8060608@openlinksw.com>

On 4/8/11 9:10 PM, glenn mcdonald wrote:
> I don't think data quality is an amorphous, aesthetic, hopelessly
> subjective topic. Data "beauty" might be subjective, and the same data
> may have different applicability to different tasks, but there are a
> lot of obvious and straightforward ways of thinking about the quality
> of a dataset independent of the particular preferences of individual
> beholders. Here are just some of them:
Glenn,

I (and others) have no issue with data quality. We just understand
(first hand) that when you have masses of data from disparate sources,
you discuss and iterate your way to subjective sanity via constructive
feedback loops. Summarily conflating source data quality with data
access and presentation oriented tools is simply wrong. We all care
about data quality, but nothing in the world nullifies the fact that
"quality" is subjective. Is Excel rendered useless because a list of
countries with obvious errors was presented in a spreadsheet? To an
audience of Spreadsheet developers (programmers making a Spreadsheet
product) that's irrelevant; to the accounts or marketing department of
Spreadsheet product customers (actual users doing their jobs) it's
important, but it has nothing to do with the Spreadsheet product itself.
The same analogy applies to any DBMS product. The message I keep
trying to relay to you is that you have to separate the parts, i.e., stop
conflating matters in an unnecessarily disruptive way.

Back to the data quality discussion:

Subjectively low quality data can lead to subjectively higher quality
data. Without data all you have is an empty space. Using any form of
"all or nothing" proposition in a subjective realm is fatally flawed.

How would you address data quality issues in situations where data
producers, data shape, data consumers, and data presentation tools are
all loosely coupled? Bearing in mind your issues with DBpedia and other
datasets from the LOD cloud, are contributions of quality data from you
out of the question re. a virtuous cycle that's oriented towards
subjectively improved quality?

I've already made it clear to you that DBpedia contributions are
welcome. They trump griping any day, and you would actually be quite
surprised as to what kind of discourse clarity said contributions would
unveil. Thus, why don't you call my bluff by producing and sharing a
"data quality" linkset for the LOD cloud?

Note that the FAO, SUMO, Yago, UMBEL, and OpenCyc communities have all
contributed data to the LOD cloud that enable application of their context lenses to
linked open data spaces like DBpedia. I spend a lot of time behind the
scenes working with a variety of people on the very subject of data
quality, linkset partitioning via named graphs, and conditional
application of inference contexts via the combination of rules and
reasoners. Unfortunately, you are so bent on obliterating the start of
conversations that you don't even recognize different routes to the same
destination.

As for reconciling a common Referent for multiple Identifiers in a
Linked Data space comprised of 21 Billion+ triples, let's take a look at
the subject: Michael Jackson.

1.
http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson
-- basic description of 'Michael Jackson' from DBpedia

2.
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson
-- list of source named graphs in the host DBMS

3.
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=2
-- list of named graphs with triples that reference this subject

4.
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=3
-- explicit owl:sameAs relations across the entire DBMS (clicking on
each Identifier will unveil the description graph for the Referent of
said Identifier)

5.
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=4
-- use of an InverseFunctionalProperty based rule to generate a fuzzy
list of Identifiers that potentially share the same Referent (click on
each link as per prior step)

6.
http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&sas=yes
-- inference context enhanced description of 'Michael Jackson' (this is
a union expansion of all properties across all Identifiers in an
owl:sameAs relation with the DBpedia Entity, hence the use of paging to
handle the result set size)

7.
http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&sas=yes&p=6&lp=7&op=4&prev=&gp=6
-- page 5 of 8 of the enhanced description of 'Michael Jackson'
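
For anyone who wants to repeat these steps for a different subject, here is a minimal sketch of how the links above appear to be put together: the entity URI is percent-encoded and passed as a query parameter. The host and the parameter names (uri, g, tp, sas) are copied from the links in this message. The helper names describe_url and usage_url are hypothetical, and nothing here is an official client for the service.

```python
from urllib.parse import quote

# Host copied from the links in this message.
BASE = "http://lod.openlinksw.com"


def describe_url(entity_uri, same_as=False):
    """Build a /describe/ link (steps 1 and 6). Hypothetical helper."""
    url = f"{BASE}/describe/?uri={quote(entity_uri, safe='')}"
    if same_as:
        # sas=yes is the flag seen in step 6 (owl:sameAs expansion).
        url += "&sas=yes"
    return url


def usage_url(entity_uri, tp=None):
    """Build a usage.vsp link (steps 2-5). Hypothetical helper.

    tp is omitted in step 2 and takes the values 2, 3, 4 in steps 3-5.
    """
    url = f"{BASE}/fct/rdfdesc/usage.vsp?g={quote(entity_uri, safe='')}"
    if tp is not None:
        url += f"&tp={tp}"
    return url


mj = "http://dbpedia.org/resource/Michael_Jackson"
print(describe_url(mj))                # step 1
print(usage_url(mj, tp=3))             # step 4
print(describe_url(mj, same_as=True))  # step 6
```

Passing safe='' to quote matters here, because the default leaves "/" unencoded and the links above encode it as %2F.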

Steps 1-7 can provide many insights about data that aid subjective
quality fixes via simple protocols such as the consumer notifying the
publisher. In the very worst of cases (agreeing to disagree) the consumer
makes a linkset and passes it on to the producer, and the producer
reciprocates by uploading the linkset to a named graph and publishing a
named rule, such that when the consumer next visits they are able to apply
their subjective "context lenses" to the data via inference rules. All
of this happens without imposing 'world views' on any other consumers of
the data whose needs may vary, subjectively.
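
To make the linkset idea concrete, here is a toy in-memory sketch (plain Python, not how any particular DBMS implements it). A consumer's owl:sameAs links sit in their own named graph, and the union expansion of properties is applied only when a consumer opts in to that graph. All graph names, identifiers, and property values below are made up for illustration.

```python
from collections import defaultdict

OWL_SAME_AS = "owl:sameAs"

# Quads are (graph, subject, predicate, object). All values are illustrative.
quads = [
    ("urn:graph:dbpedia", "dbpedia:Michael_Jackson", "rdfs:label", "Michael Jackson"),
    ("urn:graph:other", "ex:MJ", "ex:genre", "Pop"),
    # A consumer-contributed linkset, kept apart in its own named graph.
    ("urn:graph:linkset:consumer-a", "dbpedia:Michael_Jackson", OWL_SAME_AS, "ex:MJ"),
]


def same_as_closure(start, linkset_graphs):
    """Identifiers reachable from start via owl:sameAs in the chosen graphs."""
    neighbours = defaultdict(set)
    for graph, s, p, o in quads:
        if p == OWL_SAME_AS and graph in linkset_graphs:
            neighbours[s].add(o)
            neighbours[o].add(s)  # treat sameAs as symmetric
    seen, todo = {start}, [start]
    while todo:
        for nxt in neighbours[todo.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def describe(identifier, linkset_graphs=()):
    """Union of properties across every identifier in the sameAs closure."""
    ids = same_as_closure(identifier, set(linkset_graphs))
    return sorted(
        {(p, o) for _, s, p, o in quads if s in ids and p != OWL_SAME_AS}
    )


# Without the lens: only the properties asserted on the DBpedia identifier.
print(describe("dbpedia:Michael_Jackson"))
# With consumer A's lens: properties from both identifiers are merged.
print(describe("dbpedia:Michael_Jackson", ["urn:graph:linkset:consumer-a"]))
```

Because the linkset graph is passed in per call, a consumer who does not name it sees the data exactly as before, which is the "no imposed world view" property described above.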

The process I outline above is something we do regularly re. the
datasets hosted in the public instances we oversee. That's why we actually
have a number of demo rules etc.

Accepting the complexity of subjectivity when audience diversity is
integral to a system != ignoring or dismissing the value of data
quality. I just also happen to have hands-on experience dealing with this
problem and its inherent subjectivity.

To conclude, your quality factors aren't invalid. The real challenge and
question for you is this: how do you cater for this at InterWeb scale,
bearing in mind audience heterogeneity?

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Received on Tuesday, 12 April 2011 12:58:29 UTC