
Re: 15 Ways to Think About Data Quality (Just for a Start)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Tue, 12 Apr 2011 09:48:10 -0400
Message-ID: <4DA4581A.9010908@openlinksw.com>
To: glenn mcdonald <glenn@furia.com>
CC: Deborah MacPherson <debmacp@gmail.com>, "public-lod@w3.org" <public-lod@w3.org>
On 4/12/11 9:33 AM, glenn mcdonald wrote:
>     As part of conversations about data, you do need to be able to see
>     the "subjectively" bad to make it "subjectively" good. What you
>     can't do (which is what Glenn does repeatedly) is conflate the
>     tools that actually enable you to see the subjectively "good, bad,
>     or ugly" with said data.
> I'm a tool developer with "first hand" experience, as you put it, too. 
> I'm not conflating the tools and the data. But the complete data 
> experience is the product of the tools and the data.

But who ever told you, or implied to you, that any LOD demo is about 
the "Complete Linked Data Experience", let alone the "Complete Data 
Experience"? Who even knows, emphatically, what the so-called "Complete 
Data Experience" actually is? That's as subjective a statement as I've 
ever heard. It's the very line that continues to separate us.

I might have my own perception of the aforementioned experience, but I 
have no business enforcing that on anyone else; it's just my world view, 
end of story.

Thus, I hold my position re. your subjective conflation of matters.

When people publish demos of their products, they aren't publishing the 
demos for "your world view"; they are publishing them from theirs, first. 
Of course, bearing in mind our similarities and disparities as cognitive 
beings, there is varied potential for intersection of world views, i.e., 
fusion. Naturally, fusion can occur with varying degrees of friction.

>      Is Excel rendered useless because a list of countries with
>     obvious errors was presented in the spreadsheet? To an audience of
>     Spreadsheet developers (programmers making a Spreadsheet product)
>     that's irrelevant
> That attitude is how Excel ended up with essentially no real 
> data-cleaning tools, which is pathetic.

And your comments once again reflect the issues I have with your 

Excel the pathetic dominates the world of spreadsheets. Nuff said. Did 
you write an alternative? Why isn't the world using your alternative, if 
such a thing exists? Bearing in mind the huge market share of Excel, why 
are you overlooking the massive opportunity to clean up via your superior 

> The job of data tools is to mediate between people and computers, and 
> thus helping people identify and understand and fix and improve data 
> is just as much the tools' (and tool developers') responsibility as 
> showing you a list of entity URIs.

What is a Data Tool? Again, 100% subjective. Some people might think of 
Excel as a Data Tool; others see it as something completely different.

> The list of data-quality metrics is also effectively a data-tool task 
> list.
>     this is why my demos are oriented towards enabling the beholder
>     to disambiguate his/her/its quest via filtering applied to entity
>     types and other properties.
> Which is what I was talking about in Boundedness: does the data have 
> the properties you need to extract the subset you want. E.g., Danny 
> Ayers yesterday was trying to make a SPARQL query for Wordnet that 
> found the planets in the solar system that aren't named after Roman 
> gods. But neither he nor I could find any way in the data to 
> distinguish actual planets in the list of solar bodies, so we couldn't 
> quite make it right.

And did you post a callout here, on Twitter, or anywhere else for other 
folks to chime in?

> That was a data problem, not a tool problem. But the difficulty of 
> figuring this out, /using/ the tools, was a tool problem.

But the tools (or your activity) unveiled a critical problem aligned to 
your specific goals. That's subjectively bad data laying the foundation 
for subjectively improved data. All you need to do is open up a conversation 
that eventually results in a linkset that fixes the problem and delivers 
the "context lenses" you seek. This is a common and expected issue re. 
Linked Data at any scale, beyond your personal computer or personally 
curated data space.
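
As a rough illustration of the linkset idea (a minimal sketch, not the actual WordNet data or any real SPARQL endpoint): the body names, the "ex:Planet" type, and the function name below are all hypothetical, chosen only to show how a separately published set of type statements could supply the missing "planet" distinction and make the filter possible.

```python
# Hypothetical corpus: a flat list of solar bodies with nothing that
# distinguishes planets from moons or dwarf planets.
SOLAR_BODIES = [
    "Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn",
    "Uranus", "Neptune", "Moon", "Ceres", "Titan",
]

# Hypothetical linkset: extra (subject, predicate, object) statements,
# published separately, that add the missing type information.
LINKSET = {
    (name, "rdf:type", "ex:Planet")
    for name in ["Mercury", "Venus", "Earth", "Mars",
                 "Jupiter", "Saturn", "Uranus", "Neptune"]
}

# Names of Roman deities used for the comparison (illustrative subset).
ROMAN_GODS = {"Mercury", "Venus", "Mars", "Jupiter", "Saturn",
              "Neptune", "Ceres"}


def planets_not_named_after_roman_gods(bodies, linkset, roman_gods):
    """Keep only bodies typed as planets by the linkset, then drop
    those whose names match a Roman deity."""
    planets = [b for b in bodies if (b, "rdf:type", "ex:Planet") in linkset]
    return [p for p in planets if p not in roman_gods]


if __name__ == "__main__":
    print(planets_not_named_after_roman_gods(SOLAR_BODIES, LINKSET, ROMAN_GODS))
```

Without the linkset there is nothing to filter on, so "Moon" and "Titan" cannot be told apart from the planets; with it, the subset can be extracted.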

> But of the 17 other qualities on my list + Dave's additions, at least 
> 15 of them directly bear on the feasibility of using filtering to 
> extract a good subset out of a flawed corpus.

In my world: knowledge starts by discovering what you don't know. The same 
rule applies to data quality: you have to find the broken data before 
you can fix it. Don't take issue with the mechanism that helps you find the 
broken data. Of course, take issue if there isn't a feedback loop or the 
loop is clogged with intransigence etc. Neither is the case in the 
Linked Data realms of interest to me.



Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Tuesday, 12 April 2011 13:48:34 UTC
