Re: 15 Ways to Think About Data Quality (Just for a Start) from glenn mcdonald on 2011-04-12 (public-lod@w3.org from April 2011)

From: glenn mcdonald <glenn@furia.com>
Date: Tue, 12 Apr 2011 09:33:05 -0400
To: Kingsley Idehen <kidehen@openlinksw.com>
Cc: Deborah MacPherson <debmacp@gmail.com>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <BANLkTi=7zinGTUv7O3WTxQ3kxH5C1N4Svw@mail.gmail.com>

>
> As part of conversations about data, you do need to able to see the
> "subjectively" bad to make it "subjectively" good. What you can't do (which
> is what Glenn does repeatedly) is conflate the tools that actually enable
> you see the subjectively "good, bad, or ugly" with said data.
>

I'm a tool developer with "first hand" experience, as you put it, too. I'm
not conflating the tools and the data. But the complete data experience is
the product of the tools and the data.

> Is Excel rendered useless because a list of countries with obvious errors
> was presented in the spreadsheet? To an audience of Spreadsheet developers
> (programmers making a Spreadsheet product) that's irrelevant


That attitude is how Excel ended up with essentially no real data-cleaning
tools, which is pathetic. The job of data tools is to mediate between people
and computers, and thus helping people identify and understand and fix and
improve data is just as much the tools' (and tool developers')
responsibility as showing you a list of entity URIs. The list of
data-quality metrics is also effectively a data-tool task list.

> this is why my demos are oriented towards enabling the beholder disambiguate
> his/her/its quest via filtering applied to entity types and other
> properties.


Which is what I was talking about in Boundedness: does the data have the
properties you need to extract the subset you want. E.g., Danny Ayers
yesterday was trying to make a SPARQL query for Wordnet that found the
planets in the solar system that aren't named after Roman gods. But neither
he nor I could find any way in the data to distinguish actual planets in the
list of solar bodies, so we couldn't quite make it right. That was a data
problem, not a tool problem. But the difficulty of figuring this out, *using*
the tools, was a tool problem.
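
A minimal sketch of that boundedness check, in plain Python rather than
SPARQL and on made-up records (not the actual Wordnet data). The record
layout, the "is_planet" property and the function name are all hypothetical,
chosen only to illustrate the point: a subset can be filtered out only if
some property in the data actually separates it from everything else.

```python
# Hypothetical toy records; these are NOT taken from the Wordnet dataset.
BODIES = [
    {"name": "Mars", "kind": "solar body"},
    {"name": "Earth", "kind": "solar body"},
    {"name": "Ceres", "kind": "solar body"},
    {"name": "Halley", "kind": "solar body"},
]

PLANETS = ["Mars", "Earth"]


def separating_properties(records, wanted_names):
    """Return the (property, value) pairs that select exactly the wanted records."""
    wanted = set(wanted_names)
    # Collect every property/value pair that occurs anywhere in the data.
    pairs = {(k, v) for r in records for k, v in r.items() if k != "name"}
    result = []
    for key, value in sorted(pairs, key=repr):
        selected = {r["name"] for r in records if r.get(key) == value}
        if selected == wanted:
            result.append((key, value))
    return result


# Every record carries the same "kind", so nothing isolates the planets.
unbounded = separating_properties(BODIES, PLANETS)

# Add a hypothetical "is_planet" property and the subset becomes extractable.
ENRICHED = [dict(r, is_planet=(r["name"] in PLANETS)) for r in BODIES]
bounded = separating_properties(ENRICHED, PLANETS)

print(unbounded)
print(bounded)
```

An empty result for the first call is the data problem: no filter can work,
whatever the tool. Surfacing that emptiness quickly, instead of leaving the
user to discover it by trial and error, is the tool's side of the job.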

But of the 17 other qualities on my list + Dave's additions, at least 15 of
them directly bear on the feasibility of using filtering to extract a good
subset out of a flawed corpus.

Received on Tuesday, 12 April 2011 13:33:53 UTC