
Re: 15 Ways to Think About Data Quality (Just for a Start)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Tue, 12 Apr 2011 07:48:58 -0400
Message-ID: <4DA43C2A.6020305@openlinksw.com>
To: Deborah MacPherson <debmacp@gmail.com>
CC: glenn mcdonald <gmcdonald@furia.com>, "public-lod@w3.org" <public-lod@w3.org>
On 4/11/11 10:01 PM, Deborah MacPherson wrote:
> The geographic/cartographic examples are perfect. Every service level
> could benefit from higher quality linked data

So given a massive corpus of data, doesn't "subjectively" bad data lead 
to "subjectively" better data?

One of the biggest problems in Semantic Web land used to be the actual 
existence of data. Linked Data addressed that problem, and many Linked 
Data tools have emerged that make data visible using a variety of 
presentation metaphors.

As part of conversations about data, you do need to be able to see the 
"subjectively" bad to make it "subjectively" good. What you can't do 
(which is what Glenn does repeatedly) is conflate the tools that 
actually enable you to see the subjectively "good, bad, or ugly" with said 
data.

The iterative data quality pursuit process -- as I see it -- goes 
something like this:

1. Data needs to exist -- be created
2. Data needs to be accessible -- at some Address
3. Data needs to be understood via its structure
4. Data needs to be presented -- various metaphors
5. Data needs to be disseminated -- sharing URLs for example
6. Data quality issues are discussed amongst consumers bearing in mind 
their subjective context lenses -- consumers discuss with producers e.g. 
when there are critical errors in line with specific dataset use etc.
7. Iterate.


> Deborah MacPherson
> On 4/8/11, glenn mcdonald <gmcdonald@furia.com> wrote:
>> I don't think data quality is an amorphous, aesthetic, hopelessly subjective
>> topic. Data "beauty" might be subjective, and the same data may have
>> different applicability to different tasks, but there are a lot of obvious
>> and straightforward ways of thinking about the quality of a dataset
>> independent of the particular preferences of individual beholders. Here are
>> just some of them:
>> 1. Accuracy: Are the individual nodes that refer to factual information
>> factually and lexically correct? Like, is Chicago spelled "Chigaco" or does
>> the dataset say its population is 2.7?
>> 2. Intelligibility: Are there human-readable labels on things, so you can
>> tell what a thing is when you're looking at it? Is there a model, so you can
>> tell what questions you can ask? If a thing has multiple labels (or a set of
>> owl:sameAs things have multiple labels), do you know which (or if) one is
>> canonical?
>> 3. Referential correspondence: If a set of data points represents some set
>> of real-world referents, is there one and only one point per referent? If
>> you have 9,780 data points representing cities, but 5 of them are "Chicago",
>> "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and
>> "Chicagoland", that's bad.
>> 4. Completeness: Where you have data representing a clear finite set of
>> referents, do you have them all? All the countries, all the states, all the
>> NHL teams, etc? And if you have things related to these sets, are those
>> projections complete? Populations of every country? Addresses of arenas of
>> all the hockey teams?
>> 5. Boundedness: Where you have data representing a clear finite set of
>> referents, is it unpolluted by other things? E.g., can you get a list of
>> current real countries, not mixed with former states or fictional empires or
>> administrative subdivisions?
>> 6. Typing: Do you really have properly typed nodes for things, or do you
>> just have literals? The first president of the US was not "George
>> Washington"^^xsd:string, it was a person whose name-renderings include
>> "George Washington". Your ability to ask questions will be constrained or
>> crippled if your data doesn't know the difference.
>> 7. Modeling correctness: Is the logical structure of the data properly
>> represented? Graphs are relational databases without the crutch of "rows";
>> if you screw up the modeling, your queries will produce garbage.
>> 8. Modeling granularity: Did you capture enough of the data to actually make
>> use of it? ":us :president :george_washington" isn't exactly wrong, but it's
>> pretty limiting. Model presidencies, with their dates, and you've got much
>> more powerful data.
>> 9. Connectedness: If you're bringing together datasets that used to be
>> separate, are the join points represented properly? Is the US from your
>> country list the same as (or owl:sameAs) the US from your list of
>> presidencies and the US from your list of world cities and their
>> populations?
>> 10. Isomorphism: If you're bringing together datasets that used to be separate,
>> are their models reconciled? Does an album contain songs, or does it contain
>> tracks which are publications of recordings of songs, or something else? If
>> each data point answers this question differently, even simple-seeming
>> queries may be intractable.
>> 11. Currency: Is the data up-to-date?
>> 12. Directionality: Can you navigate the logical binary relationships in
>> either direction? Can you get from a country to its presidencies to their
>> presidents, or do you have to know to only ask about presidents'
>> presidencies' countries? Or worse, do you have to ask every question in
>> permutations of directions because some data asserts things one way and some
>> asserts it only the other?
>> 13. Attribution: If your data comes from multiple sources, or in multiple
>> batches, can you tell which came from where?
>> 14. History: If your data has been edited, can you tell how and by whom?
>> 15. Internal consistency: Do the populations of your counties add up to the
>> populations of your states? Do the substitutes going into your soccer
>> matches balance the substitutes going out?
>> That's by no means an exhaustive list, and I didn't even start on the kinds
>> of quality you can start talking about if you widen the scope of what you
>> mean by "a dataset" to include the environment in which it's made available:
>> performance, query repeatability, explorational fluidity, expressiveness of
>> inquiry, analytic power, UI intelligibility, openness...
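
Several of the points quoted above lend themselves to mechanical checks.
What follows is only a minimal sketch in Python, assuming a toy
in-memory dataset: the function names, the record layout, the alias
table, and all sample values (including "Narnia" and the states "X" and
"Y") are hypothetical illustrations, not anything proposed in the
thread. It covers point 3 (referential correspondence), points 4 and 5
(completeness and boundedness), and point 15 (internal consistency).

```python
from collections import defaultdict


def duplicate_referents(records, canonical_of):
    """Point 3: group record ids by canonical referent; keep groups larger than one."""
    groups = defaultdict(list)
    for rec in records:
        # Fall back to the raw label when no canonical form is known.
        key = canonical_of.get(rec["label"], rec["label"])
        groups[key].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


def missing_and_extra(actual, expected):
    """Points 4 and 5: (members absent from the data, members that do not belong)."""
    actual, expected = set(actual), set(expected)
    return expected - actual, actual - expected


def inconsistent_totals(state_pop, county_rows):
    """Point 15: states whose county populations do not sum to the state figure."""
    sums = defaultdict(int)
    for state, _county, pop in county_rows:
        sums[state] += pop
    return {s: (total, sums[s]) for s, total in state_pop.items() if sums[s] != total}


# Hypothetical sample data, made up for illustration only.
cities = [
    {"id": "c1", "label": "Chicago"},
    {"id": "c2", "label": "Chicago, IL"},
    {"id": "c3", "label": "Chicagoland"},
    {"id": "c4", "label": "Boston"},
]
# Assumed alias table; building one is the hard part in practice.
canonical_of = {"Chicago, IL": "Chicago", "Chicagoland": "Chicago"}

dupes = duplicate_referents(cities, canonical_of)

missing, extra = missing_and_extra(
    actual={"Canada", "Mexico", "Narnia"},
    expected={"Canada", "Mexico", "United States"},
)

bad = inconsistent_totals(
    state_pop={"X": 100, "Y": 50},
    county_rows=[("X", "X1", 60), ("X", "X2", 30), ("Y", "Y1", 50)],
)

print(dupes)    # {'Chicago': ['c1', 'c2', 'c3']}
print(missing)  # {'United States'}
print(extra)    # {'Narnia'}
print(bad)      # {'X': (100, 90)}
```

Each check depends on a reference the data does not supply by itself (an
alias table, an expected member list, a parent total), which is probably
where the subjective context discussed in this thread comes back in.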



Kingsley Idehen	
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Tuesday, 12 April 2011 11:49:22 UTC
