15 Ways to Think About Data Quality (Just for a Start)

I don't think data quality is an amorphous, aesthetic, hopelessly subjective
topic. Data "beauty" might be subjective, and the same data may have
different applicability to different tasks, but there are a lot of obvious
and straightforward ways of thinking about the quality of a dataset
independent of the particular preferences of individual beholders. Here are
just some of them:

1. Accuracy: Are the individual nodes that refer to factual information
factually and lexically correct? Like, is Chicago spelled "Chigaco" or does
the dataset say its population is 2.7?

2. Intelligibility: Are there human-readable labels on things, so you can
tell what a thing is when you're looking at it? Is there a model, so you can
tell what questions you can ask? If a thing has multiple labels (or a set of
owl:sameAs things have multiple labels), do you know which (or if) one is
canonical?

3. Referential correspondence: If a set of data points represents some set
of real-world referents, is there one and only one point per referent? If
you have 9,780 data points representing cities, but 5 of them are "Chicago",
"Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and
"Chicagoland", that's bad.

4. Completeness: Where you have data representing a clear finite set of
referents, do you have them all? All the countries, all the states, all the
NHL teams, etc? And if you have things related to these sets, are those
projections complete? Populations of every country? Addresses of arenas of
all the hockey teams?

5. Boundedness: Where you have data representing a clear finite set of
referents, is it unpolluted by other things? E.g., can you get a list of
current real countries, not mixed with former states or fictional empires or
administrative subdivisions?

6. Typing: Do you really have properly typed nodes for things, or do you
just have literals? The first president of the US was not "George
Washington"^^xsd:string, it was a person whose name-renderings include
"George Washington". Your ability to ask questions will be constrained or
crippled if your data doesn't know the difference.

7. Modeling correctness: Is the logical structure of the data properly
represented? Graphs are relational databases without the crutch of "rows";
if you screw up the modeling, your queries will produce garbage.

8. Modeling granularity: Did you capture enough of the data to actually make
use of it? ":us :president :george_washington" isn't exactly wrong, but it's
pretty limiting. Model presidencies, with their dates, and you've got much
more powerful data.

9. Connectedness: If you're bringing together datasets that used to be
separate, are the join points represented properly? Is the US from your
country list the same as (or owl:sameAs) the US from your list of
presidencies and the US from your list of world cities and their
populations?

10. Isomorphism: If you're bringing together datasets that used to be
separate, are their models reconciled? Does an album contain songs, or does
it contain tracks which are publications of recordings of songs, or
something else? If each data point answers this question differently, even
simple-seeming queries may be intractable.

11. Currency: Is the data up-to-date?

12. Directionality: Can you navigate the logical binary relationships in
either direction? Can you get from a country to its presidencies to their
presidents, or do you have to know to only ask about presidents'
presidencies' countries? Or worse, do you have to ask every question in
permutations of directions because some data asserts things one way and some
asserts it only the other?

13. Attribution: If your data comes from multiple sources, or in multiple
batches, can you tell which came from where?

14. History: If your data has been edited, can you tell how and by whom?

15. Internal consistency: Do the populations of your counties add up to the
populations of your states? Do the substitutes going into your soccer
matches balance the substitutes going out?

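Several of these can be checked mechanically. As a rough sketch only, here
is how completeness (4), boundedness (5) and internal consistency (15) might
look in Python over toy data. The state codes, population figures and
function names are all invented for illustration; they are assumptions, not
taken from any real dataset or library.

```python
from collections import defaultdict

# Hypothetical toy data, invented purely for illustration.
EXPECTED_STATES = {"IL", "WI", "IN"}  # the clear finite set of referents

# "IN" is missing, and a county has leaked into the list of states.
states = {"IL": 100, "WI": 60, "Cook County": 40}

# (county name, parent state, population)
counties = [
    ("Cook", "IL", 40),
    ("DuPage", "IL", 25),
    ("Lake", "IL", 35),
    ("Dane", "WI", 30),
    ("Milwaukee", "WI", 20),
]


def missing_members(expected, actual):
    """Completeness: expected referents that have no data point."""
    return set(expected) - set(actual)


def stray_members(expected, actual):
    """Boundedness: data points that do not belong to the expected set."""
    return set(actual) - set(expected)


def inconsistent_totals(parent_totals, child_rows):
    """Internal consistency: parents whose children do not sum to them.

    Returns {parent: (stated_total, summed_total)} for each mismatch.
    Parents with no child rows at all are skipped rather than flagged.
    """
    sums = defaultdict(int)
    for _name, parent, value in child_rows:
        sums[parent] += value
    return {
        parent: (total, sums[parent])
        for parent, total in parent_totals.items()
        if parent in sums and sums[parent] != total
    }


if __name__ == "__main__":
    print("missing:", missing_members(EXPECTED_STATES, states))
    print("stray:", stray_members(EXPECTED_STATES, states))
    print("inconsistent:", inconsistent_totals(states, counties))
```

With this toy data the first check reports IN as missing, the second flags
"Cook County" as not being a state, and the third reports that WI claims 60
while its counties sum to 50. Note that all three depend on having a trusted
reference set or total to compare against, which is itself a data-quality
question.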

That's by no means an exhaustive list, and I didn't even start on the kinds
of quality you can start talking about if you widen the scope of what you
mean by "a dataset" to include the environment in which it's made available:
performance, query repeatability, explorational fluidity, expressiveness of
inquiry, analytic power, UI intelligibility, openness...

Received on Monday, 11 April 2011 20:47:26 UTC