Take2: 15 Ways to Think About Data Quality (Just for a Start) from Kingsley Idehen on 2011-04-15 (public-lod@w3.org from April 2011)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Fri, 15 Apr 2011 08:48:28 -0400
To: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <4DA83E9C.6010701@openlinksw.com>

All,

Here is a new thread re. topic above. I believe this matter is
interesting to many, so I've basically taken the original post (modulo
1st sentence in the original) and also appended comments from Dave
Reynolds which he contributed as an extension of Glenn's original list.
In addition, I've added a link to the answers.semanticweb.com thread [1],
as that route may also be preferable to some of you.

Please feel free to discuss this important topic. We can do this without
acrimony :-)

Data "beauty" might be subjective, and the same data may have different
applicability to different tasks, but there are a lot of obvious and
straightforward ways of thinking about the quality of a dataset
independent of the particular preferences of individual beholders. Here
are just some of them:

1. Accuracy: Are the individual nodes that refer to factual information
factually and lexically correct? Like, is Chicago spelled "Chigaco" or
does the dataset say its population is 2.7?

2. Intelligibility: Are there human-readable labels on things, so you
can tell what a thing is when you're looking at it? Is there a model, so
you can tell what questions you can ask? If a thing has multiple labels
(or a set of owl:sameAs things have multiple labels), do you know which
one (if any) is canonical?

3. Referential correspondence: If a set of data points represents some
set of real-world referents, is there one and only one point per
referent? If you have 9,780 data points representing cities, but 5 of
them are "Chicago", "Chicago, IL", "Metro Chicago", "Metropolitain
Chicago, Illinois" and "Chicagoland", that's bad.

4. Completeness: Where you have data representing a clear finite set of
referents, do you have them all? All the countries, all the states, all
the NHL teams, etc? And if you have things related to these sets, are
those projections complete? Populations of every country? Addresses of
arenas of all the hockey teams?

5. Boundedness: Where you have data representing a clear finite set of
referents, is it unpolluted by other things? E.g., can you get a list of
current real countries, not mixed with former states or fictional
empires or administrative subdivisions?

6. Typing: Do you really have properly typed nodes for things, or do you
just have literals? The first president of the US was not "George
Washington"^^xsd:string, it was a person whose name-renderings include
"George Washington". Your ability to ask questions will be constrained
or crippled if your data doesn't know the difference.

7. Modeling correctness: Is the logical structure of the data properly
represented? Graphs are relational databases without the crutch of
"rows"; if you screw up the modeling, your queries will produce garbage.

8. Modeling granularity: Did you capture enough of the data to actually
make use of it? ":us :president :george_washington" isn't exactly wrong,
but it's pretty limiting. Model presidencies, with their dates, and
you've got much more powerful data.

9. Connectedness: If you're bringing together datasets that used to be
separate, are the join points represented properly? Is the US from your
country list the same as (or owl:sameAs) the US from your list of
presidencies and the US from your list of world cities and their
populations?

10. Isomorphism: If you're bringing together datasets that used to be
separate, are their models reconciled? Does an album contain songs, or
does it contain tracks which are publications of recordings of songs, or
something else? If each data point answers this question differently,
even simple-seeming queries may be intractable.

11. Currency: Is the data up-to-date?

12. Directionality: Can you navigate the logical binary relationships in
either direction? Can you get from a country to its presidencies to
their presidents, or do you have to know to only ask about presidents'
presidencies' countries? Or worse, do you have to ask every question in
permutations of directions because some data asserts things one way and
some asserts it only the other?

13. Attribution: If your data comes from multiple sources, or in
multiple batches, can you tell which came from where?

14. History: If your data has been edited, can you tell how and by whom?

15. Internal consistency: Do the populations of your counties add up to
the populations of your states? Do the substitutes going into your
soccer matches balance the substitutes going out?
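
As an illustration of item 15, here is a minimal Python sketch of the
county/state check. The county names, state names, and population
figures are invented for the example and are not real census data; the
function name is likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: (county, state, population).
# The figures are made up for illustration only.
COUNTIES = [
    ("Adams", "StateA", 1200),
    ("Baker", "StateA", 800),
    ("Clark", "StateB", 500),
    ("Dixon", "StateB", 700),
]

# Hypothetical state totals as recorded elsewhere in the same dataset.
STATES = {"StateA": 2000, "StateB": 1300}


def inconsistent_states(counties, states):
    """Return {state: (recorded_total, sum_of_counties)} for each mismatch."""
    sums = defaultdict(int)
    for _county, state, population in counties:
        sums[state] += population
    return {
        state: (recorded, sums[state])
        for state, recorded in states.items()
        if sums[state] != recorded
    }


mismatches = inconsistent_states(COUNTIES, STATES)
print(mismatches)
```

With the sample data, StateB is reported because its counties sum to
1200 while the recorded state total is 1300. A mismatch does not say
which side is wrong, only that the dataset disagrees with itself.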

That's by no means an exhaustive list, and I didn't even start on the
kinds of quality you can start talking about if you widen the scope of
what you mean by "a dataset" to include the environment in which it's
made available: performance, query repeatability, explorational
fluidity, expressiveness of inquiry, analytic power, UI intelligibility,
openness...
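
To make items 6 and 8 from the list concrete, here is a rough Python
sketch that uses plain tuples as triples. The identifiers (":us",
":presidency_1", and so on) and the predicate names are assumptions
made up for this example, not terms from any real vocabulary. The
coarse model mirrors the ":us :president :george_washington" triple
from item 8; the finer model gives each presidency its own node with
dates, and keeps names as labels on person nodes.

```python
from datetime import date

# Coarse model: one triple per president, no dates attached.
COARSE = [
    (":us", ":president", ":george_washington"),
    (":us", ":president", ":john_adams"),
]

# Finer model: each presidency is a node with a country, holder and dates.
FINE = [
    (":presidency_1", ":country", ":us"),
    (":presidency_1", ":holder", ":george_washington"),
    (":presidency_1", ":start", date(1789, 4, 30)),
    (":presidency_1", ":end", date(1797, 3, 4)),
    (":presidency_2", ":country", ":us"),
    (":presidency_2", ":holder", ":john_adams"),
    (":presidency_2", ":start", date(1797, 3, 4)),
    (":presidency_2", ":end", date(1801, 3, 4)),
    # The person is a node; the name is only a label on that node.
    (":george_washington", ":label", "George Washington"),
    (":john_adams", ":label", "John Adams"),
]


def value(triples, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def president_on(triples, country, day):
    """Return labels of whoever held a presidency of `country` on `day`."""
    presidencies = [s for s, p, o in triples if p == ":country" and o == country]
    labels = []
    for node in presidencies:
        if value(triples, node, ":start") <= day < value(triples, node, ":end"):
            holder = value(triples, node, ":holder")
            labels.append(value(triples, holder, ":label"))
    return labels


print(president_on(FINE, ":us", date(1790, 1, 1)))
print(president_on(COARSE, ":us", date(1790, 1, 1)))
```

The finer model can answer "who was president on a given day"; the
coarse model returns nothing for the same question, because the dates
were never captured.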

Plus these comments from Dave Reynolds:

That's a fantastic list and should be recorded on a wiki somewhere!

A minor quibble, not sure about Directionality. You can follow an RDF
link in both directions (at least in SPARQL and any RDF API I've worked
with). I would be inclined to generalize and rephrase this as ...

"Consistency of modelling: whichever way you make modelling decisions
such as direction of relations (from country to president, from
president to country) it is done consistently so you don't have to ask
many permutations of the same query."
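
A small Python sketch of both halves of this point, using plain tuples
as triples. The predicate names ":hasPresident" and ":presidentOf" and
the helper functions are invented for illustration. A single triple can
be followed from either end, but when the same relation is modelled in
two directions, every query has to ask both permutations unless the
data is normalized first.

```python
# Hypothetical triples: the same relation asserted in two directions.
TRIPLES = [
    (":us", ":hasPresident", ":george_washington"),
    (":john_adams", ":presidentOf", ":us"),
]


def objects(triples, subject, predicate):
    """Follow a link forwards: subject -> objects."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(triples, predicate, obj):
    """Follow the same link backwards: object -> subjects."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def presidents_of(triples, country):
    """With mixed modelling, both permutations have to be asked."""
    forward = objects(triples, country, ":hasPresident")
    backward = subjects(triples, ":presidentOf", country)
    return forward | backward


def normalize(triples):
    """Rewrite :presidentOf as :hasPresident so that one pattern suffices."""
    result = []
    for s, p, o in triples:
        if p == ":presidentOf":
            result.append((o, ":hasPresident", s))
        else:
            result.append((s, p, o))
    return result


print(presidents_of(TRIPLES, ":us"))
print(objects(normalize(TRIPLES), ":us", ":hasPresident"))
```

Both print statements yield the same two presidents; the difference is
that after normalization a single query pattern is enough.
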

Possible additions:

"Licensed: the license under which the data can be used is clearly
defined, ideally in a machine checkable way."

"Sustainable: there is some credible basis for believing the data will
be maintained as current (e.g. backed by some appropriate organization
or by a sufficiently large group of individuals, has been updated
frequently in the past)."

"Authoritative: is the provider of the data a credible authority on the
subject? For example, in the UK, Companies House has the definitive
information on registered UK companies, and no amount of crowd sourcing
can change the fact that if the company is not registered with them
then it is not registered :)"

Links:

1.
http://answers.semanticweb.com/questions/1072/quality-indicators-for-linked-data-datasets?sort=votes&page=2
-- discussion / wiki extension of the conversation on
answers.semanticweb.com

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Received on Friday, 15 April 2011 12:48:51 UTC