Re: 15 Ways to Think About Data Quality (Just for a Start) from Deborah MacPherson on 2011-04-12 (public-lod@w3.org from April 2011)

From: Deborah MacPherson <debmacp@gmail.com>
Date: Mon, 11 Apr 2011 22:01:29 -0400
To: glenn mcdonald <gmcdonald@furia.com>
Cc: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <BANLkTimaLF-zAOMsTS6e1BVV38CPJ3VuYg@mail.gmail.com>
The geographic/cartographic examples are perfect. Every service level
could benefit from higher-quality linked data.

Deborah MacPherson

On 4/8/11, glenn mcdonald <gmcdonald@furia.com> wrote:
> I don't think data quality is an amorphous, aesthetic, hopelessly subjective
> topic. Data "beauty" might be subjective, and the same data may have
> different applicability to different tasks, but there are a lot of obvious
> and straightforward ways of thinking about the quality of a dataset
> independent of the particular preferences of individual beholders. Here are
> just some of them:
>
> 1. Accuracy: Are the individual nodes that refer to factual information
> factually and lexically correct? Like, is Chicago spelled "Chigaco" or does
> the dataset say its population is 2.7?
>
> 2. Intelligibility: Are there human-readable labels on things, so you can
> tell what a thing is when you're looking at it? Is there a model, so you can
> tell what questions you can ask? If a thing has multiple labels (or a set of
> owl:sameAs things have multiple labels), do you know which (or if) one is
> canonical?
>
> 3. Referential correspondence: If a set of data points represents some set
> of real-world referents, is there one and only one point per referent? If
> you have 9,780 data points representing cities, but 5 of them are "Chicago",
> "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and
> "Chicagoland", that's bad.
>
> 4. Completeness: Where you have data representing a clear finite set of
> referents, do you have them all? All the countries, all the states, all the
> NHL teams, etc? And if you have things related to these sets, are those
> projections complete? Populations of every country? Addresses of arenas of
> all the hockey teams?
>
> 5. Boundedness: Where you have data representing a clear finite set of
> referents, is it unpolluted by other things? E.g., can you get a list of
> current real countries, not mixed with former states or fictional empires or
> administrative subdivisions?
>
> 6. Typing: Do you really have properly typed nodes for things, or do you
> just have literals? The first president of the US was not "George
> Washington"^^xsd:string, it was a person whose name-renderings include
> "George Washington". Your ability to ask questions will be constrained or
> crippled if your data doesn't know the difference.
>
> 7. Modeling correctness: Is the logical structure of the data properly
> represented? Graphs are relational databases without the crutch of "rows";
> if you screw up the modeling, your queries will produce garbage.
>
> 8. Modeling granularity: Did you capture enough of the data to actually make
> use of it? ":us :president :george_washington" isn't exactly wrong, but it's
> pretty limiting. Model presidencies, with their dates, and you've got much
> more powerful data.
>
> 9. Connectedness: If you're bringing together datasets that used to be
> separate, are the join points represented properly? Is the US from your
> country list the same as (or owl:sameAs) the US from your list of
> presidencies and the US from your list of world cities and their
> populations?
>
> 10. Isomorphism: If you're bringing together datasets that used to be separate,
> are their models reconciled? Does an album contain songs, or does it contain
> tracks which are publications of recordings of songs, or something else? If
> each data point answers this question differently, even simple-seeming
> queries may be intractable.
>
> 11. Currency: Is the data up-to-date?
>
> 12. Directionality: Can you navigate the logical binary relationships in
> either direction? Can you get from a country to its presidencies to their
> presidents, or do you have to know to only ask about presidents'
> presidencies' countries? Or worse, do you have to ask every question in
> permutations of directions because some data asserts things one way and some
> asserts it only the other?
>
> 13. Attribution: If your data comes from multiple sources, or in multiple
> batches, can you tell which came from where?
>
> 14. History: If your data has been edited, can you tell how and by whom?
>
> 15. Internal consistency: Do the populations of your counties add up to the
> populations of your states? Do the substitutes going into your soccer
> matches balance the substitutes going out?
>
>
> That's by no means an exhaustive list, and I didn't even start on the kinds
> of quality you can start talking about if you widen the scope of what you
> mean by "a dataset" to include the environment in which it's made available:
> performance, query repeatability, explorational fluidity, expressiveness of
> inquiry, analytic power, UI intelligibility, openness...
>
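
A few of the checks in the quoted list (3, 4, 5 and 15) can be sketched in
Python. This is only an illustration under assumptions: the sample data, the
alias table and the function names are all hypothetical, and none of it comes
from the thread or from any particular linked-data toolkit.

```python
from collections import defaultdict

# Hypothetical toy data, for illustration only.
cities = ["Chicago", "Chicago, IL", "Metro Chicago", "Chicagoland", "Boston"]
# Assumed hand-maintained table mapping variant labels to one canonical label.
aliases = {
    "Chicago, IL": "Chicago",
    "Metro Chicago": "Chicago",
    "Chicagoland": "Chicago",
}


def duplicate_referents(labels, aliases):
    """Referential correspondence: referents represented by more than one label."""
    groups = defaultdict(list)
    for label in labels:
        groups[aliases.get(label, label)].append(label)
    return {ref: labs for ref, labs in groups.items() if len(labs) > 1}


def missing_and_extra(have, expected):
    """Completeness (what is missing) and boundedness (what does not belong),
    measured against a known finite set of referents."""
    have, expected = set(have), set(expected)
    return expected - have, have - expected


def inconsistent_totals(state_pop, county_pop):
    """Internal consistency: states whose county populations do not add up.

    county_pop maps (state, county) pairs to populations.
    Returns {state: (stated_total, summed_total)} for each mismatch.
    """
    sums = defaultdict(int)
    for (state, _county), pop in county_pop.items():
        sums[state] += pop
    return {s: (total, sums[s]) for s, total in state_pop.items() if sums[s] != total}
```

For example, duplicate_referents(cities, aliases) groups the four Chicago
variants under one referent and leaves "Boston" alone. The alias table is the
weak point of the sketch: someone still has to decide which labels refer to
the same thing.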


-- 
********************************************************

Deborah L. MacPherson CSI CCS, AIA
Specifications and Research Cannon Design
Projects Director, Accuracy&Aesthetics


The content of this email may contain private
and confidential information. Do not forward,
copy, share, or otherwise distribute without
explicit written permission from all correspondents.

********************************************************
Received on Tuesday, 12 April 2011 02:01:57 UTC