Re: 15 Ways to Think About Data Quality (Just for a Start)

Hi Glenn,

This reminds me of some established frameworks.
Here is a list of criteria for metadata quality gathered from the literature
[1]. It is not exhaustive. Besiki Stvilia has also worked on a more
comprehensive framework [2]. More has been done on information quality in
general. However, I guess they do not cover all the aspects you mentioned, in
particular those relating to the ontology used and to linkage.

*Completeness* In a complete metadata record, the learning object is
described using all the fields that are relevant to describe it.

*Accuracy* In an accurate metadata record, the data contained in the
fields correspond to the object that is being described.

*Provenance* The provenance parameter reflects the degree of trust that
you have in the creator of the metadata record.

*Conformance to expectations* This parameter measures how well the data
contained in the record let you gain knowledge about the learning object
without actually seeing the object.

*Logical consistency and coherence* This parameter reflects two measures.
Consistency measures whether the values chosen for different fields in the
record agree with each other. Coherence measures whether all the fields
talk about the same object.

*Timeliness* This parameter measures how up-to-date the metadata record is
compared with changes in the object.

*Accessibility* This parameter measures how well you are able to
understand the content of the metadata record.

Muriel Foulonneau
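
As a rough illustration of how the completeness criterion above could be
turned into a number, here is a minimal sketch in Python. It assumes a
record is a simple dictionary and that the set of relevant fields is known
in advance. The field names and the completeness function are hypothetical
and are not taken from [1] or [2].

```python
# Hypothetical list of fields considered relevant for a learning object.
# A real application profile would define its own list.
RELEVANT_FIELDS = ["title", "description", "creator", "date", "language"]


def completeness(record, relevant_fields=RELEVANT_FIELDS):
    """Return the fraction of relevant fields that hold a non-empty value."""
    if not relevant_fields:
        return 1.0
    filled = 0
    for field in relevant_fields:
        value = record.get(field)
        # Missing fields, None and whitespace-only strings count as empty.
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(relevant_fields)


# Made-up record: two of the five relevant fields carry a usable value.
record = {"title": "Intro to RDF", "creator": "M. Example", "description": "  "}
print(completeness(record))
```

A simple ratio like this treats every field as equally important. Weighting
the fields would be an obvious refinement, but that is a design choice and
not something the criterion itself prescribes.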

[1] Thomas R. Bruce and Diane I. Hillmann, 'The Continuum of Metadata Quality'

[2]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.8053&rep=rep1&type=pdf

On Sat, Apr 9, 2011 at 3:10 AM, glenn mcdonald <gmcdonald@furia.com> wrote:

> I don't think data quality is an amorphous, aesthetic, hopelessly
> subjective topic. Data "beauty" might be subjective, and the same data may
> have different applicability to different tasks, but there are a lot of
> obvious and straightforward ways of thinking about the quality of a dataset
> independent of the particular preferences of individual beholders. Here are
> just some of them:
>
> 1. Accuracy: Are the individual nodes that refer to factual information
> factually and lexically correct? Like, is Chicago spelled "Chigaco" or does
> the dataset say its population is 2.7?
>
> 2. Intelligibility: Are there human-readable labels on things, so you can
> tell what a thing is when you're looking at it? Is there a model, so you can
> tell what questions you can ask? If a thing has multiple labels (or a set of
> owl:sameAs things have multiple labels), do you know which (or if) one is
> canonical?
>
> 3. Referential correspondence: If a set of data points represents some set
> of real-world referents, is there one and only one point per referent? If
> you have 9,780 data points representing cities, but 5 of them are "Chicago",
> "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and
> "Chicagoland", that's bad.
>
> 4. Completeness: Where you have data representing a clear finite set of
> referents, do you have them all? All the countries, all the states, all the
> NHL teams, etc? And if you have things related to these sets, are those
> projections complete? Populations of every country? Addresses of arenas of
> all the hockey teams?
>
> 5. Boundedness: Where you have data representing a clear finite set of
> referents, is it unpolluted by other things? E.g., can you get a list of
> current real countries, not mixed with former states or fictional empires or
> administrative subdivisions?
>
> 6. Typing: Do you really have properly typed nodes for things, or do you
> just have literals? The first president of the US was not "George
> Washington"^^xsd:string, it was a person whose name-renderings include
> "George Washington". Your ability to ask questions will be constrained or
> crippled if your data doesn't know the difference.
>
> 7. Modeling correctness: Is the logical structure of the data properly
> represented? Graphs are relational databases without the crutch of "rows";
> if you screw up the modeling, your queries will produce garbage.
>
> 8. Modeling granularity: Did you capture enough of the data to actually
> make use of it? ":us :president :george_washington" isn't exactly wrong, but
> it's pretty limiting. Model presidencies, with their dates, and you've got
> much more powerful data.
>
> 9. Connectedness: If you're bringing together datasets that used to be
> separate, are the join points represented properly? Is the US from your
> country list the same as (or owl:sameAs) the US from your list of
> presidencies and the US from your list of world cities and their
> populations?
>
> 10. Isomorphism: If you're bringing together datasets that used to be
> separate, are their models reconciled? Does an album contain songs, or does
> it contain tracks which are publications of recordings of songs, or
> something else? If each data point answers this question differently, even
> simple-seeming queries may be intractable.
>
> 11. Currency: Is the data up-to-date?
>
> 12. Directionality: Can you navigate the logical binary relationships in
> either direction? Can you get from a country to its presidencies to their
> presidents, or do you have to know to only ask about presidents'
> presidencies' countries? Or worse, do you have to ask every question in
> permutations of directions because some data asserts things one way and some
> asserts it only the other?
>
> 13. Attribution: If your data comes from multiple sources, or in multiple
> batches, can you tell which came from where?
>
> 14. History: If your data has been edited, can you tell how and by whom?
>
> 15. Internal consistency: Do the populations of your counties add up to the
> populations of your states? Do the substitutes going into your soccer
> matches balance the substitutes going out?
>
>
> That's by no means an exhaustive list, and I didn't even start on the kinds
> of quality you can start talking about if you widen the scope of what you
> mean by "a dataset" to include the environment in which it's made available:
> performance, query repeatability, explorational fluidity, expressiveness of
> inquiry, analytic power, UI intelligibility, openness...
>
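
To make point 15 (internal consistency) a bit more concrete, here is a
minimal sketch in Python of the county/state check. It assumes the data is
already available as plain Python structures. The function name, the data
layout and the figures are all made up for illustration.

```python
from collections import defaultdict


def inconsistent_states(state_pop, county_rows):
    """Return {state: (stated, summed)} where the totals disagree.

    state_pop   -- dict mapping a state to its stated population
    county_rows -- iterable of (state, county, population) tuples
    """
    totals = defaultdict(int)
    for state, _county, population in county_rows:
        totals[state] += population
    return {
        state: (stated, totals[state])
        for state, stated in state_pop.items()
        if totals[state] != stated
    }


# Made-up figures: state "A" adds up, state "B" is short by 50.
state_pop = {"A": 300, "B": 500}
county_rows = [("A", "a1", 100), ("A", "a2", 200), ("B", "b1", 450)]
print(inconsistent_states(state_pop, county_rows))
```

A mismatch does not say which side is wrong. It could equally be a missing
county (a completeness problem) or an out-of-date state figure (a currency
problem), so a check like this only flags where to look.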

Received on Tuesday, 12 April 2011 08:01:20 UTC