- From: Muriel Foulonneau <muriel.foulonneau@gmail.com>
- Date: Tue, 12 Apr 2011 10:00:47 +0200
- To: glenn mcdonald <gmcdonald@furia.com>
- Cc: "public-lod@w3.org" <public-lod@w3.org>
- Message-ID: <BANLkTik+x+6AZ1mF-w2Dam4=AdTwwB1iNA@mail.gmail.com>
Hi Glenn,

This reminds me of some established frameworks. Here is a list of criteria for metadata quality gathered from the literature [1]. It is not exhaustive. Besiki Stvilia has also worked on a more comprehensive framework [2]. More has been done on information quality in general. However, I guess they do not cover all the aspects you mentioned, in particular in relation to the ontology used and the linkage aspects, for instance.

*Completeness* In a complete metadata record, the learning object is described using all the fields that are relevant to describe it.

*Accuracy* In an accurate metadata record, the data contained in the fields correspond to the object that is being described.

*Provenance* The provenance parameter reflects the degree of trust that you have in the creator of the metadata record.

*Conformance to expectations* This parameter measures how well the data contained in the record let you gain knowledge about the learning object without actually seeing the object.

*Logical consistency and coherence* This parameter reflects two measures: consistency measures whether the values chosen for different fields in the record agree with each other; coherence measures whether all the fields talk about the same object.

*Timeliness* This parameter measures how up to date the metadata record is compared with changes in the object.

*Accessibility* This parameter measures how well you are able to understand the content of the metadata record.

Muriel Foulonneau

[1] Thomas R. Bruce and Diane I. Hillmann, 'The Continuum of Metadata Quality'
[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.8053&rep=rep1&type=pdf

On Sat, Apr 9, 2011 at 3:10 AM, glenn mcdonald <gmcdonald@furia.com> wrote:

> I don't think data quality is an amorphous, aesthetic, hopelessly
> subjective topic. Data "beauty" might be subjective, and the same data may
> have different applicability to different tasks, but there are a lot of
> obvious and straightforward ways of thinking about the quality of a dataset
> independent of the particular preferences of individual beholders. Here are
> just some of them:
>
> 1. Accuracy: Are the individual nodes that refer to factual information
> factually and lexically correct? Like, is Chicago spelled "Chigaco", or does
> the dataset say its population is 2.7?
>
> 2. Intelligibility: Are there human-readable labels on things, so you can
> tell what a thing is when you're looking at it? Is there a model, so you can
> tell what questions you can ask? If a thing has multiple labels (or a set of
> owl:sameAs things have multiple labels), do you know which one (if any) is
> canonical?
>
> 3. Referential correspondence: If a set of data points represents some set
> of real-world referents, is there one and only one point per referent? If
> you have 9,780 data points representing cities, but 5 of them are "Chicago",
> "Chicago, IL", "Metro Chicago", "Metropolitan Chicago, Illinois" and
> "Chicagoland", that's bad.
>
> 4. Completeness: Where you have data representing a clear finite set of
> referents, do you have them all? All the countries, all the states, all the
> NHL teams, etc.? And if you have things related to these sets, are those
> projections complete? Populations of every country? Addresses of arenas of
> all the hockey teams?
>
> 5. Boundedness: Where you have data representing a clear finite set of
> referents, is it unpolluted by other things? E.g., can you get a list of
> current real countries, not mixed with former states or fictional empires or
> administrative subdivisions?
>
> 6. Typing: Do you really have properly typed nodes for things, or do you
> just have literals? The first president of the US was not "George
> Washington"^^xsd:string, it was a person whose name-renderings include
> "George Washington". Your ability to ask questions will be constrained or
> crippled if your data doesn't know the difference.
>
> 7. Modeling correctness: Is the logical structure of the data properly
> represented? Graphs are relational databases without the crutch of "rows";
> if you screw up the modeling, your queries will produce garbage.
>
> 8. Modeling granularity: Did you capture enough of the data to actually
> make use of it? ":us :president :george_washington" isn't exactly wrong, but
> it's pretty limiting. Model presidencies, with their dates, and you've got
> much more powerful data.
>
> 9. Connectedness: If you're bringing together datasets that used to be
> separate, are the join points represented properly? Is the US from your
> country list the same as (or owl:sameAs) the US from your list of
> presidencies and the US from your list of world cities and their
> populations?
>
> 10. Isomorphism: If you're bringing together datasets that used to be
> separate, are their models reconciled? Does an album contain songs, or does
> it contain tracks which are publications of recordings of songs, or
> something else? If each data point answers this question differently, even
> simple-seeming queries may be intractable.
>
> 11. Currency: Is the data up-to-date?
>
> 12. Directionality: Can you navigate the logical binary relationships in
> either direction? Can you get from a country to its presidencies to their
> presidents, or do you have to know to only ask about presidents'
> presidencies' countries? Or worse, do you have to ask every question in
> permutations of directions because some data asserts things one way and some
> asserts it only the other?
>
> 13. Attribution: If your data comes from multiple sources, or in multiple
> batches, can you tell which came from where?
>
> 14. History: If your data has been edited, can you tell how and by whom?
>
> 15. Internal consistency: Do the populations of your counties add up to the
> populations of your states? Do the substitutes going into your soccer
> matches balance the substitutes going out?
>
> That's by no means an exhaustive list, and I didn't even start on the kinds
> of quality you can start talking about if you widen the scope of what you
> mean by "a dataset" to include the environment in which it's made available:
> performance, query repeatability, explorational fluidity, expressiveness of
> inquiry, analytic power, UI intelligibility, openness...
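A crude way to probe the referential correspondence point (item 3 above) is a SPARQL sketch along these lines; the ex:City class and the use of rdfs:label are assumptions about how such a dataset might be modelled, not a fixed recipe:

  PREFIX ex:   <http://example.org/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  # Labels shared by more than one city resource are candidate duplicates.
  # Variants like "Chicago" vs. "Metro Chicago" still need fuzzier matching;
  # this only surfaces exact label collisions.
  SELECT ?label (COUNT(DISTINCT ?city) AS ?howMany)
  WHERE {
    ?city a ex:City ;
          rdfs:label ?label .
  }
  GROUP BY ?label
  HAVING (COUNT(DISTINCT ?city) > 1)
  ORDER BY DESC(?howMany)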
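And to make the typing and granularity points (items 6 and 8 above) concrete, a minimal Turtle sketch might look like the following; the example.org URIs and property names are invented for illustration rather than taken from any existing vocabulary:

  @prefix ex:   <http://example.org/> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  # Coarse: true, but you cannot ask date-based questions of it.
  ex:us ex:president ex:george_washington .

  # Granular: the presidency is its own typed node with dates, and the
  # president is a resource carrying a label, not a bare string literal.
  ex:presidency1 a ex:Presidency ;
      ex:country   ex:us ;
      ex:president ex:george_washington ;
      ex:startDate "1789-04-30"^^xsd:date ;
      ex:endDate   "1797-03-04"^^xsd:date .

  ex:george_washington a ex:Person ;
      rdfs:label "George Washington" .

Questions like "who was president of ex:us in 1790?" only become expressible once the presidency is modelled as a node of its own.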
Received on Tuesday, 12 April 2011 08:01:20 UTC