- From: Deborah MacPherson <debmacp@gmail.com>
- Date: Mon, 11 Apr 2011 22:01:29 -0400
- To: glenn mcdonald <gmcdonald@furia.com>
- Cc: "public-lod@w3.org" <public-lod@w3.org>
The geographic/cartographic examples are perfect. Every service level
could benefit from higher quality linked data.

Deborah MacPherson

On 4/8/11, glenn mcdonald <gmcdonald@furia.com> wrote:
> I don't think data quality is an amorphous, aesthetic, hopelessly subjective
> topic. Data "beauty" might be subjective, and the same data may have
> different applicability to different tasks, but there are a lot of obvious
> and straightforward ways of thinking about the quality of a dataset
> independent of the particular preferences of individual beholders. Here are
> just some of them:
>
> 1. Accuracy: Are the individual nodes that refer to factual information
> factually and lexically correct? Like, is Chicago spelled "Chigaco" or does
> the dataset say its population is 2.7?
>
> 2. Intelligibility: Are there human-readable labels on things, so you can
> tell what a thing is when you're looking at it? Is there a model, so you can
> tell what questions you can ask? If a thing has multiple labels (or a set of
> owl:sameAs things have multiple labels), do you know which (or if) one is
> canonical?
>
> 3. Referential correspondence: If a set of data points represents some set
> of real-world referents, is there one and only one point per referent? If
> you have 9,780 data points representing cities, but 5 of them are "Chicago",
> "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and
> "Chicagoland", that's bad.
>
> 4. Completeness: Where you have data representing a clear finite set of
> referents, do you have them all? All the countries, all the states, all the
> NHL teams, etc? And if you have things related to these sets, are those
> projections complete? Populations of every country? Addresses of arenas of
> all the hockey teams?
>
> 5. Boundedness: Where you have data representing a clear finite set of
> referents, is it unpolluted by other things? E.g., can you get a list of
> current real countries, not mixed with former states or fictional empires or
> administrative subdivisions?
>
> 6. Typing: Do you really have properly typed nodes for things, or do you
> just have literals? The first president of the US was not "George
> Washington"^^xsd:string, it was a person whose name-renderings include
> "George Washington". Your ability to ask questions will be constrained or
> crippled if your data doesn't know the difference.
>
> 7. Modeling correctness: Is the logical structure of the data properly
> represented? Graphs are relational databases without the crutch of "rows";
> if you screw up the modeling, your queries will produce garbage.
>
> 8. Modeling granularity: Did you capture enough of the data to actually make
> use of it? ":us :president :george_washington" isn't exactly wrong, but it's
> pretty limiting. Model presidencies, with their dates, and you've got much
> more powerful data.
>
> 9. Connectedness: If you're bringing together datasets that used to be
> separate, are the join points represented properly? Is the US from your
> country list the same as (or owl:sameAs) the US from your list of
> presidencies and the US from your list of world cities and their
> populations?
>
> 10. Isomorphism: If you're bringing together datasets that used to be separate,
> are their models reconciled? Does an album contain songs, or does it contain
> tracks which are publications of recordings of songs, or something else? If
> each data point answers this question differently, even simple-seeming
> queries may be intractable.
>
> 11. Currency: Is the data up-to-date?
>
> 12. Directionality: Can you navigate the logical binary relationships in
> either direction? Can you get from a country to its presidencies to their
> presidents, or do you have to know to only ask about presidents'
> presidencies' countries? Or worse, do you have to ask every question in
> permutations of directions because some data asserts things one way and some
> asserts it only the other?
>
> 13. Attribution: If your data comes from multiple sources, or in multiple
> batches, can you tell which came from where?
>
> 14. History: If your data has been edited, can you tell how and by whom?
>
> 15. Internal consistency: Do the populations of your counties add up to the
> populations of your states? Do the substitutes going into your soccer
> matches balance the substitutes going out?
>
>
> That's by no means an exhaustive list, and I didn't even start on the kinds
> of quality you can start talking about if you widen the scope of what you
> mean by "a dataset" to include the environment in which it's made available:
> performance, query repeatability, explorational fluidity, expressiveness of
> inquiry, analytic power, UI intelligibility, openness...
>

--
********************************************************
Deborah L. MacPherson CSI CCS, AIA
Specifications and Research Cannon Design
Projects Director, Accuracy&Aesthetics

The content of this email may contain private and
confidential information. Do not forward, copy, share, or
otherwise distribute without explicit written permission
from all correspondents.
********************************************************
Received on Tuesday, 12 April 2011 02:01:57 UTC