- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Tue, 12 Apr 2011 08:58:06 -0400
- To: glenn mcdonald <gmcdonald@furia.com>
- CC: "public-lod@w3.org" <public-lod@w3.org>
On 4/8/11 9:10 PM, glenn mcdonald wrote:
> I don't think data quality is an amorphous, aesthetic, hopelessly
> subjective topic. Data "beauty" might be subjective, and the same data
> may have different applicability to different tasks, but there are a
> lot of obvious and straightforward ways of thinking about the quality
> of a dataset independent of the particular preferences of individual
> beholders. Here are just some of them:

Glenn,

I (and others) have no issue with data quality; we just understand (first hand) that when you have masses of data from disparate sources, you discuss and iterate your way to subjective sanity via constructive feedback loops.

Summarily conflating source data quality with data access and presentation oriented tools is simply wrong. We all care about data quality, but nothing in the world nullifies the fact that "quality" is subjective.

Is Excel rendered useless because a list of countries with obvious errors was presented in a spreadsheet? To an audience of Spreadsheet developers (programmers making a Spreadsheet product) that's irrelevant; to the accounts or marketing department of Spreadsheet product customers (actual users doing their jobs) that's important, but it has nothing to do with the Spreadsheet product itself. The same analogy applies to any DBMS product. You have to separate the parts: that is the message I keep trying to relay to you, i.e., stop conflating matters in an unnecessarily disruptive way.

Back to the data quality discussion:

Subjectively low quality data can lead to subjectively higher quality data. Without data, all you have is an empty space. Using any form of "all or nothing" proposition in a subjective realm is fatally flawed.

How would you address data quality issues in situations where data producers, data shape, data consumers, and data presentation tools are all loosely coupled?

Bearing in mind your issues with DBpedia and other datasets from the LOD cloud, are contributions of quality data from you out of the question re. a virtuous cycle that's oriented towards subjectively improved quality? I've already made it clear to you that DBpedia contributions are welcome; they trump griping any day, and you would actually be quite surprised by the kind of discourse clarity said contributions would unveil. Thus, why don't you call my bluff by producing and sharing a "data quality" linkset for the LOD cloud?

Note: the FAO, SUMO, Yago, UMBEL, and OpenCyc communities have all contributed data to the LOD cloud that enables application of their context lenses to linked open data spaces like DBpedia.

I spend a lot of time behind the scenes working with a variety of people on the very subject of data quality, linkset partitioning via named graphs, and conditional application of inference contexts via the combination of rules and reasoners. Unfortunately, you are so bent on obliterating the start of conversations that you don't even recognize different routes to the same destination.

As for reconciling a common Referent for multiple Identifiers in a Linked Data space comprised of 21 Billion+ triples, let's take a look at the subject: Michael Jackson.

1. http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson
   -- basic description of 'Michael Jackson' from DBpedia

2. http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson
   -- list of source named graphs in the host DBMS

3. http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=2
   -- list of named graphs with triples that reference this subject

4. http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=3
   -- explicit owl:sameAs relations across the entire DBMS (clicking on each Identifier will unveil the description graph for the Referent of said Identifier)

5. http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=4
   -- use of an InverseFunctionalProperty based rule to generate a fuzzy list of Identifiers that potentially share the same Referent (click on each link as per the prior step)

6. http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&sas=yes
   -- inference context enhanced description of 'Michael Jackson' (this is a union expansion of all properties across all Identifiers in an owl:sameAs relation with the DBpedia Entity, hence the use of paging to handle result set size)

7. http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&sas=yes&p=6&lp=7&op=4&prev=&gp=6
   -- Page 5 of 8 re. enhanced description of 'Michael Jackson'

Steps 1-7 can provide many insights about data that aid subjective quality fixes via simple protocols such as the consumer notifying the publisher. In the very worst of cases (agreeing to disagree), the consumer makes a linkset and passes it on to the producer; the producer reciprocates by uploading the linkset to a named graph and also publishes a named rule, such that when the consumer next visits they are able to apply their subjective "context lenses" to the data via inference rules. All of this happens without imposing 'world views' on any other consumers of the data, whose needs may vary, subjectively.

The process I outline above is something we do regularly re. the datasets hosted in the public instances we oversee. It's why we actually have a number of demo rules etc.

Accepting the complexity of subjectivity when audience diversity is integral to a system != ignoring or dismissing the value of data quality.
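
As an illustration only, the mechanics behind steps 4 to 6 and the linkset exchange can be sketched in a few lines of Python. Everything in it is an assumption made for the example: the toy quads, the graph names (such as g:consumer-linkset), the ex: and other: identifiers, and the function names are hypothetical, and this is not a description of how the hosted DBMS implements its inference contexts. It only shows the general idea: owl:sameAs links live in a named graph, and a consumer opts in to them by choosing a "lens".

```python
from collections import defaultdict

# Hypothetical toy quads: (graph, subject, predicate, object).
# None of this data comes from DBpedia or the LOD cloud cache.
QUADS = [
    ("g:dbpedia", "dbpedia:Michael_Jackson", "rdfs:label", "Michael Jackson"),
    ("g:other", "ex:mj", "ex:genre", "Pop"),
    ("g:other", "ex:mj", "foaf:homepage", "http://example.org/mj"),
    ("g:third", "other:jackson_m", "foaf:homepage", "http://example.org/mj"),
    # A consumer-contributed linkset, kept apart in its own named graph.
    ("g:consumer-linkset", "dbpedia:Michael_Jackson", "owl:sameAs", "ex:mj"),
]

# Properties treated as inverse functional in this sketch (an assumption).
IFPS = {"foaf:homepage"}


def same_as_closure(start, quads, lens):
    """Identifiers reachable from start via owl:sameAs links that are
    asserted in one of the named graphs selected by lens."""
    neighbours = defaultdict(set)
    for g, s, p, o in quads:
        if p == "owl:sameAs" and g in lens:
            neighbours[s].add(o)
            neighbours[o].add(s)  # sameAs is symmetric
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def describe(start, quads, lens=frozenset()):
    """Union of (predicate, object) pairs over every identifier that
    co-refers with start under the chosen lens (compare step 6)."""
    ids = same_as_closure(start, quads, lens)
    return {(p, o) for _, s, p, o in quads if s in ids and p != "owl:sameAs"}


def ifp_candidates(start, quads, lens=frozenset(), ifps=IFPS):
    """Identifiers sharing an inverse functional property value with a
    known co-referent: a fuzzy candidate list (compare step 5)."""
    known = same_as_closure(start, quads, lens)
    values = {(p, o) for _, s, p, o in quads if s in known and p in ifps}
    return {s for _, s, p, o in quads if (p, o) in values and s not in known}


mj = "dbpedia:Michael_Jackson"
plain = describe(mj, QUADS)                              # no lens applied
lensed = describe(mj, QUADS, lens={"g:consumer-linkset"})  # opt-in lens
maybe_same = ifp_candidates(mj, QUADS, lens={"g:consumer-linkset"})
```

Because the linkset sits in its own named graph, a consumer who does not select that lens still sees the plain description, so one party's view of co-reference is not imposed on anyone else.
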
I also just happen to have hands-on experience dealing with this problem and its inherent subjectivity.

To conclude, your quality factors aren't invalid; the real challenge and question for you is this: how do you cater for this at InterWeb scale, bearing in mind audience heterogeneity?

-- 
Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Tuesday, 12 April 2011 12:58:29 UTC