- From: Simon Spero <ses@unc.edu>
- Date: Tue, 29 Mar 2011 18:05:27 -0400
- To: "Diane I. Hillmann" <metadata.maven@gmail.com>
- Cc: Jodi Schneider <jodi.schneider@deri.org>, Karen Coyle <kcoyle@kcoyle.net>, public-lld <public-lld@w3.org>
- Message-ID: <AANLkTinwq2=ZXsaOAGXs88FCeCWqY8sh4R0eThgbMt7J@mail.gmail.com>
On Tue, Mar 29, 2011 at 10:28 AM, Diane I. Hillmann <metadata.maven@gmail.com> wrote:

> As for (2), we should be talking about how unnecessary the whole idea of
> de-duplication becomes in a world where statement-level data is the norm.
> The number and diversity of statements is important information when
> evaluating the usefulness of data, particularly in a machine environment.
> If you have, for instance, 10 statements about the format of an item and 9
> of them agree, is that not useful? The duplication here supports the
> validity of those 9 statements that agree.

Diane-

I think you may be misunderstanding Jodi's point, and your argument may have problems that show up under certain plausible assumptions.

1. If we define de-duplication as the process of determining whether two entities are the same, that process is no simpler when dealing with triples than when dealing with full-length records. The number of properties that need to be compared remains the same.

2. Unless the subjects of multiple statements are known to refer to the same thing, nothing can be assumed about 9 triples with the same value for the same predicate versus 1 triple with a different value for that predicate. However, to assume that the subjects do refer to the same thing is to beg the question.

3. Modern probabilistic record linkage techniques designed to generate matches in the presence of errors rely on scoring based on weighted values for matches and non-matches across multiple different fields (see the sketch after the references). Computing these weights for entities represented as binary predicates requires re-gathering all the values. For a very readable introduction to record linkage and data quality written by leading experts in the field, see Scheuren et al. (2007). Scheuren is a past president of the ASA and specializes in the use of statistics to support human rights; Winkler is head of research at the US Census Bureau; and Herzog is stats guru for the FHA. Although some mathematical techniques are covered, you can skip those parts and go straight to the results and business cases for why data quality matters.

4. Even given that the subject of the statements has been established, simply knowing that 9 out of the 10 assertions you are aware of have the same value and 1 has a different value is by itself insufficient to decide that the majority value is correct:

- The assertions may not be independent.
- The base frequency of the majority value may be so low that the posterior probability that the minority value is correct is still greater (a worked example follows the references).
- The statements may be deliberate falsehoods.

Without addressing the issues of provenance, etc., or without other reasons to believe that the value is correct, it is unwise to assume that majority rules. An excellent multidisciplinary review of the field of evidence and probabilistic reasoning can be found in Schum (2001). Not surprisingly, Schum is a professor in both the law school and the CS department at GMU.

Simon

Scheuren, Fritz, William E. Winkler, and Thomas Herzog (2007). Data Quality and Record Linkage Techniques. New York: Springer. ISBN: 9780387695051.

Schum, D.A. (2001). The Evidential Foundations of Probabilistic Reasoning. Northwestern University Press. ISBN: 0810118211.
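To make point 3 concrete, here is a minimal sketch of the kind of weighted match/non-match scoring used in Fellegi-Sunter style record linkage, which is the approach Scheuren et al. (2007) describe. The field names and the m/u probabilities below are invented for illustration; in practice the weights are estimated from the data themselves, which is why entities stored as individual binary predicates first have to have all their values re-gathered.

```python
# Sketch of Fellegi-Sunter style scoring with illustrative (made-up) weights.
from math import log2

# For each field: m = P(field agrees | true match), u = P(field agrees | non-match)
FIELDS = {
    "title":     {"m": 0.95, "u": 0.10},
    "publisher": {"m": 0.90, "u": 0.20},
    "year":      {"m": 0.98, "u": 0.05},
    "format":    {"m": 0.90, "u": 0.40},
}

def match_score(rec_a, rec_b):
    """Sum of log-likelihood-ratio weights over the compared fields."""
    score = 0.0
    for field, p in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            score += log2(p["m"] / p["u"])              # agreement weight
        else:
            score += log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return score

a = {"title": "Moby Dick", "publisher": "Harper", "year": "1851", "format": "print"}
b = {"title": "Moby Dick", "publisher": "Harper", "year": "1851", "format": "microform"}
print(match_score(a, b))  # a high positive score suggests the same entity
```

Record pairs scoring above an upper threshold are declared matches, those below a lower threshold non-matches, and the rest are sent for clerical review; the thresholds are chosen from the estimated error rates.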
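And to make the base-frequency caveat in point 4 concrete: under the (strong) simplifying assumptions that the ten statements are independent and that each reporter states the true value of a two-valued attribute with probability p, Bayes' rule shows that a sufficiently rare value can remain improbable even when 9 of 10 statements assert it. All numbers here are invented for illustration.

```python
# Worked base-rate example for point 4, assuming independent reporters that
# each state the true value with probability p (illustrative numbers only).

def posterior_majority_correct(prior_majority, p, n_majority=9, n_minority=1):
    """P(majority value is true | 9 statements for it, 1 against),
    for a two-valued attribute."""
    prior_minority = 1.0 - prior_majority
    # Likelihood of the observed 9-vs-1 split under each hypothesis
    # (the shared binomial coefficient cancels).
    like_if_majority_true = p**n_majority * (1 - p)**n_minority
    like_if_minority_true = (1 - p)**n_majority * p**n_minority
    num = prior_majority * like_if_majority_true
    return num / (num + prior_minority * like_if_minority_true)

# Modestly reliable reporters (60%) asserting a value with a 1-in-100 base
# frequency: the majority value ends up only about 21% probable, so the
# minority value is still the better bet.
print(posterior_majority_correct(prior_majority=0.01, p=0.6))
```

Drop the independence assumption (the first bullet above) and the case for the majority value gets weaker still, since nine copies of one erroneous source count as little more than one.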
Received on Tuesday, 29 March 2011 22:06:00 UTC