- From: Simon Spero <ses@unc.edu>
- Date: Tue, 29 Mar 2011 18:05:27 -0400
- To: "Diane I. Hillmann" <metadata.maven@gmail.com>
- Cc: Jodi Schneider <jodi.schneider@deri.org>, Karen Coyle <kcoyle@kcoyle.net>, public-lld <public-lld@w3.org>
- Message-ID: <AANLkTinwq2=ZXsaOAGXs88FCeCWqY8sh4R0eThgbMt7J@mail.gmail.com>
On Tue, Mar 29, 2011 at 10:28 AM, Diane I. Hillmann <metadata.maven@gmail.com> wrote:

> As for (2), we should be talking about how unnecessary the whole idea of
> de-duplication becomes in a world where statement-level data is the norm.
> The number and diversity of statements is important information when
> evaluating the usefulness of data, particularly in a machine environment.
> If you have, for instance, 10 statements about the format of an item and 9
> of them agree, is that not useful? The duplication here supports the
> validity of those 9 statements that agree.

Diane-

I think you may be misunderstanding Jodi's point, and your argument may have problems that show up under certain plausible assumptions.

1. If we define de-duplication as the process of determining whether two entities are the same, that process is no simpler when dealing with triples than when dealing with full-length records. The number of properties that need to be compared remains the same.

2. Unless the subjects of multiple statements are known to refer to the same thing, nothing can be assumed about 9 triples with the same value for the same predicate versus 1 triple with a different value for that predicate. However, to assume that the subjects do refer to the same thing is to beg the question.

3. Modern probabilistic record linkage techniques designed to generate matches in the presence of errors rely on scoring based on weighted values for matches and non-matches across multiple different fields (see the sketch after the references). Computing these weights for entities represented as binary predicates requires re-gathering all the values. For a very readable introduction to record linkage and data quality written by leading experts in the field, see Scheuren et al. (2007). Scheuren is a past president of the ASA and specializes in the use of statistics to support human rights; Winkler is head of research at the US Census Bureau; and Herzog is stats guru for the FHA. Although some mathematical techniques are covered, you can skip those parts and go straight to the results and business cases for why data quality matters.

4. Even given that the subject of the statements has been established, simply knowing that 9 out of the 10 assertions you are aware of have the same value and 1 has a different value is by itself insufficient to decide that the majority value is correct:

- The assertions may not be independent.
- The base frequency of the majority value may be so low that the posterior probability that the minority value is correct is still greater (a worked example follows the references).
- The statements may be deliberate falsehoods.

Without addressing the issues of provenance, etc., or without other reasons to believe that the value is correct, it is unwise to assume that majority rules. An excellent multidisciplinary review of the field of evidence and probabilistic reasoning can be found in Schum (2001). Not surprisingly, Schum is a professor in both the law school and the CS department at GMU.

Simon

Scheuren, Fritz, William E. Winkler, and Thomas Herzog (2007). Data Quality and Record Linkage Techniques. New York: Springer. ISBN: 9780387695051.

Schum, D.A. (2001). The Evidential Foundations of Probabilistic Reasoning. Northwestern University Press. ISBN: 0810118211.
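To make point 3 concrete, here is a minimal sketch of the kind of weighted match/non-match scoring used in Fellegi-Sunter style record linkage, which is the approach Scheuren et al. (2007) describe. The field names and the m/u probabilities below are invented for illustration; in practice the weights are estimated from the data themselves, which is why entities stored as individual binary predicates first have to have all their values re-gathered.

```python
# Sketch of Fellegi-Sunter style scoring with illustrative (made-up) weights.
from math import log2

# For each field: m = P(field agrees | true match), u = P(field agrees | non-match)
FIELDS = {
    "title":     {"m": 0.95, "u": 0.10},
    "publisher": {"m": 0.90, "u": 0.20},
    "year":      {"m": 0.98, "u": 0.05},
    "format":    {"m": 0.90, "u": 0.40},
}

def match_score(rec_a, rec_b):
    """Sum of log-likelihood-ratio weights over the compared fields."""
    score = 0.0
    for field, p in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            score += log2(p["m"] / p["u"])              # agreement weight
        else:
            score += log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return score

a = {"title": "Moby Dick", "publisher": "Harper", "year": "1851", "format": "print"}
b = {"title": "Moby Dick", "publisher": "Harper", "year": "1851", "format": "microform"}
print(match_score(a, b))  # a high positive score suggests the same entity
```

Record pairs scoring above an upper threshold are declared matches, those below a lower threshold non-matches, and the rest are sent for clerical review; the thresholds are chosen from the estimated error rates.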
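And to make the base-frequency caveat in point 4 concrete: under the (strong) simplifying assumptions that the ten statements are independent and that each reporter states the true value of a two-valued attribute with probability p, Bayes' rule shows that a sufficiently rare value can remain improbable even when 9 of 10 statements assert it. All numbers here are invented for illustration.

```python
# Worked base-rate example for point 4, assuming independent reporters that
# each state the true value with probability p (illustrative numbers only).

def posterior_majority_correct(prior_majority, p, n_majority=9, n_minority=1):
    """P(majority value is true | 9 statements for it, 1 against),
    for a two-valued attribute."""
    prior_minority = 1.0 - prior_majority
    # Likelihood of the observed 9-vs-1 split under each hypothesis
    # (the shared binomial coefficient cancels).
    like_if_majority_true = p**n_majority * (1 - p)**n_minority
    like_if_minority_true = (1 - p)**n_majority * p**n_minority
    num = prior_majority * like_if_majority_true
    return num / (num + prior_minority * like_if_minority_true)

# Modestly reliable reporters (60%) asserting a value with a 1-in-100 base
# frequency: the majority value ends up only about 21% probable, so the
# minority value is still the better bet.
print(posterior_majority_correct(prior_majority=0.01, p=0.6))
```

Drop the independence assumption (the first bullet above) and the case for the majority value gets weaker still, since nine copies of one erroneous source count as little more than one.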
Received on Tuesday, 29 March 2011 22:06:00 UTC