- From: Richard Barnes <richard.barnes@gmail.com>
- Date: Fri, 13 Aug 2010 15:07:37 -0400
- To: David Singer <singer@apple.com>
- Cc: public-privacy@w3.org
David,

In principle, I think you're exactly right that re-identification can be a big problem, especially with the rich data sets that many organizations are collecting nowadays. (Our position paper [1] touches on this issue briefly, in a slightly different context.)

As I understand it, however (and I'm certainly not an expert), the challenge for making and implementing policy with regard to re-identification is that the mathematics are subtle and depend heavily on the types of data and the underlying population distributions. There's a fairly large body of work on how to do anonymization in specific domains (e.g., the techniques applied at the Census Bureau [2]), but I'm not aware of a methodology general enough to cover the diversity of data collected by entities on the Web. (Again, not an expert!)

The additional challenge, given the availability of some public data sets, is that it's not always possible for the maintainer of a data set to know what additional data a recipient might combine with it. A demographics provider such as Feeva may only provide information at ZIP-code granularity, but if a third-party analyst also knows a user's gender and date of birth, then you're back in the classical re-identification regime. (A rough sketch of this kind of linkage attack appears at the end of this message.)

I'm not sure all this means it's completely impossible to have any policies about re-identification, but you might have to constrain the scope of what you try to achieve. The fusion problem, in particular, seems insurmountable to me.

--Richard

[1] <http://www.w3.org/2010/api-privacy-ws/papers/privacy-ws-35.pdf>
[2] <http://lehd.did.census.gov/led/datatools/onthemap3.html>

On Aug 11, 2010 2:50 PM, "David Singer" <singer@apple.com> wrote:

> This is a 'discussion point'... I'm not even sure I can express it very well, but I think it worth raising.
>
> Imagine I interact with a web service, and my agreement with them is that any data collected 'about' me is anonymized, so that I am not personally identifiable in the database of records they build. They respect that agreement, but make the database available for analysis etc. But now, as we know, people are getting very good at re-identification.
>
> Clearly I don't like it if someone says, "I'm 95% sure that the guy who bought these five books is that Dave Singer who attends the W3C." I'd like to say, "Not only must my records be anonymized, but re-identification should not occur either." But this flies directly in the face of a very long-established principle: that the analysis and drawing of conclusions from public data is a legitimate, indeed even intended, use of that public data. And setting that rule would also drive re-identification "underground" -- people would still do it, they just wouldn't publish the results, which is *worse*.
>
> The best I can think of is to make sure any policy/rule about disclosure/warning applies to personally identifiable data *whether the identification was original or deductive*, but it doesn't feel ideal. In particular, the party doing the analysis may have no link (business relationship, etc.) with me at all. How would they disclose to me that they have deduced identifiable data? Under what incentive would they do that, anyway?
>
> Thoughts?
>
> David Singer
> Multimedia and Software Standards, Apple Inc.
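P.S. As a rough illustration of the linkage problem above, here is a minimal Python sketch. All of the data and field names are made up; the point is only that a unique match on the quasi-identifier triple (ZIP code, gender, date of birth) is enough to attach a name to a supposedly anonymous record:

```python
# Illustrative linkage attack: joining an "anonymized" data set to a
# public directory on quasi-identifiers (ZIP code, gender, birth date).
# All data and field names here are hypothetical.

anonymized_purchases = [
    {"zip": "02138", "gender": "F", "dob": "1965-07-02", "item": "book A"},
    {"zip": "02138", "gender": "M", "dob": "1971-01-15", "item": "book B"},
]

public_directory = [
    {"name": "Alice Example", "zip": "02138", "gender": "F", "dob": "1965-07-02"},
    {"name": "Bob Example",   "zip": "02139", "gender": "M", "dob": "1980-03-09"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")

def quasi_key(record):
    """Project a record onto just its quasi-identifier attributes."""
    return tuple(record[attr] for attr in QUASI_IDENTIFIERS)

# Index the public directory by quasi-identifier key.
directory_index = {}
for person in public_directory:
    directory_index.setdefault(quasi_key(person), []).append(person["name"])

# A purchase record is re-identified when exactly one directory entry
# shares its quasi-identifier key.
for purchase in anonymized_purchases:
    matches = directory_index.get(quasi_key(purchase), [])
    if len(matches) == 1:
        print(f"Re-identified: {matches[0]} bought {purchase['item']}")
```

Running this prints a name next to a purchase even though the purchase data set itself contains no names. The usual countermeasure in the anonymization literature is to generalize the quasi-identifiers (e.g., coarsen date of birth to year of birth) until every key matches several people rather than one -- which is exactly the kind of domain-specific tuning the Census work [2] involves, and why a general Web-wide policy is hard.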
Received on Friday, 13 August 2010 19:09:35 UTC