- From: Richard Barnes <richard.barnes@gmail.com>
- Date: Fri, 13 Aug 2010 15:07:37 -0400
- To: David Singer <singer@apple.com>
- Cc: public-privacy@w3.org
David,

In principle, I think you're exactly right that re-identification can be a big problem, especially with the rich data sets that many organizations are collecting nowadays. (Our position paper [1] touches on this issue briefly, in a slightly different context.)

As I understand it, however (and I'm certainly not an expert), the challenge for making and implementing policy with regard to re-identification is that the mathematics are subtle and depend heavily on the types of data and the underlying population distributions. There's a fairly large body of work on how to do anonymization in specific domains (e.g., the techniques applied at the Census Bureau [2]), but I'm not aware of a methodology general enough to cover the diversity of data collected by entities on the Web. (Again, not an expert!)

The additional challenge, given the availability of some public data sets, is that it's not always possible for the maintainer of a data set to know what additional data a recipient might combine with it. A demographics provider such as Feeva may only provide information at ZIP-code granularity, but if a third-party analyst also knows a user's gender and date of birth, then you're back in the classical re-identification regime. (A rough sketch of this kind of linkage attack appears at the end of this message.)

I'm not sure all this means it's completely impossible to have any policies about re-identification, but you might have to constrain the scope of what you try to achieve. The fusion problem, in particular, seems insurmountable to me.

--Richard

[1] <http://www.w3.org/2010/api-privacy-ws/papers/privacy-ws-35.pdf>
[2] <http://lehd.did.census.gov/led/datatools/onthemap3.html>

On Aug 11, 2010 2:50 PM, "David Singer" <singer@apple.com> wrote:

> This is a 'discussion point'... I'm not even sure I can express it very well, but I think it worth raising.
>
> Imagine I interact with a web service, and my agreement with them is that any data collected 'about' me is anonymized, so that I am not personally identifiable in the database of records they build. They respect that agreement, but make the database available for analysis etc. But now, as we know, people are getting very good at re-identification.
>
> Clearly I don't like it if someone says, "I'm 95% sure that the guy who bought these five books is that Dave Singer who attends the W3C." I'd like to say, "Not only must my records be anonymized, but re-identification should not occur either." But this flies directly in the face of a very long-established principle: that the analysis and drawing of conclusions from public data is a legitimate, indeed even intended, use of that public data. And setting that rule would also drive re-identification "underground" -- people would still do it, they just wouldn't publish the results, which is *worse*.
>
> The best I can think of is to make sure any policy/rule about disclosure/warning applies to personally identifiable data *whether the identification was original or deductive*, but it doesn't feel ideal. In particular, the party doing the analysis may have no link (business relationship, etc.) with me at all. How would they disclose to me that they have deduced identifiable data? Under what incentive would they do that, anyway?
>
> Thoughts?
>
> David Singer
> Multimedia and Software Standards, Apple Inc.
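P.S. As a rough illustration of the linkage problem above, here is a minimal Python sketch. All of the data and field names are made up; the point is only that a unique match on the quasi-identifier triple (ZIP code, gender, date of birth) is enough to attach a name to a supposedly anonymous record:

```python
# Illustrative linkage attack: joining an "anonymized" data set to a
# public directory on quasi-identifiers (ZIP code, gender, birth date).
# All data and field names here are hypothetical.

anonymized_purchases = [
    {"zip": "02138", "gender": "F", "dob": "1965-07-02", "item": "book A"},
    {"zip": "02138", "gender": "M", "dob": "1971-01-15", "item": "book B"},
]

public_directory = [
    {"name": "Alice Example", "zip": "02138", "gender": "F", "dob": "1965-07-02"},
    {"name": "Bob Example",   "zip": "02139", "gender": "M", "dob": "1980-03-09"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")

def quasi_key(record):
    """Project a record onto just its quasi-identifier attributes."""
    return tuple(record[attr] for attr in QUASI_IDENTIFIERS)

# Index the public directory by quasi-identifier key.
directory_index = {}
for person in public_directory:
    directory_index.setdefault(quasi_key(person), []).append(person["name"])

# A purchase record is re-identified when exactly one directory entry
# shares its quasi-identifier key.
for purchase in anonymized_purchases:
    matches = directory_index.get(quasi_key(purchase), [])
    if len(matches) == 1:
        print(f"Re-identified: {matches[0]} bought {purchase['item']}")
```

Running this prints a name next to a purchase even though the purchase data set itself contains no names. The usual countermeasure in the anonymization literature is to generalize the quasi-identifiers (e.g., coarsen date of birth to year of birth) until every key matches several people rather than one -- which is exactly the kind of domain-specific tuning the Census work [2] involves, and why a general Web-wide policy is hard.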
Received on Friday, 13 August 2010 19:09:35 UTC