Re: text defining de-identified data from Rob Sherman on 2013-03-06 (public-tracking@w3.org from March 2013)

From: Rob Sherman <robsherman@fb.com>
Date: Wed, 6 Mar 2013 16:41:09 +0000
To: Dan Auerbach <dan@eff.org>, "public-tracking@w3.org" <public-tracking@w3.org>
Message-ID: <AD30EAA8DFF4B1498B95130E78C8325025F9F06C@PRN-MBX01-4.TheFacebook.com>
Dan,

Thanks so much for putting this draft text together.

One of the considerations that we discussed in our small group during the F2F, which I realize we may not have brought back to the full group, is that a benefit of relying on the FTC's established language is that, as case law develops around the FTC standard, there will be greater clarity about what de-identification is adequate, particularly as technology evolves. If we use different operational language, people will know only that there is some daylight between the two standards, not exactly what would qualify as a "reasonable measure to ensure that the data is de-identified" but fail to provide a "reasonable level of justified confidence that the data cannot be used to infer any information about or otherwise be linked to, a particular consumer, device or user agent," or vice versa. This leads to less clarity in enforcing the standard and undermines one of the key benefits of relying on FTC language.

Recognizing that we all need to consider the broader implications of using the FTC definition in this context before deciding whether this is the right path, would you be willing to consider adopting the same standard as the FTC has concluded is appropriate?  If not, it would be helpful if you could explain what steps might satisfy one standard but not the other and how we will let people know what is a "reasonable measure" but does not create a "reasonable level of justified confidence."

One other thing that we discussed at the F2F was whether it is useful to have requirements in specific parts of the spec that people make certain assertions — apart from whatever server responses and other statements they must make.  The thinking is that if we decide as a group we need confirmations that a server is doing particular things, we should consolidate those into one signal.  The alternative is having a server that is substantively complying with the standard but is actually in noncompliance because it is missing certain "magic words."  Could we defer discussion of the public assertion of compliance to a global discussion about our draft more broadly, rather than as a one-off in the context of de-identification?

Rob

Rob Sherman
Facebook | Manager, Privacy and Public Policy
1155 F Street, NW Suite 475 | Washington, DC 20004
office 202.370.5147 | mobile 202.257.3901

From: Dan Auerbach <dan@eff.org>
Date: Monday, March 4, 2013 12:23 PM
To: "public-tracking@w3.org" <public-tracking@w3.org>
Subject: text defining de-identified data
Resent-From: <public-tracking@w3.org>
Resent-Date: Monday, March 4, 2013 12:24 PM


Hi everyone,

I wanted to pass along some text regarding de-identification that Peter asked me to prepare, based largely on the FTC language that was discussed at the F2F and additionally including what I consider important non-normative text for guidance. I suspect the "example"/"remark" labels within the non-normative text below are non-standard W3C terms, and I am happy to take guidance from members of the group more familiar with W3C to amend this language to fit W3C style.

Best,
Dan

--

Normative text:

Data can be considered sufficiently de-identified to the extent that a company:

  1.  sufficiently deletes, scrubs, aggregates, anonymizes and otherwise manipulates the data in order to achieve a reasonable level of justified confidence that the data cannot be used to infer any information about, or otherwise be linked to, a particular consumer, device or user agent;

  2.  publicly commits not to try to re-identify the data, except in order to test the soundness of the de-identified data; and

  3.  contractually prohibits downstream recipients from trying to re-identify the data.

Non-normative text:

Example 1. Hashing a pseudonym such as a cookie string does NOT provide sufficient de-identification for an otherwise rich data set, since there are many ways to re-identify individuals based on pseudonymous data.

Example 2. In many cases, keeping only high-level aggregate data, such as the total number of visitors to a website each day broken down by country (discarding data from countries without many visitors), would be considered sufficiently de-identified.

Example 3. Deleting data is always a safe and easy way to achieve de-identification.
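
The contrast between Examples 1 and 2 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not part of the proposed text: the log records, the field layout, the helper names, and in particular the `min_visitors` threshold, which the draft does not specify.

```python
import hashlib
from collections import Counter

# Hypothetical log records: (cookie_id, day, country, url).
LOG = [
    ("cookie-A", "2013-03-01", "US", "/news"),
    ("cookie-A", "2013-03-01", "US", "/health/clinic-finder"),
    ("cookie-B", "2013-03-01", "US", "/news"),
    ("cookie-C", "2013-03-01", "DE", "/news"),
    ("cookie-A", "2013-03-02", "US", "/jobs"),
    ("cookie-D", "2013-03-01", "IS", "/news"),
]


def hash_pseudonym(cookie_id: str) -> str:
    """Replace a cookie string with its SHA-256 digest (Example 1)."""
    return hashlib.sha256(cookie_id.encode("utf-8")).hexdigest()


def hashed_log(log):
    """Hash the pseudonym but keep the rest of each record intact."""
    return [(hash_pseudonym(c), day, country, url) for c, day, country, url in log]


def daily_visitors_by_country(log, min_visitors=2):
    """Keep only per-day, per-country visitor counts (Example 2).

    Groups with fewer than `min_visitors` distinct visitors are discarded.
    The threshold value is an illustrative assumption.
    """
    distinct = {(c, day, country) for c, day, country, _ in log}
    counts = Counter((day, country) for _, day, country in distinct)
    return {key: n for key, n in counts.items() if n >= min_visitors}


if __name__ == "__main__":
    # Hashing is deterministic, so one visitor's records still share a key
    # and the full browsing history remains linkable.
    target = hash_pseudonym("cookie-A")
    print([url for h, _, _, url in hashed_log(LOG) if h == target])
    # Aggregation drops the per-visitor key and the small groups entirely.
    print(daily_visitors_by_country(LOG))
```

In the hashed log, every record from the same cookie still carries the same digest, so the per-visitor history survives unchanged. The aggregate keeps no per-visitor key at all, and small groups are dropped before anything is retained.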

Remark 1. De-identification is a property of data. If data can be considered de-identified according to the “reasonable level of justified confidence” clause of (1), then no data manipulation process needs to take place in order to satisfy the requirements of (1).

Remark 2. There is a diversity of techniques being researched and developed to de-identify data sets (e.g. [1][2]), and companies are encouraged to explore and innovate new approaches to fit their needs.

Remark 3. It is a best practice for companies to perform “penetration testing” by having an expert with access to the data attempt to re-identify individuals or disclose attributes about them. The expert need not actually identify or disclose the attribute of an individual, but if the expert demonstrates how this could plausibly be achieved by joining the data set against other public data sets or private data sets accessible to the company, then the data set in question should no longer be considered sufficiently de-identified and changes should be made to provide stronger anonymization for the data set.
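
One simple form of the test described in Remark 3 is a linkage check: join the released data against an auxiliary data set on shared attributes and look for records that match exactly one person. The following Python sketch assumes hypothetical field names (`zip`, `birth_year`, `gender`, `name`) and toy data. It shows only one possible check. Finding no unique match would not by itself show that the data is sufficiently de-identified.

```python
from collections import defaultdict

# Hypothetical attributes shared by both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

# Toy "de-identified" release: no names, but quasi-identifiers remain.
RELEASED = [
    {"zip": "94110", "birth_year": 1980, "gender": "F", "visited": "/health"},
    {"zip": "20004", "birth_year": 1975, "gender": "M", "visited": "/news"},
]

# Toy auxiliary data set (public, or private but accessible to the company).
AUXILIARY = [
    {"name": "Person 1", "zip": "94110", "birth_year": 1980, "gender": "F"},
    {"name": "Person 2", "zip": "20004", "birth_year": 1975, "gender": "M"},
    {"name": "Person 3", "zip": "20004", "birth_year": 1975, "gender": "M"},
]


def find_unique_links(released, auxiliary, keys=QUASI_IDENTIFIERS):
    """Return (row_index, name) for released rows matching exactly one person."""
    index = defaultdict(list)
    for person in auxiliary:
        index[tuple(person[k] for k in keys)].append(person["name"])

    links = []
    for i, row in enumerate(released):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:
            links.append((i, candidates[0]))
    return links


if __name__ == "__main__":
    # Row 0 matches a single person; row 1 matches two, so it is not reported.
    print(find_unique_links(RELEASED, AUXILIARY))
```

Any row reported by such a check shows how re-identification could plausibly be achieved, which under Remark 3 is enough to conclude that the data set needs stronger anonymization.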

[1] https://research.microsoft.com/pubs/116123/dwork_cacm.pdf

[2] http://www.cs.purdue.edu/homes/ninghui/papers/t_closeness_icde07.pdf

--
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
Received on Wednesday, 6 March 2013 16:41:40 UTC