de-identification text for Wednesday's call

Hi everyone,

Given that de-identification is on the agenda for Wednesday, I wanted to
send out the current state of the de-identification text. The normative
text has not changed since the end of the last email thread. I made some
small tweaks to tighten up the non-normative language, but nothing has
changed conceptually.

We are also putting a pin in the issue of requirements and commitments
that a DNT-compliant entity must make with respect to de-identification.
I think such a specific commitment is warranted, but we agreed to have
that discussion separately.

Thanks again to everyone for the feedback,
Dan

Normative text:

Data can be considered sufficiently de-identified to the extent that it
has been deleted, modified, aggregated, anonymized or otherwise
manipulated in order to achieve a reasonable level of justified
confidence that the data cannot reasonably be used to infer information
about, or otherwise be linked to, a particular user, user agent, or device.

Non-normative text:

Example 1. In general, using unique or near-unique pseudonymous
identifiers to link records of a particular user, user agent, or device
within a large data set does NOT provide sufficient de-identification.
Even absent obvious identifiers such as names, email addresses, or zip
codes, there are many ways to gain information about individuals based
on pseudonymous data.
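
As a rough illustration of Example 1 (not part of the proposed text),
the sketch below shows how linking records under a pseudonymous
identifier can single out a user even when every individual record is
shared by several users. The event format, the function name
singled_out, and the sample data are all hypothetical.

```python
from collections import defaultdict


def singled_out(events):
    """Return pseudonymous ids whose linked set of pages matches no other id.

    `events` is an iterable of (pseudonymous_id, page) pairs. This record
    layout is an assumption made for illustration only.
    """
    # Link all records that share a pseudonymous identifier.
    pages_by_id = defaultdict(set)
    for pseudonym, page in events:
        pages_by_id[pseudonym].add(page)

    # Group identifiers by their complete linked profile.
    ids_by_profile = defaultdict(list)
    for pseudonym, pages in pages_by_id.items():
        ids_by_profile[frozenset(pages)].append(pseudonym)

    # A profile held by exactly one identifier singles that identifier out.
    return sorted(ids[0] for ids in ids_by_profile.values() if len(ids) == 1)


# Hypothetical data: every page is visited by at least two pseudonyms.
EVENTS = [
    ("p1", "news"), ("p1", "sports"), ("p1", "clinic"),
    ("p2", "news"), ("p2", "sports"),
    ("p3", "news"), ("p3", "clinic"),
    ("p4", "news"), ("p4", "sports"),
]

print(singled_out(EVENTS))  # ['p1', 'p3']
```

No single page visit is unique here, but the linked profiles of p1 and
p3 are, which is the kind of leakage the example warns about.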

Example 2. In general, keeping only high-level aggregate data across a
small number of dimensions, such as the total number of visitors of a
website each day broken down by country (discarding data from countries
without many visitors), would be considered sufficiently de-identified.
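
A minimal sketch of the aggregation in Example 2, offered only as an
illustration. It assumes visit records of the form (day, country,
visitor_id) and reads "without many visitors" as a per-day, per-country
threshold; the function name aggregate_visits and the min_count value
are hypothetical and would need to be chosen by the party holding the
data.

```python
from collections import defaultdict


def aggregate_visits(visits, min_count=3):
    """Count distinct visitors per (day, country), dropping small cells.

    `visits` is an iterable of (day, country, visitor_id) tuples.
    `min_count` is an assumed suppression threshold, not a value taken
    from the proposed text.
    """
    visitors = defaultdict(set)
    for day, country, visitor_id in visits:
        visitors[(day, country)].add(visitor_id)

    # Keep only the counts; the visitor ids themselves are discarded.
    return {
        cell: len(ids)
        for cell, ids in visitors.items()
        if len(ids) >= min_count
    }


# Hypothetical data: three visitors from one country, one from another.
VISITS = [
    ("2013-04-01", "US", "v1"),
    ("2013-04-01", "US", "v2"),
    ("2013-04-01", "US", "v3"),
    ("2013-04-01", "LI", "v4"),
]

print(aggregate_visits(VISITS))  # {('2013-04-01', 'US'): 3}
```

Only the counts survive; the per-visitor records and the small cell are
both gone from the output.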

Example 3. Deleting data is always a safe and easy way to achieve
de-identification.

Remark 1. De-identification is a property of data. If data can be
considered de-identified according to the "reasonable level of justified
confidence" clause of the normative text, then no data manipulation
process needs to take place in order to satisfy its requirements.

Remark 2. A variety of techniques are being researched and developed to
de-identify data sets [1][2], and companies are encouraged to explore
and develop new approaches to fit their needs.
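
As one illustration of the kind of technique this research covers, the
sketch below adds Laplace noise to a count, a standard mechanism from
the differential privacy literature. It is an assumption-laden example
rather than a recommendation: the function name noisy_count and the
choice of epsilon are hypothetical, and the noise scale of 1/epsilon
assumes that each individual changes the count by at most one.

```python
import random


def noisy_count(true_count, epsilon, rng=random):
    """Return `true_count` plus Laplace noise with scale 1 / epsilon.

    Assumes a counting query with sensitivity 1. Smaller epsilon means
    more noise and a stronger privacy guarantee.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The difference of two independent exponential variables with rate
    # epsilon follows a Laplace distribution with scale 1 / epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Usage: publish a noisy daily visitor count instead of the exact one.
published = noisy_count(1042, epsilon=0.5)
```

The published value differs from run to run, so no expected output is
shown.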

Remark 3. It is a best practice for companies to perform "privacy
penetration testing" by having an expert with access to the data attempt
to re-identify individuals or disclose attributes about them. The expert
need not actually identify an individual or disclose an attribute. If
the expert demonstrates how this could plausibly be achieved by joining
the data set against other public data sets or private data sets
accessible to the company, then the data set in question should no
longer be considered sufficiently de-identified, and changes should be
made to provide stronger anonymization.
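
The join described in Remark 3 could be sketched roughly as follows.
This is only an illustration: the field names (zip, birth_year, name),
the function names, and the sample rows are hypothetical, and a real
test would use whatever quasi-identifiers the data sets actually share.

```python
def linkage_candidates(released, public, keys):
    """Map each released row index to the public names matching it on `keys`."""
    names_by_key = {}
    for person in public:
        key = tuple(person[k] for k in keys)
        names_by_key.setdefault(key, []).append(person["name"])
    return {
        i: names_by_key.get(tuple(row[k] for k in keys), [])
        for i, row in enumerate(released)
    }


def reidentified(released, public, keys):
    """Return released rows that match exactly one person in `public`."""
    candidates = linkage_candidates(released, public, keys)
    return {i: names[0] for i, names in candidates.items() if len(names) == 1}


# Hypothetical "de-identified" release and a hypothetical public data set.
RELEASED = [
    {"zip": "94110", "birth_year": 1980, "visited": "clinic.example"},
    {"zip": "94109", "birth_year": 1975, "visited": "news.example"},
]
PUBLIC = [
    {"name": "Alice", "zip": "94110", "birth_year": 1980},
    {"name": "Bob", "zip": "94109", "birth_year": 1975},
    {"name": "Carol", "zip": "94109", "birth_year": 1975},
]

print(reidentified(RELEASED, PUBLIC, ("zip", "birth_year")))  # {0: 'Alice'}
```

Row 0 matches a single person, so under Remark 3 this release would not
count as sufficiently de-identified. Row 1 matches two people and is
not singled out by this particular join.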

[1] https://research.microsoft.com/pubs/116123/dwork_cacm.pdf

[2] http://www.cs.purdue.edu/homes/ninghui/papers/t_closeness_icde07.pdf



-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134

Received on Tuesday, 2 April 2013 08:21:45 UTC