text defining de-identified data from Dan Auerbach on 2013-03-04 (public-tracking@w3.org from March 2013)

From: Dan Auerbach <dan@eff.org>
Date: Mon, 04 Mar 2013 09:23:28 -0800
To: "public-tracking@w3.org" <public-tracking@w3.org>
Message-ID: <5134D890.5040408@eff.org>
Hi everyone,

I wanted to pass along some text regarding de-identification that Peter
asked me to prepare, based largely on the FTC language that was discussed
at the f2f and additionally including what I consider important
non-normative text for guidance. I suspect the "example"/"remark" labels
within the non-normative text below are non-standard W3C terms, and I am
happy to take guidance from members of the group more familiar with W3C
to amend this language appropriately to fit W3C style.

Best,
Dan

--

Normative text:

Data can be considered sufficiently de-identified to the extent that a
company:

 1. sufficiently deletes, scrubs, aggregates, anonymizes and otherwise
    manipulates the data in order to achieve a reasonable level of
    justified confidence that the data cannot be used to infer any
    information about, or otherwise be linked to, a particular consumer,
    device or user agent;

 2. publicly commits not to try to re-identify the data, except in order
    to test the soundness of the de-identified data; and

 3. contractually prohibits downstream recipients from trying to
    re-identify the data.

Non-normative text:

Example 1. Hashing a pseudonym such as a cookie string does NOT provide
sufficient de-identification for an otherwise rich data set, since there
are many ways to re-identify individuals based on pseudonymous data.
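
To illustrate Example 1, here is a minimal Python sketch. The log
records, cookie values and URLs are made up for illustration, and SHA-256
is only one possible choice of hash. It shows that a hashed cookie is
still a stable pseudonym: every record from one browser maps to the same
digest, so the per-browser profile is unchanged.

```python
import hashlib
from collections import defaultdict


def hash_cookie(cookie: str) -> str:
    """Replace a cookie string with its SHA-256 hex digest."""
    return hashlib.sha256(cookie.encode("utf-8")).hexdigest()


# Hypothetical log records: (cookie, URL visited).
raw_log = [
    ("cookie-abc", "https://example.com/clinic-directions"),
    ("cookie-xyz", "https://example.com/news"),
    ("cookie-abc", "https://example.com/jobs-in-springfield"),
    ("cookie-abc", "https://example.com/account/settings"),
]

# "De-identify" by hashing the cookie and keeping everything else.
hashed_log = [(hash_cookie(cookie), url) for cookie, url in raw_log]


def group_by_id(log):
    """Collect the URLs seen for each identifier, in log order."""
    profiles = defaultdict(list)
    for identifier, url in log:
        profiles[identifier].append(url)
    return profiles


raw_profiles = group_by_id(raw_log)
hashed_profiles = group_by_id(hashed_log)

# The same browsing profiles survive hashing; only the label changed.
print(sorted(raw_profiles.values()) == sorted(hashed_profiles.values()))
```

The rich per-browser history is what enables re-identification, and
hashing the identifier leaves that history intact.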

Example 2. In many cases, keeping only high-level aggregate data, such
as the total number of visitors of a website each day broken down by
country (discarding data from countries without many visitors) would be
considered sufficiently de-identified.
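
A rough Python sketch of the aggregation in Example 2 follows. The
function name, the record layout and the MIN_VISITORS threshold of 50 are
all assumptions made for illustration; the text above does not say how
many visitors count as "many".

```python
from collections import Counter

# Hypothetical threshold: cells with fewer distinct visitors are dropped.
MIN_VISITORS = 50


def aggregate_visits(visits, min_visitors=MIN_VISITORS):
    """Count distinct visitors per (day, country) and drop small cells.

    `visits` is an iterable of (day, country, visitor_id) tuples. Only
    the counts are returned; visitor ids are discarded.
    """
    distinct = {(day, country, visitor) for day, country, visitor in visits}
    counts = Counter((day, country) for day, country, _ in distinct)
    return {cell: n for cell, n in counts.items() if n >= min_visitors}


# Made-up input: 60 visitors from "US" and 3 from "IS" on the same day.
visits = [("2013-03-04", "US", f"visitor-{i}") for i in range(60)]
visits += [("2013-03-04", "IS", f"visitor-{i}") for i in range(60, 63)]

daily_totals = aggregate_visits(visits)
print(daily_totals)
```

Only the ("2013-03-04", "US") cell is kept here, because the "IS" cell
falls below the assumed threshold.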

Example 3. Deleting data is always a safe and easy way to achieve
de-identification.

Remark 1. De-identification is a property of data. If data can be
considered de-identified according to the "reasonable level of justified
confidence" clause of (1), then no data manipulation process needs to
take place in order to satisfy the requirements of (1).

Remark 2. A diversity of techniques is being researched and developed
to de-identify data sets (e.g. [1][2]), and companies are encouraged to
explore and develop new approaches to fit their needs.
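
As one illustration of the kind of technique Remark 2 has in mind, here
is a minimal Python sketch of releasing a count with Laplace noise, the
basic mechanism in the differential privacy literature that [1] belongs
to. The function names and the epsilon value are assumptions chosen for
illustration, not a recommendation, and a production system would need
more care (for example around floating-point behaviour and repeated
queries).

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng=None):
    """Return the count plus Laplace noise of scale 1 / epsilon.

    A counting query changes by at most 1 when one person is added or
    removed, so its sensitivity is 1.
    """
    if rng is None:
        rng = random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical release: 1000 visitors, epsilon = 0.5 (noise scale 2).
print(noisy_count(1000, epsilon=0.5))
```

Smaller values of epsilon add more noise and give stronger protection at
the cost of accuracy.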

Remark 3. It is a best practice for companies to perform "penetration
testing" by having an expert with access to the data attempt to
re-identify individuals or disclose attributes about them. The expert
need not actually identify or disclose the attribute of an individual,
but if the expert demonstrates how this could plausibly be achieved by
joining the data set against other public data sets or private data sets
accessible to the company, then the data set in question should no
longer be considered sufficiently de-identified and changes should be
made to provide stronger anonymization for the data set.
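
The join described in Remark 3 can be sketched in Python roughly as
follows. All records, field names and the choice of quasi-identifiers
(zip code and birth year) are invented for illustration. An actual test
would be designed by the expert around the auxiliary data realistically
available.

```python
from collections import defaultdict


def linkage_test(dataset, auxiliary, quasi_identifiers):
    """Find records that can be linked one-to-one via quasi-identifiers.

    Returns (dataset_record, auxiliary_record) pairs whose
    quasi-identifier values are unique in both data sets.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    def index(records):
        by_key = defaultdict(list)
        for record in records:
            by_key[key(record)].append(record)
        return by_key

    ds, aux = index(dataset), index(auxiliary)
    return [
        (ds[k][0], aux[k][0])
        for k in ds
        if len(ds[k]) == 1 and len(aux.get(k, [])) == 1
    ]


# Made-up "de-identified" data set and made-up public data set.
dataset = [
    {"zip": "94110", "birth_year": 1975, "interest": "oncology"},
    {"zip": "94110", "birth_year": 1980, "interest": "sports"},
    {"zip": "94110", "birth_year": 1980, "interest": "travel"},
]
public = [
    {"name": "Alice Example", "zip": "94110", "birth_year": 1975},
    {"name": "Bob Example", "zip": "94110", "birth_year": 1980},
]

matches = linkage_test(dataset, public, ["zip", "birth_year"])
for record, person in matches:
    print(person["name"], "->", record["interest"])
```

Any non-empty result is the kind of demonstration Remark 3 describes: the
expert has shown a plausible path to attribute disclosure without needing
to act on it.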

[1] https://research.microsoft.com/pubs/116123/dwork_cacm.pdf

[2] http://www.cs.purdue.edu/homes/ninghui/papers/t_closeness_icde07.pdf

-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
Received on Monday, 4 March 2013 17:24:05 UTC