Re: ACTION-371: text defining de-identified data from Roy T. Fielding on 2013-03-06 (public-tracking@w3.org from March 2013)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Wed, 6 Mar 2013 15:10:37 -0700
To: Peter Swire <peter@peterswire.net>
Cc: Dan Auerbach <dan@eff.org>, "public-tracking@w3.org" <public-tracking@w3.org>
Message-Id: <7A0978F4-9075-4661-A3C2-BBAA06E6165F@gbiv.com>

On Mar 6, 2013, at 9:28 AM, Peter Swire forwarded:
> Normative text:
> Data can be considered sufficiently de-identified to the extent that a company:

A "company" has nothing to do with the state of the data.  This
definition needs to be phrased in terms of the data, not a process,
especially since a person doesn't need to be a company to collect data.

> sufficiently deletes, scrubs, aggregates, anonymizes and otherwise manipulates the data in order to achieve a reasonable level of justified confidence that the data cannot be used to infer any information about, or otherwise be linked to, a particular consumer, device or user agent;
> 
"Scrubs" is not a useful term.  I believe that "used to infer
any information about" is far too broad.  Anything useful in the
data is going to be information about a particular user, even if we
cannot determine who that user might be, such as what browser was
used or what time the service was accessed.

What we care about preventing is the link to a particular user.
Including all of this other verbiage just loses the point of
the definition and interferes with established best practice
for anonymous data.

> publicly commits not to try to re-identify the data, except in order to test the soundness of the de-identified data; and

This is not part of the definition.  We might add such a requirement
on processors, but it doesn't belong in the meaning of the term.

> contractually prohibits downstream recipients from trying to re-identify the data.
> 
This third bullet is not possible.  Please understand that de-identified
data includes such things as purely anonymous aggregate counts which
are then published openly.  It is absurd to suggest that contracts
are necessary (or even useful) to manage the output of de-identified
data -- any data that is de-identified is no longer in scope as a
concern for this standard.

My suggestion for a replacement is as follows:

  Data has been de-identified if it has been sufficiently deleted,
  modified, aggregated, anonymized, or otherwise manipulated in order
  to achieve a reasonable level of confidence that the remaining data
  is not and cannot be associated with a particular user, user agent,
  or device.

Cheers,

Roy T. Fielding                     <http://roy.gbiv.com/>
Senior Principal Scientist, Adobe   <https://www.adobe.com/>

Received on Wednesday, 6 March 2013 22:11:17 UTC