Re: ACTION-371: text defining de-identified data from Dan Auerbach on 2013-03-06 (public-tracking@w3.org from March 2013)

From: Dan Auerbach <dan@eff.org>
Date: Wed, 06 Mar 2013 15:43:36 -0800
To: public-tracking@w3.org
Message-ID: <5137D4A8.7040403@eff.org>

On 03/06/2013 02:10 PM, Roy T. Fielding wrote:
> On Mar 6, 2013, at 9:28 AM, Peter Swire forwarded:
>>
>> Normative text:
>>
>> Data can be considered sufficiently de-identified to the extent that
>> a company:
>>
>
> A "company" has nothing to do with the state of the data.  This
> definition needs to be phrased in terms of the data, not a process,
> especially since a person doesn't need to be a company to collect data.

I agree with this, though the FTC definition is in terms of "a company"
and it sounds like many in the WG think that there is value to sticking
close to this definition in order to unify the standard as much as
possible with existing language. But I will try to remove it, while
keeping the meat of the FTC def.

>
>> 1.
>>
>>     sufficiently deletes, scrubs, aggregates, anonymizes and
>>     otherwise manipulates the data in order to achieve a reasonable
>>     level of justified confidence that the data cannot be used to
>>     infer any information about, or otherwise be linked to, a
>>     particular consumer, device or user agent;
>>
>>
> Scrubs is not a useful term.  I believe that "used to infer
> any information about" is far too broad.  Anything useful in the
> data is going to be information about a particular user even if we
> cannot determine who that user might be, such as what browser was
> used or what time the service was accessed.
>
> What we care about preventing is the link to a particular user.
> Including all of this other verbiage is just losing the point of
> the definition and interfering with established best practice
> with anonymous data.

I am happy to remove "scrubs" and, following Rob and Shane's suggestion,
will also remove the word "any" to match the FTC definition. However, I
disagree about "used to infer information about". I consider this a
strength of the FTC definition. As Ed rightly pointed out at the DC
workshop on unlinkability, users care about attribute disclosure, not
only re-identification. For example, if I know that at least one of
ten requests that I see to my web server for the URL
https://example.com/embarrassing had to come from you, but I do not know
which one belongs to you, then I have not re-identified you from my data
set, but I know that you visited https://example.com/embarrassing.
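
To make the attribute disclosure point concrete, here is a rough sketch
in Python. Everything in it is an assumption for illustration: the
function name, the group labels, and the log entries are made up, and it
only checks the simplest case, where every record in a group of
indistinguishable records shares the same URL.

```python
from collections import defaultdict


def attribute_disclosures(records, k):
    """Return {group: url} for groups of at least k records that all share one URL.

    records is a list of (group, url) pairs, where "group" stands for
    whatever remains of the identifying fields after de-identification.
    """
    groups = defaultdict(list)
    for group, url in records:
        groups[group].append(url)
    return {
        group: urls[0]
        for group, urls in groups.items()
        if len(urls) >= k and len(set(urls)) == 1
    }


# Hypothetical log: ten requests that look identical after de-identification.
log = [("group-A", "https://example.com/embarrassing")] * 10
# A second, smaller group whose members visited different URLs.
log += [
    ("group-B", "https://example.com/news"),
    ("group-B", "https://example.com/sports"),
]

leaks = attribute_disclosures(log, k=10)
print(leaks)
```

No single row in group-A can be tied to you, yet anyone who knows you
are somewhere in that group learns which URL you visited.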


>
>> 2.
>>
>>     publicly commits not to try to re-identify the data, except in
>>     order to test the soundness of the de-identified data; and
>>
>
> This is not part of the definition.  We might add such a requirement
> on processors, but it doesn't belong as the meaning of the term.

Rob S also brought this up and I'm happy to remove it here, provided we
include such a requirement elsewhere to match the FTC language as
closely as possible.

>
>> 3.
>>
>>     contractually prohibits downstream recipients from trying to
>>     re-identify the data.
>>
>>
> This third bullet is not possible.  Please understand that de-identified
> data includes such things as purely anonymous aggregate counts which
> are then published openly.  It is absurd to suggest that contracts
> are necessary (or even useful) to manage the output of deidentified
> data -- any data that is de-identified is no longer in scope as a
> concern for this standard.

I think de-identification is incredibly hard. Even data that at first
blush you might consider to be totally anonymized could lend itself to
re-identification or attribute disclosure attacks. AFAIK there is no
formal mechanism for proving an anonymized dataset is impervious to
re-identification attacks (short of deletion). Given this dearth of
theoretical understanding on this very new issue, we want to give
organizations leeway to keep some data if they do a rigorous
anonymization job, rather than tie their hands so much that deletion
is the only option.

That leaves us in a place where this aggregated data realistically could
be re-identified via some sort of clever attack. For some data sets, for
example Google's recently released generalized NSL numbers
(https://www.google.com/transparencyreport/userdatarequests/US/), the
total number of bits of information is small enough that I think a
contract is not necessary. However, I don't think it is at all absurd
to create this contractual obligation for larger data sets, and I think
such obligations exist in other contexts such as HIPAA. Perhaps we
could add a clause that distinguishes the two situations. Happy to
consider this for the next draft.
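
One rough way to think about "small enough", sketched in Python. This is
only a crude ceiling under my own assumptions, not a proposed test: the
function name, the cell counts, and the population figure of 7 billion
are all hypothetical, and a small bit count says nothing about attribute
disclosure.

```python
import math


def max_bits(possible_values_per_cell):
    """Upper bound on the bits a release can carry.

    Each published cell that can take n distinct values contributes at
    most log2(n) bits, so the sum bounds the whole release.
    """
    return sum(math.log2(n) for n in possible_values_per_cell)


# Hypothetical release: 8 published cells, each reported as one of 4 coarse ranges.
small_release = [4] * 8
bits = max_bits(small_release)

# Bits needed to single out one person among an assumed 7 billion people.
bits_to_single_out = math.log2(7_000_000_000)

print(bits, bits < bits_to_single_out)
```

A release whose ceiling sits well below that threshold is the kind of
case where a contract seems unnecessary; a large per-user data set would
be far above it.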


>
> My suggestion for a replacement is as follows:
>
>   Data has been de-identified if it has been sufficiently deleted,
>   modified, aggregated, anonymized, or otherwise manipulated in order
>   to achieve a reasonable level of confidence that the remaining data
>   is not and cannot be associated with a particular user, user agent,
>   or device.

Thanks. I think there are reasons for sticking closer to the FTC
definition, and the group seemed to gravitate towards that definition at
the f2f. But I will work on a draft that incorporates your suggestions.


>
> Cheers,
>
> Roy T. Fielding                     <http://roy.gbiv.com/>
> Senior Principal Scientist, Adobe   <https://www.adobe.com/>
>


-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
Received on Wednesday, 6 March 2013 23:44:05 UTC