Re: ACTION-371: text defining de-identified data from Roy T. Fielding on 2013-03-07 (public-tracking@w3.org from March 2013)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Wed, 6 Mar 2013 19:46:43 -0700
To: Dan Auerbach <dan@eff.org>
Cc: public-tracking@w3.org
Message-Id: <2071B30A-86C8-47F3-8DC6-2475537513BB@gbiv.com>
On Mar 6, 2013, at 4:43 PM, Dan Auerbach wrote:
> On 03/06/2013 02:10 PM, Roy T. Fielding wrote:
>> What we care about preventing is the link to a particular user.
>> Including all of this other verbiage is just losing the point of
>> the definition and interfering with established best practice
>> with anonymous data.
> I am happy to remove "scrubs", and also, following Rob and Shane's suggestion, will remove the word "any" to match the FTC definition. However, I disagree about "used to infer information about". I consider this a strength of the FTC definition. As Ed rightly pointed out at the DC workshop on unlinkability, users care about attribute disclosure, not only re-identification. For example, if I know that at least one of ten requests that I see to my web server for the URL https://example.com/embarrassing had to come from you, but I do not know which one belongs to you, then I have not re-identified you from my data set, but I know that you visited https://example.com/embarrassing.

If I know that a particular user visited a particular site,
then that data can be associated with a user and is not de-identified.
To be clear, I am not talking about specific data records -- data
includes any information obtainable from the retained bits.

If I were to split the data into two sets, with one being
de-identified for sharing with others and the other set still
retaining associations with particular users, then I still
haven't de-identified the data in terms of my own retention.

However, if I were to share only the de-identified set with
some other party and keep the rest confidential, then the fact
that it might be re-identified by me does not change the
essential characteristics of the data shared: I am allowed
to share it because the de-identified subset contains no
personal data. The other party is allowed to retain it because
they are unable to re-identify that data.

Likewise, if I retain the identifiable subset only for six weeks
and then delete it (retaining only knowledge of the de-identified
data subset), then I have satisfied the requirement that such data
be de-identified if I want to retain it beyond six weeks.
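
A minimal sketch of that retention split, in Python. The record
fields, the per-page aggregate, and the helper names are assumptions
made for illustration; nothing here is prescribed by the draft text.

```python
from collections import Counter
from datetime import datetime, timedelta

RETENTION = timedelta(weeks=6)  # the six-week window from the example

# Hypothetical identifiable records (field names are assumptions).
identifiable = [
    {"user_id": "u1", "page": "/a", "seen": datetime(2013, 1, 1)},
    {"user_id": "u2", "page": "/a", "seen": datetime(2013, 3, 1)},
    {"user_id": "u3", "page": "/b", "seen": datetime(2013, 3, 5)},
]

def aggregate(records):
    """Per-page request counts; no user_id or timestamp is carried over."""
    return dict(Counter(r["page"] for r in records))

def purge(records, now):
    """Drop identifiable records that have reached the retention window."""
    return [r for r in records if now - r["seen"] < RETENTION]

now = datetime(2013, 3, 7)
page_counts = aggregate(identifiable)    # the subset kept beyond six weeks
identifiable = purge(identifiable, now)  # the record from 1 January is deleted
```

Whether counts this small would actually qualify as de-identified is
a separate question; the sketch only shows the split between the two
sets and their different lifetimes.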

While I greatly respect the FTC's technical competence in this
area, they aren't exactly concerned with overreaching in their
definitions -- it is only guidance and they have nice humans in
the loop to clarify based on the intent of actual cases.
I, on the other hand, need to know whether it is allowable to
state "10,000 visitors on my site are from Chicago" without
being concerned about being sued just because that information
is about users and was derived from log files that at some
time in the past contained data that could be associated with
a particular user.
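
A statement of that kind could be derived along these lines. This is
a sketch under assumed log fields ("user_id", "ip", "city" are
hypothetical names), not a description of any real logging system.

```python
from collections import Counter

# Hypothetical log records; the field names are assumptions.
log_records = [
    {"user_id": "u1", "ip": "203.0.113.5", "city": "Chicago"},
    {"user_id": "u1", "ip": "203.0.113.5", "city": "Chicago"},
    {"user_id": "u2", "ip": "203.0.113.9", "city": "Chicago"},
    {"user_id": "u3", "ip": "198.51.100.7", "city": "Denver"},
]

def visitors_per_city(records):
    """Count distinct visitors per city.

    The identifiers are used only to avoid double counting; the
    returned mapping contains city names and totals, nothing else.
    """
    seen = set()
    counts = Counter()
    for r in records:
        key = (r["user_id"], r["city"])
        if key not in seen:
            seen.add(key)
            counts[r["city"]] += 1
    return dict(counts)

city_counts = visitors_per_city(log_records)
```

The output carries no user_id or IP address, only totals per
category, although it was computed from records that did.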

>>> publicly commits not to try to re-identify the data, except in order to test the soundness of the de-identified data; and
>> 
>> This is not part of the definition.  We might add such a requirement
>> on processors, but it doesn't belong as the meaning of the term.
> 
> Rob S also brought this up and I'm happy to remove it here, provided we include such a requirement elsewhere to match the FTC language as closely as possible.
> 
>> 
>>> contractually prohibits downstream recipients from trying to re-identify the data.
>>> 
>> This third bullet is not possible.  Please understand that de-identified
>> data includes such things as purely anonymous aggregate counts which
>> are then published openly.  It is absurd to suggest that contracts
>> are necessary (or even useful) to manage the output of de-identified
>> data -- any data that is de-identified is no longer in scope as a
>> concern for this standard.
> I think de-identification is incredibly hard. Even data that at first blush you might consider to be totally anonymized could lend itself to re-identification or attribute disclosure attacks. AFAIK there is no formal mechanism for proving an anonymized dataset is impervious to re-identification attacks (short of deletion). Given this dearth of theoretical understanding on this very new issue, we want to give organizations leeway to keep some data if they do a rigorous anonymization job, but we don't want to tie their hands so much that deletion is the only option.
> 
> That leaves us in a place where this aggregated data realistically could be re-identified via some sort of clever attack. For some data sets, for example Google's recently released generalized NSL numbers (https://www.google.com/transparencyreport/userdatarequests/US/), the total number of bits of information is small enough that I think a contract is not necessary. However, I don't think it at all absurd to create this contractual obligation for larger data sets, and think it exists in other contexts such as HIPAA. Perhaps we could add a clause that distinguishes the two situations. Happy to consider this for the next draft.

I think you are assuming too much about the retained de-identified
data being fairly close to the original.  As I described, this
definition includes what we would call purely anonymous data:
large numbers placed into categories without any association to
individual requests.

It is simply not a privacy concern to share anonymous data.  Thus, the
TPWG cannot place arduous requirements on sharing all de-identified
data just because of some risk that there might exist some forms of
de-identified data that are easier to re-identify.

If we need such distinctions, then we will need additional
definitions to describe them.  However, I think pursuing such
ghosts is just a waste of time.  Mistakes will happen and do
not need to be accounted for in *this* standard because they
will be dealt with by regulators if or when such identifiable
data is shared.  We cannot solve all privacy problems here.

....Roy
Received on Thursday, 7 March 2013 02:47:09 UTC