Re: ACTION-371: text defining de-identified data from Dan Auerbach on 2013-03-11 (public-tracking@w3.org from March 2013)

From: Dan Auerbach <dan@eff.org>
Date: Mon, 11 Mar 2013 14:43:54 -0700
To: public-tracking@w3.org
Message-ID: <513E501A.7090401@eff.org>

I've responded inline to a couple of comments. For now I've put a pin in
the issue of operational requirements on companies with respect to
de-identified data (shall I create a new issue for this?). Here is the
second pass at the normative text for de-identification language,
incorporating suggestions:

Normative text:

Data can be considered sufficiently de-identified to the extent that it
has been deleted, modified, aggregated, anonymized or otherwise
manipulated in order to achieve a reasonable level of justified
confidence that the data cannot reasonably be used to infer information
about, or otherwise be linked to, a particular user, user agent, or device.




On 03/06/2013 06:46 PM, Roy T. Fielding wrote:
> On Mar 6, 2013, at 4:43 PM, Dan Auerbach wrote:
>> On 03/06/2013 02:10 PM, Roy T. Fielding wrote:
>>> What we care about preventing is the link to a particular user.
>>> Including all of this other verbiage is just losing the point of
>>> the definition and interfering with established best practice
>>> with anonymous data.
>> I am happy to remove "scrubs", and also following Rob and Shane's
>> suggestion will remove the word "any" to match the FTC definition.
>> However, I disagree about "used to infer information about". I
>> consider this a strength of the FTC definition. As Ed has rightly
>> pointed out at the DC workshop on unlinkability, users care about
>> attribute disclosure, not only re-identification. For example, if I
>> know that at least one of ten requests that I see to my web server
>> for the URL https://example.com/embarrassing had to come from you,
>> but I do not know which one belongs to you, then I have not
>> re-identified you from my data set, but I know that you visited
>> https://example.com/embarrassing.
>
> If I know that a particular user visited a particular site,
> then that data can be associated with a user and is not de-identified.
> To be clear, I am not talking about specific data records -- data
> includes any information obtainable from the retained bits.
>
> If I were to split the data into two sets, with one being
> de-identified for sharing with others and the other set still
> retaining associations with particular users, then I still
> haven't de-identified the data in terms of my own retention.
>
> However, if I were to share only the de-identified set with
> some other party and keep the rest confidential, then the fact
> that it might be re-identified by me does not change the
> essential characteristics of the data shared: I am allowed
> to share it because the de-identified subset contains no
> personal data. The other party is allowed to retain it because
> they are unable to re-identify that data.
>
> Likewise, if I retain the identifiable subset only for six weeks
> and then delete it (retaining only knowledge of the de-identified
> data subset), then I have satisfied the requirement that such data
> be de-identified if I want to retain it beyond six weeks.

I agree with all of this and it is a good example. However, it is
important to guard against sloppy data releases that can be joined with
public or otherwise available data sets in order to easily re-identify
individuals.

>
> While I greatly respect the FTC's technical competence in this
> area, they aren't exactly concerned with overreaching in their
> definitions -- it is only guidance and they have nice humans in
> the loop to clarify based on the intent of actual cases.
> I, on the other hand, need to know whether it is allowable to
> state "10,000 visitors on my site are from Chicago" without
> being concerned about being sued just because that information
> is about users and was derived from log files that at some
> time in the past contained data that could be associated with
> a particular user.

There are obvious cases that would be acceptably de-identified (10K
visitors from Chicago), cases that would be unacceptably de-identified
(full browsing history associated with hashed cookies), along with more gray
area cases. I hope that the non-normative text can give guidance about
this, given incomplete theoretical understanding of what separates these
various scenarios. I don't see any reason to believe that the obviously
OK cases will be the norm in practice.
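
As a rough illustration of why the hashed-cookie case is unacceptable,
consider the Python sketch below. The log entries, cookie values, and the
helper `pseudonymize` are hypothetical. The sketch assumes cookies are
replaced by a truncated SHA-256 digest, which is only one of many ways a
company might hash them.

```python
import hashlib
from collections import defaultdict


def pseudonymize(cookie_id: str) -> str:
    """Replace a cookie with a truncated SHA-256 digest (assumed scheme)."""
    return hashlib.sha256(cookie_id.encode("utf-8")).hexdigest()[:12]


# Hypothetical request log: (cookie, URL requested).
log = [
    ("cookie-abc", "https://example.com/login?user=alice"),
    ("cookie-xyz", "https://example.com/news"),
    ("cookie-abc", "https://example.com/embarrassing"),
]

# Hashing is deterministic, so every request from one cookie still ends
# up under the same pseudonym and the full history stays linked.
histories = defaultdict(list)
for cookie_id, url in log:
    histories[pseudonymize(cookie_id)].append(url)

linked = [urls for urls in histories.values() if len(urls) > 1]
print(linked)
```

The pseudonym hides the raw cookie value, but the linked history still
ties an identifying URL to the embarrassing one, so the data can be used
to infer information about a particular user.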


>
>>>> 2.
>>>>
>>>>     publicly commits not to try to re-identify the data, except in
>>>>     order to test the soundness of the de-identified data; and
>>>>
>>>
>>> This is not part of the definition.  We might add such a requirement
>>> on processors, but it doesn't belong as the meaning of the term.
>>
>> Rob S also brought this up and I'm happy to remove it here, provided
>> we include such a requirement elsewhere to match the FTC language as
>> closely as possible.
>>
>>>
>>>> 3.
>>>>
>>>>     contractually prohibits downstream recipients from trying to
>>>>     re-identify the data.
>>>>
>>>>
>>> This third bullet is not possible.  Please understand that de-identified
>>> data includes such things as purely anonymous aggregate counts which
>>> are then published openly.  It is absurd to suggest that contracts
>>> are necessary (or even useful) to manage the output of deidentified
>>> data -- any data that is de-identified is no longer in scope as a
>>> concern for this standard.
>> I think de-identification is incredibly hard. Even data that at first
>> blush you might consider to be totally anonymized could lend itself
>> to re-identification or attribute disclosure attacks. AFAIK there is
>> no formal mechanism for proving an anonymized dataset is impervious
>> to re-identification attacks (short of deletion). Given this dearth of
>> theoretical understanding on this very new issue, we want to give
>> organizations leeway to keep some data if they do a rigorous
>> anonymization job, but we don't want to tie their hands so much that
>> deletion is the only option.
>>
>> That leaves us in a place where this aggregated data realistically
>> could be re-identified via some sort of clever attack. For some data
>> sets, for example Google's recently released generalized NSL numbers
>> (https://www.google.com/transparencyreport/userdatarequests/US/), the
>> total number of bits of information is small enough that I think a
>> contract is not necessary. However, I don't think it at all absurd to
>> create this contractual obligation for larger data sets, and think it
>> exists in other contexts such as HIPAA. Perhaps we could add a clause
>> that distinguishes the two situations. Happy to consider this for the
>> next draft.
>
> I think you are assuming too much about the retained de-identified
> data being fairly close to the original.  As I described, this
> definition includes what we would call purely anonymous data:
> large numbers placed into categories without any association to
> individual requests.
>
> It is simply not a privacy concern to share anonymous data.

What we are discussing is what makes data anonymous. In an ideal world,
there would be a clean separation of what you are calling "purely
anonymous" data and messy real-world data, but unfortunately I don't
think that's the case. Bucketing data with broad enough granularity
passes the "obviously anonymous" test, but if we extend the idea of
bucketing to a natural formalism -- k-anonymity -- we find something
that may work in many scenarios, but still is subject to attacks (e.g.
http://dl.acm.org/citation.cfm?id=1217302). I don't think trying to understand
this in detail is a purely academic concern. You may want to release not
just the fact that 10K users visited from Chicago, but more bucketed
analytics information as well. It may not be at all close to the
original data set, but bucketing by 3 or 4 dimensions, for example,
could easily be quite problematic (the anonymity set of (Chicago, Bing,
Silverlight) could be 1, for example). Without getting too deep into a
hypothetical example, I'll just say that k-anonymity is better than ad
hoc bucketing, and should be used as a minimum bar whenever you want to
release bucketed data, but it also can be problematic, as Ed talked
about at the f2f.
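
The bucketing concern can be sketched in a few lines of Python. The
records, dimension names, and the helpers `anonymity_sets` and
`violates_k` are all made up for illustration, and the threshold k=3 is
an arbitrary assumption. The sketch counts how many records share each
combination of bucketed values and reports the combinations whose
anonymity set is smaller than k.

```python
from collections import Counter


def anonymity_sets(rows, dimensions):
    """Count how many rows share each combination of bucketed values."""
    return Counter(tuple(row[d] for d in dimensions) for row in rows)


def violates_k(rows, dimensions, k):
    """Return the buckets whose anonymity set is smaller than k."""
    sets = anonymity_sets(rows, dimensions)
    return sorted(bucket for bucket, size in sets.items() if size < k)


# Hypothetical bucketed analytics records for ten visits.
visits = (
    [{"city": "Chicago", "search": "Google", "plugin": "Flash"}] * 6
    + [{"city": "Chicago", "search": "Bing", "plugin": "Flash"}] * 3
    + [{"city": "Chicago", "search": "Bing", "plugin": "Silverlight"}] * 1
)

one_dim = violates_k(visits, ("city",), k=3)
three_dim = violates_k(visits, ("city", "search", "plugin"), k=3)
print(one_dim)
print(three_dim)
```

Bucketing by city alone leaves every visitor in a set of ten, while
bucketing by all three dimensions leaves the (Chicago, Bing, Silverlight)
visitor in a set of one. Passing this check is only a minimum bar: a
bucket of k records that all share the same sensitive value still
discloses that value for everyone in the bucket.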

>  Thus, the
> TPWG cannot place arduous requirements on sharing all de-identified
> data just because of some risk that there might exist some forms of
> de-identified data that is easier to re-identify.
>
> If we need such distinctions, then we will need additional
> definitions to describe them.  However, I think pursuing such
> ghosts is just a waste of time.  Mistakes will happen and do
> not need to be accounted for in *this* standard because they
> will be dealt with by regulators if or when such identifiable
> data is shared.  We cannot solve all privacy problems here.

I agree that we don't want to burden companies for all aggregate data
releases. But having the standard require some amount of caution in
releasing data seems appropriate to me, though I feel less strongly
about this issue than the idea that a *real effort* must be made to
de-identify data in the first place.

If we require elsewhere in the document a public commitment from a
company not to re-identify data, perhaps we can add language suggesting
that if a (de-identified) data set is large enough to seem reasonably at
risk of re-identification, then [insert FTC language]. If not normative
text, I'd favor at least saying that it is a best practice to be
cautious about large data releases where there is a chance of
re-identification. I think that this is important since the W3C standard
will be used to guide regulators like the FTC.

>
> ....Roy
>


-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
Received on Monday, 11 March 2013 21:44:28 UTC