Re: de-identification text for Wednesday's call

Brooks,

Thanks for the feedback. I'm happy to take your suggestion and sharpen
the language by substituting 'redacted' or 'partially deleted' for
'deleted'.

Shane,

The non-normative examples are important, so unfortunately we may be
disagreeing about the substance of the definition. There is a concrete
question on the table: does log data that has been partially redacted,
but not unlinked in any way, count as de-identified? We could sharpen
the question even further by specifying how much entropy is allowed per
record after it has been scrubbed. Given episodes of re-identification
and attribute disclosure based on far fewer records and far fewer bits
of information per record (e.g., [1], [2]), I think as a group we should
decide that minimally scrubbed log data, in general, does not meet the
bar for de-identification. There are some instances in which you can
keep records linked via pseudonyms (if there are only a few bits of
information per record) and still consider the data de-identified, so I
don't want our definition to rule out those cases, but they will be
cases in which the entropy is low. Generally speaking, aggregation will
be necessary for ordinary log data.
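As a rough illustration of the entropy framing above, one can tally the
bits of information a scrubbed record carries and compare them to the
bits needed to single out one user. The field names, cardinalities, and
population size below are entirely hypothetical, and the independence
assumption is a simplification (correlations between fields only make
re-identification easier):

```python
import math

# Singling out one user among N requires about log2(N) bits; a field
# that takes one of k roughly equally likely values contributes log2(k).
population = 250_000_000                     # hypothetical user population
bits_to_single_out = math.log2(population)   # ~27.9 bits

fields = {
    "coarse geo (one of 50 regions)":  math.log2(50),
    "browser family (one of 8)":       math.log2(8),
    "hour-of-day bucket (one of 24)":  math.log2(24),
}

total_bits = sum(fields.values())
for name, bits in fields.items():
    print(f"{name}: {bits:.1f} bits")
print(f"total: {total_bits:.1f} bits; "
      f"to single out one of {population:,}: {bits_to_single_out:.1f} bits")
```

With only a few low-cardinality fields per record, the total stays well
below the singling-out threshold; add a high-entropy field like a full
URL or timestamp and the budget is blown immediately.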

It's important to keep in mind that de-identification is a carve-out
from the otherwise stronger and clearer policy position the group could
take, namely that data must be deleted. We don't want that, because we
want to preserve aggregate data uses so long as the danger to the
individual user is minuscule. But precisely because it is a carve-out,
the burden of proof should lie with those who want to retain data more
aggressively to demonstrate that doing so is safe. Perhaps we could
(publicly or non-publicly) examine a data set that you believe meets
this bar?

If it would make you more comfortable for the non-normative example to
discuss the amount of entropy per record at length, to clarify that it
*sometimes* might be OK to keep linked data, I'd be happy to do that.

Dan

[1] http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
[2] https://crypto.stanford.edu/~pgolle/papers/census.pdf

On 04/02/2013 08:50 AM, Shane Wiley wrote:
>
> Mike,
>
>  
>
> Thank you for the input, but you miss a key element of the proposal --
> once the one-way hash function has been applied, the data can never
> again be accessed in real time to modify the user's experience.  This
> is where the operational and administrative controls -- both supported
> by technical controls -- come into play.  The end goal is to find the
> point where data still has some value but can no longer be used to
> single out a specific web browser in real time to alter its online
> experience based on historical multi-site (non-affiliated) activity.
>
>  
>
> - Shane
>
>  
>
> *From:*Mike O'Neill [mailto:michael.oneill@baycloud.com]
> *Sent:* Tuesday, April 02, 2013 8:27 AM
> *To:* Shane Wiley
> *Cc:* public-tracking@w3.org
> *Subject:* RE: de-identification text for Wednesday's call
>
>  
>
> Shane,
>
>  
>
> If you mean by "anonymous cookie" a cookie stored in a
> device/UA session containing a unique identifier, then this is not
> anonymous or "pseudonymous". In fact it singles out an individual far
> more exactly than their name does. By definition there is only one
> unique identifier, whereas there can be several individuals pointed to
> by the string "Shane Wiley".
>
>  
>
> If you apply a one-way hash function (or any other one-to-one
> mapping) to a UID, you just get another unique identifier. The next
> time a user visits a page you read the cookie, apply the function, and
> match the resulting bit pattern against the ones in the records you
> already have. The hash operation serves no useful purpose whatsoever.
> If the entropy, or number of bits, were reduced by the function (so it
> becomes a many-to-one mapping), then maybe, but what would be the point?
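Mike's point can be seen in a few lines of Python. The cookie value and
secret below are made up, and SHA-256 stands in for whatever one-way
function an implementer might choose; the essential property is that the
output is still a unique, stable identifier:

```python
import hashlib

SECRET = "hypothetical-server-side-secret"

def pseudonym(cookie_uid: str) -> str:
    """One-way (keyed) hash of a cookie UID -- still a unique identifier."""
    return hashlib.sha256((SECRET + cookie_uid).encode()).hexdigest()

# The same cookie always maps to the same pseudonym, so records remain
# linkable across visits exactly as they were before hashing:
uid = "a1b2c3d4e5f6"                         # made-up cookie value
assert pseudonym(uid) == pseudonym(uid)      # links the same browser
assert pseudonym(uid) != pseudonym("other")  # still singles it out
```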
>
>  
>
> All this underlines the importance that unlinkability (as well as
> de-identification) be absolutely required to take collected/used data
> out of scope.
>
>  
>
> Mike
>
>  
>
>  
>
> *From:*Shane Wiley [mailto:wileys@yahoo-inc.com]
> *Sent:* 02 April 2013 15:43
> *To:* Dobbs, Brooks; Dan Auerbach; public-tracking@w3.org
> <mailto:public-tracking@w3.org>
> *Subject:* RE: de-identification text for Wednesday's call
>
>  
>
> Brooks,
>
>  
>
> I believe "delete" is meant to be an option in the mix.  For example,
> I can one-way secret hash an already anonymous cookie ID and delete
> the IP address and query string in the page URL in a record to move it
> to a de-identified state.
>
>  
>
> - Shane
>
>  
>
> *From:*Dobbs, Brooks [mailto:Brooks.Dobbs@kbmg.com]
> *Sent:* Tuesday, April 02, 2013 7:25 AM
> *To:* Dan Auerbach; public-tracking@w3.org <mailto:public-tracking@w3.org>
> *Subject:* Re: de-identification text for Wednesday's call
>
>  
>
> Perhaps this is pedantic, but does it not make sense to remove the
> deletion language?  If de-identified is a property of something, and
> something which does not exist cannot have a property, aren't we left
> with a bit of a tautological problem by defining de-identified data as
> having been deleted?  Do we really need to say that deletion gets you
> to a safe place?  Alternatively, what would someone be doing with
> deleted data that could put them in noncompliance?
>
>  
>
> I think the problem is that we never really meant that the full
> instance of a data "event" be deleted; rather, we meant partial
> deletion, or deletion of certain elements within an event (e.g.
> "deletion" of the IP address within a transaction event in a log
> file).  If that is the case, wouldn't it be more accurate to describe
> this procedure using the term "modified" or "redacted"?
>
>  
>
> -Brooks
>
> Sent from my iPhone
>
>
> On Apr 2, 2013, at 4:22 AM, "Dan Auerbach" <dan@eff.org
> <mailto:dan@eff.org>> wrote:
>
>     Hi everyone,
>
>     Given that de-identification is on the agenda for Wednesday, I
>     wanted to send out the current state of the de-identification
>     text. No changes to normative text were made since the ending
>     point of the last email thread. I made some small tweaks in order
>     to tighten up the non-normative language, though nothing has
>     conceptually changed.
>
>     We are also putting a pin in the issue of requirements and
>     commitments that a DNT-compliant entity must make with respect to
>     de-identification. I think such a specific commitment is
>     warranted, but we agreed to have that discussion separately.
>
>     Thanks again to everyone for the feedback,
>     Dan
>
>     Normative text:
>
>     Data can be considered sufficiently de-identified to the extent
>     that it has been deleted, modified, aggregated, anonymized or
>     otherwise manipulated in order to achieve a reasonable level of
>     justified confidence that the data cannot reasonably be used to
>     infer information about, or otherwise be linked to, a particular
>     user, user agent, or device.
>
>     Non-normative text:
>
>     Example 1. In general, using unique or near-unique pseudonymous
>     identifiers to link records of a particular user, user agent, or
>     device within a large data set does NOT provide sufficient
>     de-identification. Even absent obvious identifiers such as names,
>     email addresses, or zip codes, there are many ways to gain
>     information about individuals based on pseudonymous data.
>
>     Example 2. In general, keeping only high-level aggregate data
>     across a small number of dimensions, such as the total number of
>     visitors of a website each day broken down by country (discarding
>     data from countries without many visitors), would be considered
>     sufficiently de-identified.
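A minimal sketch of the kind of aggregation Example 2 describes; the
visit counts and the suppression threshold below are invented for
illustration:

```python
from collections import Counter

THRESHOLD = 1000  # hypothetical minimum count below which a country is dropped

# Invented raw visit log: one country code per visit on a given day.
visits = ["US"] * 4200 + ["DE"] * 1500 + ["NZ"] * 12

counts = Counter(visits)
# Keep only high-level aggregates and discard countries with few
# visitors, since small counts can themselves identify individuals.
aggregate = {c: n for c, n in counts.items() if n >= THRESHOLD}
print(aggregate)   # {'US': 4200, 'DE': 1500}
```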
>
>     Example 3. Deleting data is always a safe and easy way to achieve
>     de-identification.
>
>     Remark 1. De-identification is a property of data. If data can be
>     considered de-identified according to the "reasonable level of
>     justified confidence" clause of (1), then no data manipulation
>     process needs to take place in order to satisfy the requirements
>     of (1).
>
>     Remark 2. There is a diversity of techniques being researched and
>     developed to de-identify data sets [1][2], and companies are
>     encouraged to explore new approaches to fit their needs.
>
>     Remark 3. It is a best practice for companies to perform "privacy
>     penetration testing" by having an expert with access to the data
>     attempt to re-identify individuals or disclose attributes about
>     them. The expert need not actually identify or disclose the
>     attribute of an individual, but if the expert demonstrates how
>     this could plausibly be achieved by joining the data set against
>     other public data sets or private data sets accessible to the
>     company, then the data set in question should no longer be
>     considered sufficiently de-identified and changes should be made
>     to provide stronger anonymization for the data set.
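The join-based attack that Remark 3 warns about can be sketched
directly. All records and names below are invented for illustration;
the quasi-identifiers chosen (zip, birth year, sex) follow the classic
re-identification examples in the literature:

```python
# Invented toy data: a "de-identified" record set keyed by pseudonym,
# plus a public data set sharing the quasi-identifiers.
deidentified = [
    {"pseudo": "p1", "zip": "02138", "birth_year": 1945, "sex": "F"},
    {"pseudo": "p2", "zip": "90210", "birth_year": 1980, "sex": "M"},
]
public = [
    {"name": "A. Smith", "zip": "02138", "birth_year": 1945, "sex": "F"},
    {"name": "B. Jones", "zip": "90210", "birth_year": 1980, "sex": "M"},
    {"name": "C. Brown", "zip": "90210", "birth_year": 1980, "sex": "M"},
]

QUASI = ("zip", "birth_year", "sex")

def reidentify(record, public_records):
    """Names in the public set matching a record's quasi-identifiers."""
    key = tuple(record[q] for q in QUASI)
    return [p["name"] for p in public_records
            if tuple(p[q] for q in QUASI) == key]

# p1 matches exactly one public identity -> re-identified.
# p2 matches two identities, so it survives this particular join.
print(reidentify(deidentified[0], public))   # ['A. Smith']
print(reidentify(deidentified[1], public))   # ['B. Jones', 'C. Brown']
```

A unique match is exactly the demonstration the remark describes: the
expert need not confirm the identity, only show that the join plausibly
narrows a record to one person.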
>
>     [1] https://research.microsoft.com/pubs/116123/dwork_cacm.pdf
>
>     [2]
>     http://www.cs.purdue.edu/homes/ninghui/papers/t_closeness_icde07.pdf
>
>      
>
>     -- 
>
>     Dan Auerbach
>
>     Staff Technologist
>
>     Electronic Frontier Foundation
>
>     dan@eff.org <mailto:dan@eff.org>
>
>     415 436 9333 x134
>


-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134

Received on Tuesday, 2 April 2013 17:51:29 UTC