Re: de-identification text for Wednesday's call from Dobbs, Brooks on 2013-04-02 (public-tracking@w3.org from April 2013)

From: Dobbs, Brooks <Brooks.Dobbs@kbmg.com>
Date: Tue, 2 Apr 2013 15:02:52 +0000
To: Shane Wiley <wileys@yahoo-inc.com>, Dan Auerbach <dan@eff.org>, "public-tracking@w3.org" <public-tracking@w3.org>
Message-ID: <CD806732.B23E4%brooks.dobbs@kbmg.com>

Shane,

So in your example, "the data" was not deleted; it was both modified (the hash) and redacted (the "removal" of the IP portion). With as much confusion as we have over language, why not be clear where we can? I had hoped this would be uncontroversial and would add clarity.

The practical point is that if I have the following line in a log file:

58.218.199.250 - - [01/Apr/2013:02:23:22] "GET /sports/braves/index.html"  401 1255 "-" "Mozilla/4.0" "guid=abcdef1234567"

Under your example we could de-identify it by rendering it as:

- - - [01/Apr/2013:02:23:22] "GET /sports/braves/index.html"  401 1255 "-" "Mozilla/4.0" "guid=HASH{abcdef1234567}"

Just for clarity: the remaining data set hasn't been "deleted". Most of it is still there; it has been redacted and modified.
Again, I am not trying to be tricky here. I am just trying to make the language more sensible.
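
To make the redact-and-modify distinction concrete, here is a minimal Python sketch of the transformation on the log line above. It assumes the record starts with an IPv4 address and carries a "guid=..." field; the names SECRET_KEY, keyed_hash and redact_and_modify are hypothetical, and HMAC-SHA256 is only one possible choice, since the "HASH{...}" notation above does not name an algorithm.

```python
import hashlib
import hmac
import re

# Hypothetical secret key; a real deployment would store it apart from the logs.
SECRET_KEY = b"example-secret-key"

# Assumed layout: a leading IPv4 address and an alphanumeric "guid=..." field.
IP_PATTERN = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}")
GUID_PATTERN = re.compile(r"guid=([0-9A-Za-z]+)")


def keyed_hash(value: str) -> str:
    """Return the hex HMAC-SHA256 digest of value under SECRET_KEY."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def redact_and_modify(line: str) -> str:
    """Redact the leading IP address and replace the guid with its keyed hash."""
    # Redaction: the IP field is replaced by the usual "-" placeholder.
    line = IP_PATTERN.sub("-", line, count=1)
    # Modification: the identifier is kept as a field but replaced by its digest.
    return GUID_PATTERN.sub(lambda m: "guid=" + keyed_hash(m.group(1)), line)
```

Every other field (timestamp, path, status, size, user agent) passes through untouched, which is the point: the record still exists after the transformation.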

-Brooks

--

Brooks Dobbs, CIPP | Chief Privacy Officer | KBM Group | Part of the Wunderman Network
(Tel) 678 580 2683 | (Mob) 678 492 1662 | kbmg.com
brooks.dobbs@kbmg.com


From: Shane Wiley <wileys@yahoo-inc.com>
Date: Tuesday, April 2, 2013 10:43 AM
To: Brooks Dobbs <brooks.dobbs@kbmg.com>, Dan Auerbach <dan@eff.org>, "public-tracking@w3.org" <public-tracking@w3.org>
Subject: RE: de-identification text for Wednesday's call

Brooks,

I believe “delete” is meant to be an option in the mix.  For example, I can one-way secret hash an already anonymous cookie ID and delete the IP address and query string in the page URL in a record to move it to a de-identified state.
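
As a rough illustration of the "delete the query string" step, the following Python sketch drops the query string (and fragment) from a page URL while keeping the rest. The function name strip_query and the example URL are hypothetical; only the standard library's urllib.parse is used.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    # Empty strings for the last two components omit "?query" and "#fragment".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical example URL, not taken from any real log.
print(strip_query("http://example.com/sports/braves/index.html?user=jane&q=tickets"))
```

Note that the path itself is retained, so anything identifying that is encoded in the path would survive this step.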

- Shane

From: Dobbs, Brooks [mailto:Brooks.Dobbs@kbmg.com]
Sent: Tuesday, April 02, 2013 7:25 AM
To: Dan Auerbach; public-tracking@w3.org
Subject: Re: de-identification text for Wednesday's call

Perhaps this is pedantic, but does it not make sense to remove the deletion language? If de-identified is a property of something, and something which does not exist cannot have a property, aren't we left with a bit of a tautological problem by defining de-identified data as having been deleted? Do we really need to say that deletion gets you to a safe place? Alternatively, what would someone be doing with deleted data that could put them in noncompliance?

I think the problem is that we never really meant the deletion of a full data "event", but rather partial deletion, that is, deletion of certain elements within an event (e.g. "deletion" of the IP address within a transaction event in a log file). If this is the case, wouldn't it be more accurate to describe this procedure using the term "modified" or "redacted"?

-Brooks


On Apr 2, 2013, at 4:22 AM, "Dan Auerbach" <dan@eff.org> wrote:
Hi everyone,

Given that de-identification is on the agenda for Wednesday, I wanted to send out the current state of the de-identification text. No changes to normative text were made since the ending point of the last email thread. I made some small tweaks in order to tighten up the non-normative language, though nothing has conceptually changed.

We are also putting a pin in the issue of requirements and commitments that a DNT-compliant entity must make with respect to de-identification. I think such a specific commitment is warranted, but we agreed to have that discussion separately.

Thanks again to everyone for the feedback,
Dan

Normative text:

Data can be considered sufficiently de-identified to the extent that it has been deleted, modified, aggregated, anonymized or otherwise manipulated in order to achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular user, user agent, or device.

Non-normative text:

Example 1. In general, using unique or near-unique pseudonymous identifiers to link records of a particular user, user agent, or device within a large data set does NOT provide sufficient de-identification. Even absent obvious identifiers such as names, email addresses, or zip codes, there are many ways to gain information about individuals based on pseudonymous data.

Example 2. In general, keeping only high-level aggregate data across a small number of dimensions, such as the total number of visitors of a website each day broken down by country (discarding data from countries without many visitors), would be considered sufficiently de-identified.

Example 3. Deleting data is always a safe and easy way to achieve de-identification.
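
The aggregation described in Example 2 could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not part of the proposed text: the input is assumed to be (date, country, visitor_id) tuples, the threshold MIN_VISITORS is a hypothetical value, and small cells are dropped per day and country rather than per country overall.

```python
from collections import Counter

# Hypothetical cut-off below which a (date, country) cell is discarded.
MIN_VISITORS = 100


def daily_country_totals(visits, min_visitors=MIN_VISITORS):
    """Count distinct visitors per (date, country) and drop small cells.

    visits is an iterable of (date, country, visitor_id) tuples.
    """
    # De-duplicate so repeat visits by one visitor count once per day and country.
    distinct = {(date, country, visitor_id) for date, country, visitor_id in visits}
    counts = Counter((date, country) for date, country, _ in distinct)
    # Only the aggregate counts are returned; visitor identifiers are not kept.
    return {cell: n for cell, n in counts.items() if n >= min_visitors}
```

The output contains only totals across two dimensions, which is the property Example 2 relies on.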

Remark 1. De-identification is a property of data. If data can be considered de-identified according to the “reasonable level of justified confidence” clause of (1), then no data manipulation process needs to take place in order to satisfy the requirements of (1).

Remark 2. A diversity of techniques is being researched and developed to de-identify data sets [1][2], and companies are encouraged to explore and develop new approaches to fit their needs.

Remark 3. It is a best practice for companies to perform “privacy penetration testing” by having an expert with access to the data attempt to re-identify individuals or disclose attributes about them. The expert need not actually identify or disclose the attribute of an individual, but if the expert demonstrates how this could plausibly be achieved by joining the data set against other public data sets or private data sets accessible to the company, then the data set in question should no longer be considered sufficiently de-identified and changes should be made to provide stronger anonymization for the data set.
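
One simple form of the join described in Remark 3 can be sketched in Python as below. It is only an illustration of a linkage check, not a complete penetration test: the function name unique_matches and the choice of shared fields are hypothetical, and rows are assumed to be dictionaries.

```python
from collections import defaultdict


def unique_matches(released, auxiliary, keys):
    """Return (released_row, auxiliary_row) pairs that link one-to-one.

    A pair is reported when its values for the shared fields in keys occur
    in exactly one row of each data set, which suggests a plausible linkage.
    """
    def index(rows):
        buckets = defaultdict(list)
        for row in rows:
            buckets[tuple(row[k] for k in keys)].append(row)
        return buckets

    rel, aux = index(released), index(auxiliary)
    return [
        (rel[k][0], aux[k][0])
        for k in rel
        if len(rel[k]) == 1 and len(aux.get(k, [])) == 1
    ]
```

Any non-empty result would be a signal, in the sense of Remark 3, that the released data set may need stronger anonymization; an empty result proves nothing on its own, since other fields or other auxiliary data sets might still link.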

[1] https://research.microsoft.com/pubs/116123/dwork_cacm.pdf

[2] http://www.cs.purdue.edu/homes/ninghui/papers/t_closeness_icde07.pdf




--

Dan Auerbach

Staff Technologist

Electronic Frontier Foundation

dan@eff.org

415 436 9333 x134
Received on Tuesday, 2 April 2013 15:03:23 UTC