RE: de-identification text for Wednesday's call

Shane,

 

If you apply any kind of one-to-one mapping to a unique bit pattern, you get
another unique bit pattern. For any particular input you get the same
output.  Data records collected over time can be chained together using the
transformed identifier as a key to collect an individual's web history. If
the rest of the data record consists of a URI, you can de-identify that by
removing encoded PII from it (the query parameters may include an email
address, so you could remove that), but presumably the URI still contains
enough information to extract profiling variables, e.g. that this is a car
dealer's website and a page selling a particular type of car.
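
To make the chaining point concrete, here is a minimal Python sketch. The log
records, the cookie identifiers and the choice of SHA-256 are all assumptions
made for illustration; any deterministic one-to-one mapping would behave the
same way. The query string is discarded, yet the host and path that remain
still describe what the user looked at.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def transform_id(cookie_id: str) -> str:
    """Stand-in for any deterministic one-to-one mapping (SHA-256 is an assumption)."""
    return hashlib.sha256(cookie_id.encode("utf-8")).hexdigest()


# Hypothetical log records: (cookie identifier, requested URI).
raw_log = [
    ("uid-1001", "https://cars.example/used/estate?email=a%40example.org"),
    ("uid-2002", "https://news.example/politics"),
    ("uid-1001", "https://cars.example/finance/apply?ref=42"),
]

# Chain records together using the transformed identifier as the key.
history = defaultdict(list)
for cookie_id, uri in raw_log:
    parts = urlsplit(uri)
    # The query string (possible PII) is dropped; host and path are kept.
    history[transform_id(cookie_id)].append((parts.netloc, parts.path))

for key, pages in history.items():
    print(key[:12], pages)
```

The two visits by the same hypothetical browser end up under one key even
though the original cookie value never appears in the output.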

 

I cannot see how one can guarantee that this is never used to modify a user's
experience. Their web history is being collected and they are still being
tracked. If they are not, what is the point of retaining the cookie identifier?

 

A request to a resource with DNT set means the user does not want to be
tracked. Any identifier that could be used to track them should be kept only
for as long as a permitted use requires, if any permitted use applies.

 

Mike

 

 

 

From: Shane Wiley [mailto:wileys@yahoo-inc.com] 
Sent: 02 April 2013 16:51
To: Mike O'Neill
Cc: public-tracking@w3.org
Subject: RE: de-identification text for Wednesday's call

 

Mike,

 

Thank you for the input but you miss a key element of the proposal - once
the one-way hash function has been applied the data is never again able to
be accessed in real-time to modify the user's experience.  This is where the
operational and administrative controls - both supported through tech
controls - come into play.  The end goal is that we find the point where
data still has some value but can no longer be used to single out a specific
web browser in real-time to alter their online experience with historical
multi-site (non-affiliated) activity.

 

- Shane

 

From: Mike O'Neill [mailto:michael.oneill@baycloud.com] 
Sent: Tuesday, April 02, 2013 8:27 AM
To: Shane Wiley
Cc: public-tracking@w3.org
Subject: RE: de-identification text for Wednesday's call

 

Shane,

 

If by "anonymous cookie" you mean a cookie stored in a device/UA session
containing a unique identifier, then this is not anonymous or "pseudonymous". In
fact it singles out an individual far more exactly than their name. By
definition there is only one unique identifier, whereas there can be several
individuals pointed to by the string "Shane Wiley".

 

If you apply a one-way hash function (or any unique one-to-one mapping) to a
UID you just get another unique identifier. Next time a user visits a page
you decode the cookie, apply the function, and match the resultant bit
pattern to the ones in records you already have. The hash operation serves
no useful purpose whatsoever. If the entropy, or number of bits, were
reduced by the function (it becomes a many-to-one mapping) then maybe, but
what would be the point?
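
A rough Python sketch of the difference, under stated assumptions: the
identifiers are made up, SHA-256 stands in for whatever hash is used, and
keeping only the top 4 bits is just one arbitrary way to reduce entropy so
that many inputs share each output.

```python
import hashlib


def full_hash(uid: str) -> str:
    """Full-width hash: in practice one distinct output per distinct input."""
    return hashlib.sha256(uid.encode("utf-8")).hexdigest()


def bucket(uid: str, bits: int) -> int:
    """Keep only the top `bits` bits of the hash, so many UIDs share a value."""
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


uids = [f"uid-{n}" for n in range(10_000)]  # hypothetical identifiers

distinct_full = len({full_hash(u) for u in uids})
distinct_4bit = len({bucket(u, 4) for u in uids})

print(distinct_full)  # as many distinct keys as there are users
print(distinct_4bit)  # at most 16 values, each shared by many users
```

With the full hash every user still has a private key into the records; with
the truncated value a record can only be attributed to a large group.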

 

All this underlines how important it is that unlinkability (as well as
de-identification) be absolutely required before collected/used data is taken
out of scope.

 

Mike

 

 

From: Shane Wiley [mailto:wileys@yahoo-inc.com] 
Sent: 02 April 2013 15:43
To: Dobbs, Brooks; Dan Auerbach; public-tracking@w3.org
Subject: RE: de-identification text for Wednesday's call

 

Brooks,

 

I believe "delete" is meant to be an option in the mix.  For example, I can
one-way secret hash an already anonymous cookie ID and delete the IP address
and query string in the page URL in a record to move it to a de-identified
state.
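
For readers following along, the steps described above might look roughly
like this in Python. This is only a sketch: the record field names, the key
value and the reading of "one-way secret hash" as HMAC-SHA-256 are all
assumptions, not a description of any actual production system.

```python
import hashlib
import hmac
from urllib.parse import urlsplit, urlunsplit

SECRET_KEY = b"hypothetical-secret"  # assumption: a key held under access controls


def deidentify(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Hash the cookie ID with a secret key, drop the IP, strip the query string."""
    parts = urlsplit(record["url"])
    hashed_id = hmac.new(key, record["cookie_id"].encode("utf-8"), hashlib.sha256)
    return {
        "id": hashed_id.hexdigest(),
        "url": urlunsplit((parts.scheme, parts.netloc, parts.path, "", "")),
        "timestamp": record["timestamp"],
        # The "ip" field is deliberately not copied into the output.
    }


# Hypothetical input record.
record = {
    "cookie_id": "uid-1001",
    "ip": "192.0.2.17",
    "url": "https://cars.example/used/estate?email=a%40example.org",
    "timestamp": "2013-04-02T15:43:00Z",
}
print(deidentify(record))
```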

 

- Shane

 

From: Dobbs, Brooks [mailto:Brooks.Dobbs@kbmg.com] 
Sent: Tuesday, April 02, 2013 7:25 AM
To: Dan Auerbach; public-tracking@w3.org
Subject: Re: de-identification text for Wednesday's call

 

Perhaps this is pedantic but does it not make sense to remove the deletion
language?  If de-identified is a property of something, and something which
does not exist cannot have a property, aren't we left with a bit of a
tautological problem by defining de-identified data as having been deleted?
Do we really need to say deleted gets you to a safe place?  Alternatively,
what would someone be doing with deleted data that could put them in
noncompliance?

 

I think the problem is that we never really meant the full instance of a
data "event" being deleted but rather we really meant partial deletion or
deletion of certain elements within an event (e.g. "deletion" of the IP
address within a transaction event in a log file).  If this is the case
wouldn't we be more accurate to describe this procedure using the term
modified or redacted?

 

-Brooks

Sent from my iPhone


On Apr 2, 2013, at 4:22 AM, "Dan Auerbach" <dan@eff.org> wrote:

Hi everyone,

Given that de-identification is on the agenda for Wednesday, I wanted to
send out the current state of the de-identification text. No changes to
normative text were made since the ending point of the last email thread. I
made some small tweaks in order to tighten up the non-normative language,
though nothing has conceptually changed.

We are also putting a pin in the issue of requirements and commitments that
a DNT-compliant entity must make with respect to de-identification. I think
such a specific commitment is warranted, but we agreed to have that
discussion separately.

Thanks again to everyone for the feedback,
Dan

Normative text:

Data can be considered sufficiently de-identified to the extent that it has
been deleted, modified, aggregated, anonymized or otherwise manipulated in
order to achieve a reasonable level of justified confidence that the data
cannot reasonably be used to infer information about, or otherwise be linked
to, a particular user, user agent, or device.

Non-normative text:

Example 1. In general, using unique or near-unique pseudonymous identifiers
to link records of a particular user, user agent, or device within a large
data set does NOT provide sufficient de-identification. Even absent obvious
identifiers such as names, email addresses, or zip codes, there are many
ways to gain information about individuals based on pseudonymous data.

Example 2. In general, keeping only high-level aggregate data across a small
number of dimensions, such as the total number of visitors of a website each
day broken down by country (discarding data from countries without many
visitors), would be considered sufficiently de-identified.
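
As an illustration of Example 2, a minimal Python sketch of such an
aggregation follows. The input layout (one (date, country) pair per visitor)
and the threshold of 3 are assumptions chosen for the example; a real
deployment would need a much higher cut-off.

```python
from collections import Counter

MIN_VISITORS = 3  # hypothetical threshold, far too low for real use


def daily_country_counts(visits, min_visitors=MIN_VISITORS):
    """Count visitors per (date, country) and discard cells below the threshold.

    `visits` is an iterable of (date, country) pairs, one per visitor.
    """
    counts = Counter(visits)
    return {cell: n for cell, n in counts.items() if n >= min_visitors}


# Hypothetical data: five US visitors, three GB visitors, one IS visitor.
visits = (
    [("2013-04-02", "US")] * 5
    + [("2013-04-02", "GB")] * 3
    + [("2013-04-02", "IS")] * 1
)
report = daily_country_counts(visits)
print(report)
```

The single visitor from the low-traffic country is dropped from the report
rather than published as a count of one.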

Example 3. Deleting data is always a safe and easy way to achieve
de-identification.

Remark 1. De-identification is a property of data. If data can be considered
de-identified according to the "reasonable level of justified confidence"
clause of (1), then no data manipulation process needs to take place in
order to satisfy the requirements of (1).

Remark 2. A diversity of techniques is being researched and developed to
de-identify data sets [1][2], and companies are encouraged to explore and
develop new approaches to fit their needs.

Remark 3. It is a best practice for companies to perform "privacy
penetration testing" by having an expert with access to the data attempt to
re-identify individuals or disclose attributes about them. The expert need
not actually identify or disclose the attribute of an individual, but if the
expert demonstrates how this could plausibly be achieved by joining the data
set against other public data sets or private data sets accessible to the
company, then the data set in question should no longer be considered
sufficiently de-identified and changes should be made to provide stronger
anonymization for the data set.
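
One simple form of the test in Remark 3 can be sketched in Python as a join on
shared attributes. The column names, the sample rows and the rule "exactly one
match on each side" are all assumptions for illustration; a real expert would
use far richer auxiliary data and fuzzier matching.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("zip", "birth_year", "browser")  # hypothetical columns


def unique_links(released, auxiliary, keys=QUASI_IDENTIFIERS):
    """Return (released_row, auxiliary_row) pairs whose quasi-identifier
    combination matches exactly one row in each data set."""

    def index(rows):
        idx = defaultdict(list)
        for row in rows:
            idx[tuple(row[k] for k in keys)].append(row)
        return idx

    rel, aux = index(released), index(auxiliary)
    return [
        (rel[k][0], aux[k][0])
        for k in rel
        if len(rel[k]) == 1 and len(aux.get(k, [])) == 1
    ]


# Hypothetical "de-identified" data set and a hypothetical auxiliary data set.
released = [
    {"zip": "94110", "birth_year": 1980, "browser": "Firefox", "interest": "cars"},
    {"zip": "94110", "birth_year": 1975, "browser": "Chrome", "interest": "news"},
    {"zip": "94110", "birth_year": 1975, "browser": "Chrome", "interest": "sport"},
]
auxiliary = [
    {"zip": "94110", "birth_year": 1980, "browser": "Firefox", "name": "A. Example"},
    {"zip": "94110", "birth_year": 1975, "browser": "Chrome", "name": "B. Example"},
]

links = unique_links(released, auxiliary)
for rel_row, aux_row in links:
    print(aux_row["name"], "->", rel_row["interest"])
```

Any pair returned is a plausible re-identification path of the kind the
remark says should disqualify the data set.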

[1] https://research.microsoft.com/pubs/116123/dwork_cacm.pdf

[2] http://www.cs.purdue.edu/homes/ninghui/papers/t_closeness_icde07.pdf

 

-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134

Received on Tuesday, 2 April 2013 16:57:23 UTC