RE: Deidentification (ISSUE-188) from Mitchell Eisenberg on 2014-07-23 (public-tracking@w3.org from July 2014)

From: Mitchell Eisenberg <meisenberg@pulsepoint.com>
Date: Wed, 23 Jul 2014 13:40:57 +0000
To: "public-tracking@w3.org" <public-tracking@w3.org>
Message-ID: <580D4B426C58974F959B3205835C4E42291C8976@NY-EXCHANGE02.pulse.corp>
How do I remove an email from this list?

From: Justin Brookman [mailto:jbrookman@cdt.org]
Sent: Wednesday, July 23, 2014 9:25 AM
To: Roy T. Fielding
Cc: public-tracking@w3.org List; David Singer
Subject: Re: Deidentification (ISSUE-188)

Different questions to Roy and David about their proposals:

Roy, on the call last week, you said that if data can be tied to a user agent or device, then it wasn't deidentified.  Nick proposed adding ", user agent, or device" to the end of your definition to make that clear.  So it would read:

A data set is considered de-identified when there exists a reasonable level of justified confidence that the data within it cannot be used to infer information about, or otherwise be linked to, a particular user, user agent, or device.

However, from the minutes, at some point you rejected some amendment - not sure if it's this one or not.

David (and others), Roy essentially removed the second and third clauses from your definitions.  Not entirely clear on why the second was removed, but for the third, he points out that it doesn't work for data that you make publicly available.  That is, if you report some sort of aggregate statistic (e.g., 28% of the browsers I saw today were Firefox), you can't realistically make everyone promise not to try to reverse engineer that data before publishing.

Do you want to argue for retaining the contractual prohibition language, or are you willing to merge your proposal with Roy's?  Or perhaps Roy would be amenable to retaining the second clause - that the company itself should commit not to try to reidentify the data.

On Jul 23, 2014, at 12:11 AM, Roy T. Fielding <fielding@gbiv.com> wrote:


On Jul 19, 2014, at 7:23 AM, TOUBIANA Vincent wrote:


With the property on linkability, in an anonymized dataset you should not be able to link two transactions from the same device and you should not be able to link a transaction to another dataset.
No.  That is simply wrong.  All session-based interactions with users depend on the linking of multiple interactions over time, each of which must remain linked in the dataset if the site is going to make any meaningful use of them.  Linking data records doesn't have anything to do with privacy or EU data protection.


If data records can be linked then there is a significant chance we're not talking about anonymized data.

We aren't talking about a definition of significant chance.
De-identified is a state of being -- either it is or it isn't.

Linking many records together with a transaction-id, for example, has
nothing to do with whether the records can be linked to a user.
They might just be all the records related to a product SKU.
We can't say that a data set isn't de-identified just because there
exists some common field among the records.

If the records cannot be linked to a user, they do not represent a
privacy risk.  If any of the same-transaction-id records can be linked
to the user, then all of them can and the data is not de-identified.
The number of records simply doesn't matter.  What matters is whether at
least one of them (or some correlation of them) can be linked to a
particular user.
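
As an illustrative sketch only (not from the thread): the point that one
linkable record exposes every record sharing its transaction-id can be
written out as below. The field names `txn_id`, `sku`, and `account`, and
the predicate `has_account`, are hypothetical stand-ins for whatever would
make a single record linkable to a user.

```python
def tainted_transactions(records, is_linkable):
    """Return the transaction ids that have at least one linkable record."""
    return {r["txn_id"] for r in records if is_linkable(r)}


def linkable_records(records, is_linkable):
    """Return every record that shares a transaction id with a linkable one."""
    tainted = tainted_transactions(records, is_linkable)
    return [r for r in records if r["txn_id"] in tainted]


def has_account(record):
    # Hypothetical test: a record is linkable if it carries an account name.
    return record.get("account") is not None


records = [
    {"txn_id": "t1", "sku": "A-100", "account": None},
    {"txn_id": "t1", "sku": "A-200", "account": None},
    {"txn_id": "t2", "sku": "A-100", "account": "alice"},
    {"txn_id": "t2", "sku": "B-300", "account": None},
]

# Both t2 records are exposed, although only one names the account.
# The t1 records share a common field too, but none of them is linkable.
exposed = linkable_records(records, has_account)
```

Under these assumptions a common field is harmless on its own; the data set
stops being de-identified only when some record in the group can be tied to
a user.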


That does not, in any way, imply that the data set remains linked to the user, which is what linkability means to data protection. (Linking to the user's device is just an indirect linking to the user).

With respect to anonymization techniques, this is the most suitable definition. We had a similar discussion about the definition of unlinkability a while ago, I suggested using the terminology of ISO/IEC 15408-2 (http://lists.w3.org/Archives/Public/public-tracking/2012Nov/0255.html).


The Article 29 definition of linkability is simply wrong: it seems to be entirely misinformed about what that term means with regard to data protection. Maybe that's why we are calling it de-identification instead.

Again with respect to anonymization this is the definition provided by ISO/IEC 15408-2.  If you have a different definition which is more widely used, please provide a link to it.

We aren't defining anonymization techniques (a process).  As I noted,
the definition you provide does not address whether the data remains linked
to a user, which is the only thing that matters for our use of the term.
Whatever the ISO/IEC definition provides, it isn't how we are using
that term in our discussions.


You are trying to prevent identifying a user via data correlation and have construed the definition as if that is all that matters. As a side-effect, you are preventing normal operation of a site in terms of evaluating which user agent software doesn't work well, or where to place UI elements in a window, or what sets of content lead to a conversion (as opposed to boredom or leaving the site).

I don't see how this would happen in a *third party* context. In a first party context this would be allowed, so I'm not sure it would prevent any normal operation.

We are talking about the definition for a term.  There is no context.
There is no party.  Those other terms only matter when we talk about
requirements associated with data collected in a specific network
interaction.


My proposed text says that the dataset is de-identified when it cannot be used to identify a particular user.  How it might be used to do so is irrelevant -- any mechanism applies, including data correlation.

How do you practically evaluate that?

The same way that all privacy researchers do: I take the data set and
try to find someone.  If I succeed, the data is clearly not de-identified.
Otherwise, I look at the data for statistical patterns and try again.
But this assumes I don't already know.  There are ways to progressively
reduce an identified data set such that there are no useful correlations,
assuming the records start with some form of user id (to be sure that
everything that might be unique to a small set of users is removed).
Those are anonymization techniques.  But that doesn't make a process
of anonymization suitable as a definition for the desired end-result.
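
As an illustrative sketch only (not from the thread): one simple way to
"look at the data for statistical patterns" is to flag records whose
combination of field values is shared by fewer than k records, since those
are the easiest to single out. The function name `rare_combinations`, the
field names, and the threshold `k` are all assumptions made for the
example. This is one screening heuristic, not the test procedure described
above, and passing it does not show that a data set is de-identified.

```python
from collections import Counter


def rare_combinations(records, fields, k=2):
    """Return records whose values for `fields` occur in fewer than k records."""

    def key(record):
        return tuple(record.get(f) for f in fields)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < k]


records = [
    {"browser": "Firefox", "region": "NY"},
    {"browser": "Firefox", "region": "NY"},
    {"browser": "Chrome", "region": "NY"},
    {"browser": "Chrome", "region": "NY"},
    {"browser": "Lynx", "region": "NY"},
]

# The single Lynx record is unique on (browser, region) and so stands out.
at_risk = rare_combinations(records, ["browser", "region"])
```

Dropping or coarsening the field that makes a record unique, then running
the check again, is the kind of progressive reduction an anonymization
process would perform; the definition only describes the state to be reached.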

Our requirements don't need to specify how the data becomes
de-identified.  The party retaining the data is on the hook based on
their statements of compliance, which is sufficient to hold
the occasional (inevitable) failures to account.


You cannot provide a method to de-identify data with a high level of confidence. It is not enforceable. This definition does not provide any guarantee to the user that, in practice, he cannot be identified; he has to rely on the statement. From a data-controller point of view, you would have to constantly re-evaluate whether the dataset is de-identified.

None of that has anything to do with the definition of de-identified.
You are saying it is difficult to get there.  I agree.  That the user
doesn't have any guarantees.  Again, I agree.  And, yes, the data
controller is already responsible for their own statements.  What we are
using the definition for is a machine recognizable statement, not a guarantee.
That doesn't change the definition of what state the data needs to be in
for the statement to be true.

Having a definition of where we want the data to be doesn't prevent
other folks from doing research and defining the best techniques
on how to get there.

.....Roy
Received on Friday, 25 July 2014 11:35:14 UTC