RE: ACTION-371: text defining de-identified data from Shane Wiley on 2013-03-15 (public-tracking@w3.org from March 2013)

From: Shane Wiley <wileys@yahoo-inc.com>
Date: Fri, 15 Mar 2013 18:15:46 +0000
To: Rob van Eijk <rob@blaeu.com>, Dan Auerbach <dan@eff.org>, "public-tracking@w3.org" <public-tracking@w3.org>
Message-ID: <DCCF036E573F0142BD90964789F720E313667093@GQ1-EX10-MB03.y.corp.yahoo.com>
Rob,

“no wiggle-room” – this is my core concern with some of this direction.  The current definition relies on terms such as “reasonable” (which matches up well with the EU concept of “likely reasonable”).  Much like HIPAA, this gives us a risk-based model for de-identification management.  If an organization states that it is W3C DNT compliant and articulates its de-identification process, I believe it’s important to provide “wiggle-room” for organizations to implement de-identification in a manner they see as appropriate to their particular business model, technical tools, and administrative and operational processes.  The important outcome is that information that has been de-identified does not later become identified.  If an organization is willing to make that public claim and later proves unable to follow through on its commitment, local legal remedies will take over from there.

As I stated in Berlin, I believe notions of red, yellow, and green are problematic as they bring a judgmental lens to these states (red = danger, yellow = caution).  I agree with Dan that there should only be two states: raw and de-identified.

- Shane

From: Rob van Eijk [mailto:rob@blaeu.com]
Sent: Friday, March 15, 2013 10:47 AM
To: Dan Auerbach; public-tracking@w3.org; Shane Wiley
Subject: Re: ACTION-371: text defining de-identified data


Dan,

Thanks for the thoughtful reply.
I understand now that we are on the same page.

But I doubt that Shane is on that same page as well. If I understand Shane's position correctly, his view on de-identified does not come close to the green as I would like it to be. I just want to be absolutely sure that there is no wiggle-room in what it means to reach de-identified.

@Shane: what is your view, taking into account the reply from Dan?

Rob


Dan Auerbach <dan@eff.org> wrote:

My view is that we do NOT need to define a third state of data. We have
green and red now. If a compelling argument is made that an orange state
is needed, we can revisit, but I think that existing permitted uses plus
having a small time frame for processing raw event data are strong
enough protections to not warrant this third state. Second, regarding
nomenclature, the FTC definition actually defines unlinkability in terms
of de-identification, so I think it would be very confusing to stray too
far from that definitional framework.

A couple further replies inline:

On 03/14/2013 04:09 AM, Justin Brookman wrote:

OK, but as I said before, the standard does not currently envision
three states of data.  As written, all data pertaining to a network
communication is in scope, unless it is deidentified,* in which case
it is out of scope.  You need to propose a third consequence for a new
class of data for this to have effect.

* Noting that there is still ongoing discussion about what
"deidentified" actually means, as evidenced by the recent emails from
Ed, Shane, and Dan.

Justin Brookman
Director, Consumer Privacy
Center for Democracy & Technology
tel 202.407.8812
justin@cdt.org
http://www.cdt.org

@JustinBrookman
@CenDemTech

On 3/14/2013 5:39 AM, Rob van Eijk wrote:


In Boston Shane and I discussed the process of de-identification by
applying it to my mental model (red, orange and green data). Red data
is raw event level data (eg log files with unique identifiers),
orange is still linkable but de-identified data, green is unlinkable
and therefore anonymous data.

We agreed that in order to move from red to orange, or from orange to
green, one needs to pass the barriers by processing. As seen in the
de-identification workshop there are multiple ways to do that. I
illustrated 2 alternative practices:

1. One example is based on concatenating a random number to the
unique ID. This results in a lookup table of unique ID <-> random
number.
Getting from orange to green is breaking the link (un-linkability) by
throwing away the unique ID. No new red data can be linked to the
un-linkable data in the green.

I think the trouble with this model is the assumption that the unique ID
will be the only means of identifying someone. If you'll allow me to
stick with the conceptual framework of a table for simplicity (think
mysql table or bigtable), I think we should get away from the mentality
that there are "identifiers" -- fields like udids, cookies, IPs, phone
numbers etc. Instead, it is more accurate to say that *every* field of a
data set provides some bits of identifying information.

An "orange" data set as you describe might still be super identifying,
if, for example, it is a wide table with lots of fields. As a concrete
example, URLs can be very identifying in some cases, as can timestamps.
Even data that you describe as "green" could still be identifying, if I
understand you correctly. In many instances, having events linked by a
random irreversible identifier (e.g. discarded salt) is simply not
enough to ensure that information can't be reasonably obtained about
users. In some cases it might be, but it depends a lot on the nature of
the rest of the data in the table.
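
To illustrate the point above that every field contributes identifying bits, here is a minimal Python sketch. The table, its field names, and the helper `fraction_unique` are hypothetical and not from the thread; it only counts how many rows are singled out by a combination of fields that contain no conventional "identifier" at all.

```python
from collections import Counter

def fraction_unique(rows, fields):
    """Share of rows whose combination of `fields` occurs exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Hypothetical event table: no cookie, IP, or device ID anywhere.
rows = [
    {"url": "/news", "hour": 9, "tz": "UTC+1"},
    {"url": "/news", "hour": 9, "tz": "UTC-8"},
    {"url": "/news", "hour": 14, "tz": "UTC+1"},
    {"url": "/profile/edit?u=dan", "hour": 9, "tz": "UTC-8"},
]

# One field alone singles out few rows; the wide combination singles out all.
narrow = fraction_unique(rows, ["url"])
wide = fraction_unique(rows, ["url", "hour", "tz"])
```

In this toy table a single field isolates one row in four, while the three fields together isolate every row, which is the "wide table" concern in miniature.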





2. The other example is based on rotating hashes. Getting from red to
orange is applying the hash. Getting from orange to green is breaking
the link (un-linkability) by throwing away the salt. No new red data
can be linked to the un-linkable data in the green.
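
A rough sketch of the rotating-hash idea, under stated assumptions: the class name `RotatingPseudonymizer`, the use of HMAC-SHA256, and the 32-byte salt are illustrative choices, not something specified in this thread. Applying the keyed hash corresponds to red -> orange; discarding the old salt on rotation corresponds to orange -> green for the identifier column only.

```python
import hashlib
import hmac
import secrets

class RotatingPseudonymizer:
    """Keyed hash of a unique ID; the salt is held only in memory."""

    def __init__(self):
        self._salt = secrets.token_bytes(32)

    def pseudonym(self, unique_id: str) -> str:
        # Red -> orange: replace the raw ID with a salted (keyed) hash.
        digest = hmac.new(self._salt, unique_id.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()

    def rotate(self) -> None:
        # Orange -> green: the old salt is overwritten and not kept,
        # so new raw IDs can no longer be mapped onto old pseudonyms.
        self._salt = secrets.token_bytes(32)

p = RotatingPseudonymizer()
before = p.pseudonym("cookie-123")
same_period = p.pseudonym("cookie-123")
p.rotate()
after = p.pseudonym("cookie-123")
```

Within one salt period the same ID maps to the same pseudonym; after rotation it does not. This says nothing about the remaining fields of each record, which may still be identifying.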



So I am willing to give up the word unlinkable in the normative
de-identification text, but in exchange non-normative examples should
be added.

I think it's a good suggestion to say that the non-normative examples
should be fleshed out. But I agree that they should suggest a stronger
version of "green" than I understand from your mental model above (which
I hope I'm getting right).





<non-normative text)
De-identification can be accomplished by applying a mental model
(red, orange and green data). Red data is raw event level data (eg
log files with unique identifiers), orange is still linkable but
de-identified data, green is unlinkable and therefore anonymous data.

In order to move from red to orange, or from orange to green, one
needs to pass the barriers by processing. There are multiple ways to
do that:

1. One example is based on concatenating a random number to the
unique ID. This results in a lookup table of unique ID <-> random
number.
Getting from orange to green is breaking the link (un-linkability) by
throwing away the unique ID. No new red data can be linked to the
un-linkable data in the green.

2. Another example is based on rotating hashes. Getting from red to
orange is applying the hash. Getting from orange to green is breaking
the link (un-linkability) by throwing away the salt. No new red data
can be linked to the un-linkable data in the green.
</non-normative text)
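
The first example in the non-normative text (a lookup table of unique ID <-> random number) could be sketched roughly as follows. The function names `to_orange` and `to_green`, the `uid` and `token` field names, and the in-memory dict are assumptions made for illustration only.

```python
import secrets

def to_orange(events, lookup):
    """Red -> orange: swap each unique ID for a random token.

    `lookup` records unique ID -> token, so data stays linkable
    for as long as the table is retained.
    """
    out = []
    for event in events:
        uid = event["uid"]
        if uid not in lookup:
            lookup[uid] = secrets.token_hex(16)
        scrubbed = {k: v for k, v in event.items() if k != "uid"}
        scrubbed["token"] = lookup[uid]
        out.append(scrubbed)
    return out

def to_green(lookup):
    """Orange -> green: discard the unique IDs by emptying the table."""
    lookup.clear()

lookup = {}
events = [
    {"uid": "a", "url": "/x"},
    {"uid": "a", "url": "/y"},
    {"uid": "b", "url": "/z"},
]
orange = to_orange(events, lookup)
to_green(lookup)
later = to_orange([{"uid": "a", "url": "/w"}], lookup)
```

Once the table is emptied, a later event from the same unique ID receives a fresh token, so it cannot be joined to the earlier records through this column.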


Rob


Dan Auerbach schreef op 2013-03-13 19:01:

I also agree that we should just stick with de-identified, just as a
point of nomenclature. For one, unlike what you propose below, Rob,
the FTC text actually defines unlinkability in terms of
de-identification, so I think it would be very confusing if we did the
opposite here.

That said, we did NOT agree at the face-to-face that unlinkability
was a "step beyond de-identified"; we are not at all weakening the
standard with our word choice. For unlinkability and de-identification
both, we do NOT propose a holy grail of provably perfect anonymization
that can't be achieved in practice (or even in theory, really!).
However, for both we require a significantly higher standard than, for
example, keeping a pseudonymous data set of browsing history. The
first non-normative example is intended to make this clear, but I can
flesh it out if it's not.

Dan

On 03/13/2013 10:28 AM, Shane Wiley wrote:

Ed,

Agreed - reasonably attempting to clear unique identifiers or
information that could lead to unique identification in URLs should
also be included.
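
As one possible reading of "clearing unique identifiers in URLs", a minimal Python sketch follows. The parameter list `SUSPECT_PARAMS` and the function `scrub_url` are hypothetical; a real deployment would need its own list and would also have to consider identifiers embedded in the path, which this sketch does not touch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters treated as identifying.
SUSPECT_PARAMS = {"uid", "user", "email", "sessionid", "token"}

def scrub_url(url: str) -> str:
    """Drop suspect query parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SUSPECT_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )

cleaned = scrub_url("http://example.com/search?q=shoes&uid=12345#top")
```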

- Shane

FROM: Edward W. Felten [mailto:felten@CS.Princeton.EDU]
SENT: Wednesday, March 13, 2013 10:22 AM
TO: Justin Brookman
CC: <public-tracking@w3.org>
SUBJECT: Re: ACTION-371: text defining de-identified data

But we should be equally clear that "de-identify" means more than
just removing the most obvious identifiers from the data.

On Wed, Mar 13, 2013 at 1:07 PM, Justin Brookman <justin@cdt.org>
wrote:

Shane is right that we did choose to use "deidentified" instead of
"unlinkable" at the Cambridge meeting. So I agree we probably
should not use "unlinkable" to define "deidentified" in the
standard. However, I don't see why we need to define "unlinkable"
at all, as it has no operational meaning, and was rejected because
it implied a technological impossibility of relinking, which is not
a standard that can be reasonably achieved.

Justin Brookman
Director, Consumer Privacy
Center for Democracy & Technology
tel 202.407.8812 [1]
justin@cdt.org
http://www.cdt.org [2]
@JustinBrookman
@CenDemTech

On 3/13/2013 11:35 AM, Shane Wiley wrote:

Rob,

So we're agreed unlinkability requires more processing than
de-identified - good. I would recommend we define de-identified
(nearly done) and unlinkability separately to clearly demonstrate
they are different points within a continuum. We can then focus on
the discussion of retention of data in its de-identified state
prior to moving to the ultimate unlinkable state.

- Shane

-----Original Message-----
From: Rob van Eijk [mailto:rob@blaeu.com]
Sent: Wednesday, March 13, 2013 8:28 AM
To: Shane Wiley
Cc: public-tracking@w3.org
Subject: RE: ACTION-371: text defining de-identified data

Hi Shane,

I hear you and understand your position. But unlinkable and
de-identified are not mutually exclusive. Unlinkable data is a subset
of de-identified data (they just go through another step of
scrubbing).
Adding it to the list is not hurting your position.

The key towards the middle ground remains data retention, which has
to be proportionate to the purpose.

Rob

Shane Wiley schreef op 2013-03-13 16:13:

Rob,

I thought we had agreed to not mix the "unlinkable" term with
"de-identified" here. In our discussions in Boston it appeared there
was general agreement that unlinkability is a step beyond
de-identified. Once a record has been rendered de-identified, it can
later further be made unlinkable (using your definition of unlinkable
vs. the one I proposed). This is a significant sticking point for
those of us attempting to find middle-ground here so hopefully we can
document the details in non-normative text but I'd ask that we remove
mention of unlinkable in the definition of de-identified at this time
(or else we've not really moved forward in this discussion in my
opinion).

- Shane

-----Original Message-----
From: Rob van Eijk [mailto:rob@blaeu.com]
Sent: Wednesday, March 13, 2013 5:57 AM
To: public-tracking@w3.org
Subject: RE: ACTION-371: text defining de-identified data

Dan, Kevin,

I would really want the unlinkability in there as well. I propose to
add the text: made unlinkable

Normative text: Data can be considered sufficiently de-identified to
the extent that it has been deleted, made unlinkable, modified,
aggregated, anonymized or otherwise manipulated in order to achieve a
reasonable level of justified confidence that the data cannot
reasonably be used to infer information about, or otherwise be linked
to, a particular user, user agent, computer or device.

In terms of privacy by design, de-identification through unlinkability
is the strongest form of de-identification IMHO.

Rob

Kevin Kiley schreef op 2013-03-12 19:03:

Dan,

In case I wasn't being clear in my last post, I (personally) believe
that User-agent should *NOT* be removed from the proposed text.

I actually don't think it would do any harm to *ADD* the word
'Computer' as well ( which is present in the current FTC definition )
so it reads like this…

Normative text:

Data can be considered sufficiently de-identified to the extent that
it has been deleted, modified, aggregated, anonymized or otherwise
manipulated in order to achieve a reasonable level of justified
confidence that the data cannot reasonably be used to infer
information about, or otherwise be linked to, a particular user, user
agent, computer or device.

I think that covers it pretty well, and *NO* 'clarifying text' is
necessary.

Just my 2 cents.

Kevin Kiley

Previous message(s)…

Dan,

Perhaps you can add text clarifying this perspective or, much like
the FTC, suffice with "device" which I believe more than covers what
you're looking for here.

- Shane

From: Dan Auerbach [mailto:dan@eff.org]
Sent: Tuesday, March 12, 2013 8:57 AM
To: public-tracking@w3.org
Subject: Re: ACTION-371: text defining de-identified data

Shane and Kevin -- The phrase "user agent" in the text is intended to
refer to a particular user agent (not "Chrome 26" but rather "the
browser running on Dan's laptop"). I hoped that would be clear from
context, but if it's not we can clarify. I may not be able to
identify your device per se, but can identify that this is the same
browser as I saw before. I think this is the case with using cookies,
for example. It seems more accurate to me than lumping it all under
"device", and appropriate since the text of our document is elsewhere
focused on user agents, unlike the FTC text.

Best,

Dan

On 03/12/2013 12:19 AM, Kevin Kiley wrote:

Shane Wiley wrote...
I had removed "user agent" in the suggested edit as this could be
something as generic as "Chrome 26".

It can also be something VERY specific... and tell you a LOT about
the Computer/OS/Device being used.

In the case of Mobile... it will pretty much tell you EXACTLY what
'Device' is being used.

The FTC likewise does not use "user agent" in their definition.
That's true... but BOTH definitions (W3C and FTC) currently mention
'Device'... and the FTC reports go to great lengths about how
important it is to exclude any knowledge of 'the Device' from the
de-identified data ( especially in the case of 'Mobile Devices' ).

Kevin Kiley

--
Edward W. Felten
Professor of Computer Science and Public Affairs
Director, Center for Information Technology Policy
Princeton University
609-258-5906 http://www.cs.princeton.edu/~felten [3]

--
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134


Links:
------
[1] tel:202.407.8812

[2] http://www.cdt.org

[3] http://www.cs.princeton.edu/%7Efelten








Received on Friday, 15 March 2013 18:16:43 UTC