RE: ACTION-371: text defining de-identified data from Rob van Eijk on 2013-03-15 (public-tracking@w3.org from March 2013)

From: Rob van Eijk <rob@blaeu.com>
Date: Fri, 15 Mar 2013 22:59:49 +0100
To: Shane Wiley <wileys@yahoo-inc.com>, Dan Auerbach <dan@eff.org>, "public-tracking@w3.org" <public-tracking@w3.org>
Message-ID: <4e6f0553-085e-4758-96bd-0185ebe83dde@email.android.com>

Hi Shane,

I would love to embrace the outcome that you describe: that information that has been de-identified not later become identified. So maybe it is due to my lack of grasping the true nature of the approach. Please explain how the following scenario applies to de-identified:

Use case: booking a hotel room after being retargeted:
A user visits a site and looks for info about available hotel rooms in city A for date B. While browsing the web afterwards, the user is confronted with re-targeted, personalized ads showing hotel offers in city A on date B. I understand that up to this point the data can be scrubbed to be de-identified. But when the user decides to act on the offer and makes a reservation, a reservation ID (plus identifiable information) will be tied together with the de-identified data. How does this play out under DNT?
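
To make the concern concrete, here is a minimal sketch in Python of how the link could arise. Everything in it is an assumption for illustration only: the field names, the salted-hash pseudonym, the salt value and the sample records are hypothetical and do not describe any real ad system.

```python
import hashlib

# Hypothetical salt that is still retained by the ad network.
SALT = b"hypothetical-retained-salt"


def pseudonym(cookie_id: str) -> str:
    """Replace a raw cookie ID with a salted hash (the 'scrubbed' identifier)."""
    return hashlib.sha256(SALT + cookie_id.encode("utf-8")).hexdigest()[:16]


# Scrubbed browsing/retargeting events: no raw cookie ID, only a pseudonym.
events = [
    {"pid": pseudonym("cookie-123"), "city": "A", "date": "B"},
    {"pid": pseudonym("cookie-456"), "city": "C", "date": "D"},
]

# The reservation carries identifiable information and, because the user
# arrived via the retargeted ad, the same pseudonym.
reservation = {
    "reservation_id": "R-1",
    "name": "Alice Example",
    "pid": pseudonym("cookie-123"),
}


def link(reservation: dict, events: list) -> list:
    """Return every scrubbed event that shares the reservation's pseudonym."""
    return [e for e in events if e["pid"] == reservation["pid"]]


linked = link(reservation, events)
```

In this sketch, `linked` holds the scrubbed events that now belong to a named person, which is exactly the step I would like to see explained.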

Please walk me through it, for I really want to be sure whether this can fly.

Rob

Shane Wiley <wileys@yahoo-inc.com> wrote:

>Rob,
>
>“no wiggle-room” – this is my core concern with some of this direction.
>The current definition relies on terms such as “reasonable” (matches up
>well with EU concepts of “likely reasonable”).  Much like HIPAA, this
>gives us a risk-based model for de-identification management.  If an
>organization states it’s W3C DNT compliant and articulates their
>de-identification process, I believe it’s important to provide
>“wiggle-room” for organizations to implement de-identification in a
>manner they see appropriate to their particular business model,
>technical tools, administrative and operational processes.  The
>important outcome is that information that has been de-identified not
>later become identified.  If an organization is willing to make that
>public claim and they later prove unable to follow-through on their
>commitment, local legal remedies will take over from there.
>
>As I stated in Berlin, I believe notions of red, yellow, and green are
>problematic as they bring a judgmental lens to these states (red =
>danger, yellow = caution).  I agree with Dan that there should only be
>two states: raw and de-identified.
>
>- Shane
>
>From: Rob van Eijk [mailto:rob@blaeu.com]
>Sent: Friday, March 15, 2013 10:47 AM
>To: Dan Auerbach; public-tracking@w3.org; Shane Wiley
>Subject: Re: ACTION-371: text defining de-identified data
>
>
>Dan,
>
>Thanks for the thoughtful reply.
>I understand now that we are on the same page.
>
>But I doubt that Shane is on that same page as well. If I understand
>Shane's position correctly, his view on de-identified does not come
>close to the green as I would like it to be. I just want to be
>absolutely sure that there is no wiggle-room in what it means to reach
>de-identified.
>
>@Shane: what is your view, taking into account the reply from Dan?
>
>Rob
>
>
>Dan Auerbach <dan@eff.org<mailto:dan@eff.org>> wrote:
>
>My view is that we do NOT need to define a third state of data. We have
>green and red now. If a compelling argument is made that an orange state
>is needed, we can revisit, but I think that existing permitted uses plus
>having a small time frame for processing raw event data are strong
>enough protections to not warrant this third state. Second, regarding
>nomenclature, the FTC definition actually defines unlinkability in terms
>of de-identification, so I think it would be very confusing to stray too
>far from that definitional framework.
>
>A couple further replies inline:
>
>On 03/14/2013 04:09 AM, Justin Brookman wrote:
>
>OK, but as I said before, the standard does not currently envision
>three states of data.  As written, all data pertaining to a network
>communication is in scope, unless it is deidentified,* in which case
>it is out of scope.  You need to propose a third consequence for a new
>class of data for this to have effect.
>
>* Noting that there is still ongoing discussion about what
>"deidentified" actually means, as evidenced by the recent emails from
>Ed, Shane, and Dan.
>
>Justin Brookman
>Director, Consumer Privacy
>Center for Democracy & Technology
>tel 202.407.8812
>justin@cdt.org<mailto:justin@cdt.org>
>http://www.cdt.org
>@JustinBrookman
>@CenDemTech
>
>On 3/14/2013 5:39 AM, Rob van Eijk wrote:
>
>
>In Boston Shane and I discussed the process of de-identification by
>applying it to my mental model (red, orange and green data). Red data
>is raw event level data (eg log files with unique identifiers),
>orange is still linkable but de-identified data, green is unlinkable
>and therefore anonymous data.
>
>We agreed that in order to move from red to orange, or from orange to
>green, one needs to pass the barriers by processing. As seen in the
>de-identification workshop there are multiple ways to do that. I
>illustrated 2 alternative practices:
>
>1. One example is based on concatenating a random number to the
>unique ID. This results in a lookup table of unique ID <-> random
>number.
>Getting from orange to green is breaking the link (un-linkability) by
>throwing away the unique ID. No new red data can be linked to the
>un-linkable data in the green.
>
>I think the trouble with this model is the assumption that the unique ID
>will be the only means of identifying someone. If you'll allow me to
>stick with the conceptual framework of a table for simplicity (think
>mysql table or bigtable), I think we should get away from the mentality
>that there are "identifiers" -- fields like udids, cookies, IPs, phone
>numbers etc. Instead, it is more accurate to say that *every* field of a
>data set provides some bits of identifying information.
>
>An "orange" data set as you describe might still be super identifying,
>if, for example, it is a wide table with lots of fields. As a concrete
>example, URLs can be very identifying in some cases, as can timestamps.
>Even data that you describe as "green" could still be identifying, if I
>understand you correctly. In many instances, having events linked by a
>random irreversible identifier (e.g. discarded salt) is simply not
>enough to ensure that information can't be reasonably obtained about
>users. In some cases it might be, but it depends a lot on the nature of
>the rest of the data in the table.
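
The point that every field contributes identifying information can be illustrated with a minimal sketch in Python. It is only an assumption-laden illustration: the function name `unique_fraction`, the column names and the sample rows are hypothetical, and counting unique field combinations is just one simple way to look at how identifying a table is.

```python
from collections import Counter


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` occurs exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)


# Hypothetical event table with no explicit "identifier" column at all.
rows = [
    {"url": "/hotels/city-a", "hour": 9, "ua": "Chrome 26"},
    {"url": "/hotels/city-a", "hour": 9, "ua": "Chrome 26"},
    {"url": "/hotels/city-a", "hour": 10, "ua": "Chrome 26"},
    {"url": "/profile/dan", "hour": 9, "ua": "Firefox 19"},
]

narrow = unique_fraction(rows, ["ua"])
wide = unique_fraction(rows, ["url", "hour", "ua"])
```

With these sample rows, the wider the set of fields considered, the larger the share of rows that stand alone, even though no cookie, IP or device ID is present.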
>
>2. The other example is based on rotating hashes. Getting from red to
>orange is applying the hash. Getting from orange to green is breaking
>the link (un-linkability) by throwing away the salt. No new red data
>can be linked to the un-linkable data in the green.
>
>
>
>So I am willing to give up the word unlinkable in the normative
>de-identification text, but in exchange non-normative examples should
>be added.
>
>I think it's a good suggestion to say that the non-normative examples
>should be fleshed out. But I agree that they should suggest a stronger
>version of "green" than I understand from your mental model above (which
>I hope I'm getting right).
>
>
>
>
>
><non-normative text)
>De-identification can be accomplished by applying a mental model
>(red, orange and green data). Red data is raw event level data (eg
>log files with unique identifiers), orange is still linkable but
>de-identified data, green is unlinkable and therefore anonymous data.
>
>In order to move from red to orange, or from orange to green, one
>needs to pass the barriers by processing. There are multiple ways to
>do that:
>
>1. One example is based on concatenating a random number to the
>unique ID. This results in a lookup table of unique ID <-> random
>number.
>Getting from orange to green is breaking the link (un-linkability) by
>throwing away the unique ID. No new red data can be linked to the
>un-linkable data in the green.
>
>2. Another example is based on rotating hashes. Getting from red to
>orange is applying the hash. Getting from orange to green is breaking
>the link (un-linkability) by throwing away the salt. No new red data
>can be linked to the un-linkable data in the green.
></non-normative text)
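
The two practices in the non-normative text can be sketched in Python as follows. This is one possible reading under stated assumptions, not a definitive implementation: the class names, the 128-bit random token, and the use of HMAC-SHA256 with the salt as key are all choices made for this illustration and are not specified anywhere in the thread.

```python
import hashlib
import hmac
import secrets


class LookupTablePseudonymizer:
    """Practice 1 (as read here): a lookup table of unique ID <-> random number."""

    def __init__(self):
        self._table = {}  # unique ID -> random token

    def pseudonymize(self, unique_id: str) -> str:
        # Red -> orange: replace the unique ID by its random token.
        if unique_id not in self._table:
            self._table[unique_id] = secrets.token_hex(16)
        return self._table[unique_id]

    def discard_table(self) -> None:
        # Orange -> green: throw away the unique IDs, breaking the link.
        self._table.clear()


class RotatingSaltHasher:
    """Practice 2 (as read here): a keyed hash whose salt is rotated and discarded."""

    def __init__(self):
        self._salt = secrets.token_bytes(32)

    def pseudonymize(self, unique_id: str) -> str:
        # Red -> orange: apply the salted hash.
        return hmac.new(self._salt, unique_id.encode("utf-8"), hashlib.sha256).hexdigest()

    def rotate(self) -> None:
        # Orange -> green: the old salt is overwritten and never stored.
        self._salt = secrets.token_bytes(32)


table = LookupTablePseudonymizer()
token_before = table.pseudonymize("user-42")
table.discard_table()
token_after = table.pseudonymize("user-42")

hasher = RotatingSaltHasher()
hash_before = hasher.pseudonymize("user-42")
hasher.rotate()
hash_after = hasher.pseudonymize("user-42")
```

In both sketches, new raw events for the same unique ID get a different pseudonym once the table or salt is gone, so they cannot be joined to the older records through that field. As noted elsewhere in this thread, that says nothing about the other fields in the data set, which may still be identifying.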
>
>
>Rob
>
>
>Dan Auerbach schreef op 2013-03-13 19:01:
>
>I also agree that we should just stick with de-identified, just as a
>point of nomenclature. For one, unlike what you propose below, Rob,
>the FTC text actually defines unlinkability in terms of
>de-identification, so I think it would be very confusing if we did the
>opposite here.
>
>That said, we did NOT agree at the face-to-face that unlinkability
>was a "step beyond de-identified"; we are not at all weakening the
>standard with our word choice. For unlinkability and de-identification
>both, we do NOT propose a holy grail of provably perfect anonymization
>that can't be achieved in practice (or even in theory, really!).
>However, for both we require a significantly higher standard than, for
>example, keeping a pseudonymous data set of browsing history. The
>first non-normative example is intended to make this clear, but I can
>flesh it out if it's not.
>
>Dan
>
>On 03/13/2013 10:28 AM, Shane Wiley wrote:
>
>Ed,
>
>Agreed - reasonably attempting to clear unique identifiers or
>information that could lead to unique identification in URLs should
>also be included.
>
>- Shane
>
>FROM: Edward W. Felten [mailto:felten@CS.Princeton.EDU]
>SENT: Wednesday, March 13, 2013 10:22 AM
>TO: Justin Brookman
>CC: <public-tracking@w3.org<mailto:public-tracking@w3.org>>
>SUBJECT: Re: ACTION-371: text defining de-identified data
>
>But we should be equally clear that "de-identify" means more than
>just removing the most obvious identifiers from the data.
>
>On Wed, Mar 13, 2013 at 1:07 PM, Justin Brookman
><justin@cdt.org<mailto:justin@cdt.org>>
>wrote:
>
>Shane is right that we did choose to use "deidentified" instead of
>"unlinkable" at the Cambridge meeting. So I agree we probably
>should not use "unlinkable" to define "deidentified" in the
>standard. However, I don't see why we need to define "unlinkable"
>at all, as it has no operational meaning, and was rejected because
>it implied a technological impossibility of relinking, which is not
>a standard that can be reasonably achieved.
>
>Justin Brookman
>Director, Consumer Privacy
>Center for Democracy & Technology
>tel 202.407.8812 [1]
>justin@cdt.org<mailto:justin@cdt.org>
>http://www.cdt.org [2]
>@JustinBrookman
>@CenDemTech
>
>On 3/13/2013 11:35 AM, Shane Wiley wrote:
>
>Rob,
>
>So we're agreed unlinkability requires more processing than
>de-identified - good. I would recommend we define de-identified
>(nearly done) and unlinkability separately to clearly demonstrate
>they are different points within a continuum. We can then focus on
>the discussion of retention of data in its de-identified state
>prior to moving to the ultimate unlinkable state.
>
>- Shane
>
>-----Original Message-----
>From: Rob van Eijk [mailto:rob@blaeu.com]
>Sent: Wednesday, March 13, 2013 8:28 AM
>To: Shane Wiley
>Cc: public-tracking@w3.org<mailto:public-tracking@w3.org>
>Subject: RE: ACTION-371: text defining de-identified data
>
>Hi Shane,
>
>I hear you and understand your position. But unlinkable and
>de-identified are not mutually exclusive. Unlinkable data is a subset
>of de-identified data (they just go through another step of
>scrubbing).
>Adding it to the list is not hurting your position.
>
>The key towards the middle ground remains data retention, which has
>to be proportionate to the purpose.
>
>Rob
>
>Shane Wiley schreef op 2013-03-13 16:13:
>
>Rob,
>
>I thought we had agreed to not mix the "unlinkable" term with
>"de-identified" here. In our discussions in Boston it appeared there
>was general agreement that unlinkability is a step beyond
>de-identified. Once a record has been rendered de-identified, it can
>later further be made unlinkable (using your definition of unlinkable
>vs. the one I proposed). This is a significant sticking point for
>those of us attempting to find middle-ground here so hopefully we can
>document the details in non-normative text but I'd ask that we remove
>mention of unlinkable in the definition of de-identified at this time
>(or else we've not really moved forward in this discussion in my
>opinion).
>
>- Shane
>
>-----Original Message-----
>From: Rob van Eijk [mailto:rob@blaeu.com]
>Sent: Wednesday, March 13, 2013 5:57 AM
>To: public-tracking@w3.org<mailto:public-tracking@w3.org>
>Subject: RE: ACTION-371: text defining de-identified data
>
>Dan, Kevin,
>
>I would really want the unlinkability in there as well. I propose to
>add the text: made unlinkable
>
>Normative text: Data can be considered sufficiently de-identified to
>the extent that it has been deleted, made unlinkable, modified,
>aggregated, anonymized or otherwise manipulated in order to achieve a
>reasonable level of justified confidence that the data cannot
>reasonably be used to infer information about, or otherwise be linked
>to, a particular user, user agent, computer or device.
>
>In terms of privacy by design, de-identification through unlinkability
>is the strongest form of de-identification IMHO.
>
>Rob
>
>Kevin Kiley schreef op 2013-03-12 19:03:
>
>Dan,
>
>In case I wasn't being clear in my last post, I (personally) believe
>that User-agent should *NOT* be removed from the proposed text.
>
>I actually don't think it would do any harm to *ADD* the word
>'Computer' as well ( which is present in the current FTC definition )
>so it reads like this…
>
>Normative text:
>
>Data can be considered sufficiently de-identified to the extent that
>it has been deleted, modified, aggregated, anonymized or otherwise
>manipulated in order to achieve a reasonable level of justified
>confidence that the data cannot reasonably be used to infer
>information about, or otherwise be linked to, a particular user, user
>agent, computer or device.
>
>I think that covers it pretty well, and *NO* 'clarifying text' is
>necessary.
>
>Just my 2 cents.
>
>Kevin Kiley
>
>Previous message(s)…
>
>Dan,
>
>Perhaps you can add text clarifying this perspective or, much like
>the FTC, suffice with "device" which I believe more than covers what
>you're looking for here.
>
>- Shane
>
>From: Dan Auerbach [mailto:dan@eff.org]
>Sent: Tuesday, March 12, 2013 8:57 AM
>To: public-tracking@w3.org<mailto:public-tracking@w3.org>
>Subject: Re: ACTION-371: text defining de-identified data
>
>Shane and Kevin -- The phrase "user agent" in the text is intended to
>refer to a particular user agent (not "Chrome 26" but rather "the
>browser running on Dan's laptop"). I hoped that would be clear from
>context, but if it's not we can clarify. I may not be able to
>identify your device per se, but can identify that this is the same
>browser as I saw before. I think this is the case with using cookies,
>for example. It seems more accurate to me than lumping it all under
>"device", and appropriate since the text of our document is elsewhere
>focused on user agents, unlike the FTC text.
>
>Best,
>
>Dan
>
>On 03/12/2013 12:19 AM, Kevin Kiley wrote:
>
>Shane Wiley wrote...
>I had removed "user agent" in the suggested edit as this could be
>something as generic as "Chrome 26".
>
>It can also be something VERY specific... and tell you a LOT about
>the Computer/OS/Device being used.
>
>In the case of Mobile... it will pretty much tell you EXACTLY what
>'Device' is being used.
>
>The FTC likewise does not use "user agent" in their definition.
>That's true... but BOTH definitions (W3C and FTC) currently mention
>'Device'... and the FTC reports go to great lengths about how
>important it is to exclude any knowledge of 'the Device' from the
>de-identified data ( especially in the case of 'Mobile Devices' ).
>
>Kevin Kiley
>
>--
>Edward W. Felten
>Professor of Computer Science and Public Affairs
>Director, Center for Information Technology Policy
>Princeton University
>609-258-5906 http://www.cs.princeton.edu/~felten [3]
>
>--
>Dan Auerbach
>Staff Technologist
>Electronic Frontier Foundation
>dan@eff.org<mailto:dan@eff.org>
>415 436 9333 x134
>
>
>Links:
>------
>[1] tel:202.407.8812
>[2] http://www.cdt.org
>[3] http://www.cs.princeton.edu/%7Efelten
Received on Friday, 15 March 2013 22:00:29 UTC