Re: ACTION-371: text defining de-identified data from Rob van Eijk on 2013-03-15 (public-tracking@w3.org from March 2013)

From: Rob van Eijk <rob@blaeu.com>
Date: Fri, 15 Mar 2013 18:47:15 +0100
To: Dan Auerbach <dan@eff.org>, public-tracking@w3.org, Shane Wiley <wileys@yahoo-inc.com>
Message-ID: <1a7e924e-cc41-4111-953c-49213dd964f8@email.android.com>
Dan,

Thanks for the thoughtful reply.
I understand now that we are on the same page. 

But I doubt that Shane is on the same page. If I understand Shane's position correctly, his view of de-identified does not come as close to green as I would like it to. I just want to be absolutely sure that there is no wiggle room in what it means to reach de-identified.

@Shane: what is your view, taking into account the reply from Dan?

Rob



Dan Auerbach <dan@eff.org> wrote:

>My view is that we do NOT need to define a third state of data. We have
>green and red now. If a compelling argument is made that an orange state
>is needed, we can revisit, but I think that existing permitted uses plus
>having a small time frame for processing raw event data are strong
>enough protections to not warrant this third state. Second, regarding
>nomenclature, the FTC definition actually defines unlinkability in terms
>of de-identification, so I think it would be very confusing to stray too
>far from that definitional framework.
>
>A couple further replies inline:
>
>On 03/14/2013 04:09 AM, Justin Brookman wrote:
>> OK, but as I said before, the standard does not currently envision
>> three states of data.  As written, all data pertaining to a network
>> communication is in scope, unless it is deidentified,* in which case
>> it is out of scope.  You need to propose a third consequence for a new
>> class of data for this to have effect.
>>
>> * Noting that there is still ongoing discussion about what
>> "deidentified" actually means, as evidenced by the recent emails from
>> Ed, Shane, and Dan.
>>
>> Justin Brookman
>> Director, Consumer Privacy
>> Center for Democracy & Technology
>> tel 202.407.8812
>> justin@cdt.org
>> http://www.cdt.org
>> @JustinBrookman
>> @CenDemTech
>>
>> On 3/14/2013 5:39 AM, Rob van Eijk wrote:
>>>
>>>
>>> In Boston Shane and I discussed the process of de-identification by
>>> applying it to my mental model (red, orange and green data). Red data
>>> is raw event level data (eg log files with unique identifiers),
>>> orange is still linkable but de-identified data, green is unlinkable
>>> and therefore anonymous data.
>>>
>>> We agreed that in order to move from red to orange, or from orange to
>>> green, one needs to pass the barriers by processing. As seen in the
>>> de-identification workshop there are multiple ways to do that. I
>>> illustrated 2 alternative practices:
>>>
>>> 1. One example is based on concatenating a random number to the
>>> unique ID. This results in a lookup table of unique ID <-> random
>>> number.
>>> Getting from orange to green is breaking the link (un-linkability) by
>>> throwing away the unique ID. No new red data can be linked to the
>>> un-linkable data in the green.
>I think the trouble with this model is the assumption that the unique ID
>will be the only means of identifying someone. If you'll allow me to
>stick with the conceptual framework of a table for simplicity (think
>mysql table or bigtable), I think we should get away from the mentality
>that there are "identifiers" -- fields like udids, cookies, IPs, phone
>numbers etc. Instead, it is more accurate to say that *every* field of a
>data set provides some bits of identifying information.
>
>An "orange" data set as you describe might still be super identifying,
>if, for example, it is a wide table with lots of fields. As a concrete
>example, URLs can be very identifying in some cases, as can timestamps.
>Even data that you describe as "green" could still be identifying, if I
>understand you correctly. In many instances, having events linked by a
>random irreversible identifier (e.g. discarded salt) is simply not
>enough to ensure that information can't be reasonably obtained about
>users. In some cases it might be, but it depends a lot on the nature of
>the rest of the data in the table.
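
The point that every field carries some identifying information can be made concrete with a small count. The following is only a sketch under stated assumptions: the `events` table, its column names, and the helper `unique_fraction` are hypothetical and do not come from any working group text. It measures what fraction of rows is singled out by a combination of fields, none of which is an "identifier" in the usual sense.

```python
from collections import Counter


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` values occurs exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    singletons = sum(1 for key in keys if counts[key] == 1)
    return singletons / len(rows)


# Hypothetical event table with no udid, cookie, IP, or phone number column.
events = [
    {"url": "/news", "hour": 9, "browser": "Firefox"},
    {"url": "/news", "hour": 9, "browser": "Chrome"},
    {"url": "/news", "hour": 10, "browser": "Firefox"},
    {"url": "/account/settings", "hour": 9, "browser": "Firefox"},
]

print(unique_fraction(events, ["url"]))                     # 0.25
print(unique_fraction(events, ["url", "hour"]))             # 0.5
print(unique_fraction(events, ["url", "hour", "browser"]))  # 1.0
```

In this toy table each added column singles out more rows, which is the sense in which a wide table can remain identifying even after the obvious identifiers are gone.
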
>
>>>
>>>
>>> 2. The other example is based on rotating hashes. Getting from red to
>>> orange is applying the hash. Getting from orange to green is breaking
>>> the link (un-linkability) by throwing away the salt. No new red data
>>> can be linked to the un-linkable data in the green.
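
A minimal sketch of the rotating-hash practice, assuming a keyed hash (HMAC-SHA256) with a random per-period salt. The class name `RotatingPseudonymizer` and its methods are hypothetical, chosen for illustration only. Applying the hash corresponds to the red-to-orange step, and discarding the salt corresponds to the orange-to-green step.

```python
import hashlib
import hmac
import secrets


class RotatingPseudonymizer:
    """Keyed hashing of unique IDs with a salt that can be rotated or discarded."""

    def __init__(self):
        # Random salt for the current period (assumed 32 bytes).
        self._salt = secrets.token_bytes(32)

    def pseudonym(self, unique_id: str) -> str:
        """Red -> orange: replace the unique ID by a salted hash."""
        if self._salt is None:
            raise RuntimeError("salt was discarded; new data can no longer be linked")
        digest = hmac.new(self._salt, unique_id.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()

    def rotate(self) -> None:
        """Start a new period: pseudonyms from earlier periods no longer match."""
        self._salt = secrets.token_bytes(32)

    def discard_salt(self) -> None:
        """Orange -> green: without the salt, no new red data can be hashed to match."""
        self._salt = None


p = RotatingPseudonymizer()
first = p.pseudonym("cookie-123")
same_period = p.pseudonym("cookie-123")   # equal to `first`: still linkable
p.rotate()
next_period = p.pseudonym("cookie-123")   # differs from `first` after rotation
p.discard_salt()
```

This only addresses the identifier column. The remaining fields of each record may still be identifying, so discarding the salt is not by itself a guarantee of "green" data.
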
>>>
>>>
>>>
>>> So I am willing to give up the word unlinkable in the normative
>>> de-identification text, but in exchange non-normative examples should
>>> be added.
>I think it's a good suggestion to say that the non-normative examples
>should be fleshed out. But I agree that they should suggest a stronger
>version of "green" than I understand from your mental model above (which
>I hope I'm getting right).
>
>
>>>
>>>
>>>
>>>
>>> <non-normative text>
>>> De-identification can be accomplished by applying a mental model
>>> (red, orange and green data). Red data is raw event level data (eg
>>> log files with unique identifiers), orange is still linkable but
>>> de-identified data, green is unlinkable and therefore anonymous data.
>>>
>>> In order to move from red to orange, or from orange to green, one
>>> needs to pass the barriers by processing. There are multiple ways to
>>> do that:
>>>
>>> 1. One example is based on concatenating a random number to the
>>> unique ID. This results in a lookup table of unique ID <-> random
>>> number.
>>> Getting from orange to green is breaking the link (un-linkability) by
>>> throwing away the unique ID. No new red data can be linked to the
>>> un-linkable data in the green.
>>>
>>> 2. Another example is based on rotating hashes. Getting from red to
>>> orange is applying the hash. Getting from orange to green is breaking
>>> the link (un-linkability) by throwing away the salt. No new red data
>>> can be linked to the un-linkable data in the green.
>>> </non-normative text>
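
The lookup-table practice (example 1) could be read as follows. This is one possible interpretation, not the proposal's definitive meaning: the functions `to_orange` and `to_green`, the `unique_id` and `pseudonym` field names, and the choice of a 128-bit random token are all assumptions made for illustration.

```python
import secrets


def to_orange(red_rows, id_field="unique_id"):
    """Red -> orange: swap each unique ID for a random number, keeping a lookup table."""
    lookup = {}   # unique ID <-> random number
    orange = []
    for row in red_rows:
        uid = row[id_field]
        if uid not in lookup:
            lookup[uid] = secrets.token_hex(16)
        new_row = {k: v for k, v in row.items() if k != id_field}
        new_row["pseudonym"] = lookup[uid]
        orange.append(new_row)
    return orange, lookup


def to_green(orange_rows, lookup):
    """Orange -> green: throw away the unique IDs by emptying the lookup table."""
    lookup.clear()
    return [dict(row) for row in orange_rows]


# Hypothetical raw event data.
red = [
    {"unique_id": "a", "url": "/x"},
    {"unique_id": "a", "url": "/y"},
    {"unique_id": "b", "url": "/x"},
]
orange, table = to_orange(red)
green = to_green(orange, table)
```

While the table exists, new red data for the same unique ID maps to the same random number, so the data is still linkable. Once the table is emptied, new red data can no longer be joined to the existing rows through the identifier, although other fields may still allow it.
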
>>>
>>>
>>> Rob
>>>
>>>
>>> Dan Auerbach schreef op 2013-03-13 19:01:
>>>> I also agree that we should just stick with de-identified, just as a
>>>> point of nomenclature. For one, unlike what you propose below, Rob,
>>>> the FTC text actually defines unlinkability in terms of
>>>> de-identification, so I think it would be very confusing if we did the
>>>> opposite here.
>>>>
>>>>  That said, we did NOT agree at the face-to-face that unlinkability
>>>> was a "step beyond de-identified"; we are not at all weakening the
>>>> standard with our word choice. For unlinkability and de-identification
>>>> both, we do NOT propose a holy grail of provably perfect anonymization
>>>> that can't be achieved in practice (or even in theory, really!).
>>>> However, for both we require a significantly higher standard than, for
>>>> example, keeping a pseudonymous data set of browsing history. The
>>>> first non-normative example is intended to make this clear, but I can
>>>> flesh it out if it's not.
>>>>
>>>>  Dan
>>>>
>>>>  On 03/13/2013 10:28 AM, Shane Wiley wrote:
>>>>
>>>>> Ed,
>>>>>
>>>>> Agreed - reasonably attempting to clear unique identifiers or
>>>>> information that could lead to unique identification in URLs should
>>>>> also be included.
>>>>>
>>>>> - Shane
>>>>>
>>>>> FROM: Edward W. Felten [mailto:felten@CS.Princeton.EDU]
>>>>> SENT: Wednesday, March 13, 2013 10:22 AM
>>>>> TO: Justin Brookman
>>>>> CC: <public-tracking@w3.org>
>>>>> SUBJECT: Re: ACTION-371: text defining de-identified data
>>>>>
>>>>> But we should be equally clear that "de-identify" means more than
>>>>> just removing the most obvious identifiers from the data.
>>>>>
>>>>> On Wed, Mar 13, 2013 at 1:07 PM, Justin Brookman <justin@cdt.org>
>>>>> wrote:
>>>>>
>>>>> Shane is right that we did choose to use "deidentified" instead of
>>>>> "unlinkable" at the Cambridge meeting. So I agree we probably
>>>>> should not use "unlinkable" to define "deidentified" in the
>>>>> standard. However, I don't see why we need to define "unlinkable"
>>>>> at all, as it has no operational meaning, and was rejected because
>>>>> it implied a technological impossibility of relinking, which is not
>>>>> a standard that can be reasonably achieved.
>>>>>
>>>>> Justin Brookman
>>>>> Director, Consumer Privacy
>>>>> Center for Democracy & Technology
>>>>> tel 202.407.8812 [1]
>>>>> justin@cdt.org
>>>>> http://www.cdt.org [2]
>>>>> @JustinBrookman
>>>>> @CenDemTech
>>>>>
>>>>> On 3/13/2013 11:35 AM, Shane Wiley wrote:
>>>>>
>>>>> Rob,
>>>>>
>>>>> So we're agreed unlinkability requires more processing than
>>>>> de-identified - good. I would recommend we define de-identified
>>>>> (nearly done) and unlinkability separately to clearly demonstrate
>>>>> they are different points within a continuum. We can then focus on
>>>>> the discussion of retention of data in its de-identified state
>>>>> prior to moving to the ultimate unlinkable state.
>>>>>
>>>>> - Shane
>>>>>
>>>>> -----Original Message-----
>>>>> From: Rob van Eijk [mailto:rob@blaeu.com]
>>>>> Sent: Wednesday, March 13, 2013 8:28 AM
>>>>> To: Shane Wiley
>>>>> Cc: public-tracking@w3.org
>>>>> Subject: RE: ACTION-371: text defining de-identified data
>>>>>
>>>>> Hi Shane,
>>>>>
>>>>> I hear you and understand your position. But unlinkable and
>>>>> de-identified are not mutually exclusive. Unlinkable data is a subset
>>>>> of de-identified data (they just go through another step of
>>>>> scrubbing).
>>>>> Adding it to the list is not hurting your position.
>>>>>
>>>>> The key towards the middle ground remains data retention, which has
>>>>> to be proportionate to the purpose.
>>>>>
>>>>> Rob
>>>>>
>>>>> Shane Wiley schreef op 2013-03-13 16:13:
>>>>>
>>>>> Rob,
>>>>>
>>>>> I thought we had agreed to not mix the "unlinkable" term with
>>>>> "de-identified" here. In our discussions in Boston it appeared there
>>>>> was general agreement that unlinkability is a step beyond
>>>>> de-identified. Once a record has been rendered de-identified, it can
>>>>> later further be made unlinkable (using your definition of unlinkable
>>>>> vs. the one I proposed). This is a significant sticking point for
>>>>> those of us attempting to find middle-ground here so hopefully we can
>>>>> document the details in non-normative text but I'd ask that we remove
>>>>> mention of unlinkable in the definition of de-identified at this time
>>>>> (or else we've not really moved forward in this discussion in my
>>>>> opinion).
>>>>>
>>>>> - Shane
>>>>>
>>>>> -----Original Message-----
>>>>> From: Rob van Eijk [mailto:rob@blaeu.com]
>>>>> Sent: Wednesday, March 13, 2013 5:57 AM
>>>>> To: public-tracking@w3.org
>>>>> Subject: RE: ACTION-371: text defining de-identified data
>>>>>
>>>>> Dan, Kevin,
>>>>>
>>>>> I would really want the unlinkability in there as well. I propose to
>>>>> add the text: made unlinkable
>>>>>
>>>>> Normative text: Data can be considered sufficiently de-identified to
>>>>> the extent that it has been deleted, made unlinkable, modified,
>>>>> aggregated, anonymized or otherwise manipulated in order to achieve a
>>>>> reasonable level of justified confidence that the data cannot
>>>>> reasonably be used to infer information about, or otherwise be linked
>>>>> to, a particular user, user agent, computer or device.
>>>>>
>>>>> In terms of privacy by design, de-identification through unlinkability
>>>>> is the strongest form of de-identification IMHO.
>>>>>
>>>>> Rob
>>>>>
>>>>> Kevin Kiley schreef op 2013-03-12 19:03:
>>>>>
>>>>> Dan,
>>>>>
>>>>> In case I wasn't being clear in my last post, I (personally) believe
>>>>> that User-agent should *NOT* be removed from the proposed text.
>>>>>
>>>>> I actually don't think it would do any harm to *ADD* the word
>>>>> 'Computer'
>>>>>
>>>>> as well ( which is present in the current FTC definition ) so it
>>>>> reads like this…
>>>>>
>>>>> Normative text:
>>>>>
>>>>> Data can be considered sufficiently de-identified to the extent that
>>>>> it has been deleted, modified, aggregated, anonymized or otherwise
>>>>> manipulated in order to achieve a reasonable level of justified
>>>>> confidence that the data cannot reasonably be used to infer
>>>>> information about, or otherwise be linked to, a particular user, user
>>>>> agent, computer or device.
>>>>>
>>>>> I think that covers it pretty well, and *NO* 'clarifying text' is
>>>>> necessary.
>>>>>
>>>>> Just my 2 cents.
>>>>>
>>>>> Kevin Kiley
>>>>>
>>>>> Previous message(s)…
>>>>>
>>>>> Dan,
>>>>>
>>>>> Perhaps you can add text clarifying this perspective or, much like
>>>>> the FTC, suffice with "device" which I believe more than covers what
>>>>> you're looking for here.
>>>>>
>>>>> - Shane
>>>>>
>>>>> From: Dan Auerbach [mailto:dan@eff.org]
>>>>>
>>>>> Sent: Tuesday, March 12, 2013 8:57 AM
>>>>>
>>>>> To: public-tracking@w3.org
>>>>>
>>>>> Subject: Re: ACTION-371: text defining de-identified data
>>>>>
>>>>> Shane and Kevin -- The phrase "user agent" in the text is intended to
>>>>> refer to a particular user agent (not "Chrome 26" but rather "the
>>>>> browser running on Dan's laptop"). I hoped that would be clear from
>>>>> context, but if it's not we can clarify. I may not be able to
>>>>> identify your device per se, but can identify that this is the same
>>>>> browser as I saw before. I think this is the case with using cookies,
>>>>> for example. It seems more accurate to me than lumping it all under
>>>>> "device", and appropriate since the text of our document is elsewhere
>>>>> focused on user agents, unlike the FTC text.
>>>>>
>>>>> Best,
>>>>>
>>>>> Dan
>>>>>
>>>>> On 03/12/2013 12:19 AM, Kevin Kiley wrote:
>>>>>
>>>>>> Shane Wiley wrote...
>>>>>> I had removed "user agent" in the suggested edit as this could be
>>>>>> something as generic as "Chrome 26".
>>>>>
>>>>> It can also be something VERY specific... and tell you a LOT about
>>>>> the Computer/OS/Device being used.
>>>>>
>>>>> In the case of Mobile... it will pretty much tell you EXACTLY what
>>>>> 'Device' is being used.
>>>>>
>>>>>> The FTC likewise does not use "user agent" in their definition.
>>>>>
>>>>> That's true... but BOTH definitions (W3C and FTC) currently mention
>>>>> 'Device'... and the FTC reports go to great lengths about how
>>>>> important it is to exclude any knowledge of 'the Device' from the
>>>>> de-identified data ( especially in the case of 'Mobile Devices' ).
>>>>>
>>>>> Kevin Kiley
>>>>>
>>>>> -- 
>>>>> Edward W. Felten
>>>>> Professor of Computer Science and Public Affairs
>>>>> Director, Center for Information Technology Policy
>>>>> Princeton University
>>>>> 609-258-5906 http://www.cs.princeton.edu/~felten [3]
>>>>
>>>> -- 
>>>> Dan Auerbach
>>>> Staff Technologist
>>>> Electronic Frontier Foundation
>>>> dan@eff.org
>>>> 415 436 9333 x134
>>>>
>>>>
>>>> Links:
>>>> ------
>>>> [1] tel:202.407.8812
>>>> [2] http://www.cdt.org
>>>> [3] http://www.cs.princeton.edu/%7Efelten
>>>
>>>
>>>
>>
>>
>>
>
>
>-- 
>Dan Auerbach
>Staff Technologist
>Electronic Frontier Foundation
>dan@eff.org
>415 436 9333 x134
Received on Friday, 15 March 2013 17:47:54 UTC