Re: ACTION-371: text defining de-identified data from Rob van Eijk on 2013-03-14 (public-tracking@w3.org from March 2013)

From: Rob van Eijk <rob@blaeu.com>
Date: Thu, 14 Mar 2013 13:08:18 +0100
To: <rob@blaeu.com>
Cc: Justin Brookman <justin@cdt.org>, <public-tracking@w3.org>
Message-ID: <7cff37d5779fcd900b42efb99a06e674@xs4all.nl>
typo, sorry.

The reason why this is useful for Comcast/Nielsen is that data
aggregation could be done in a 24-hour window in the ORANGE domain,
and after 24 hours data will be kept in the GREEN domain. Being able
to express this with DNT compliance will remove (in my view) the
necessity for an exception for audience science.


Rob

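A rough sketch of the two practices described in the quoted non-normative text below (the lookup table and the rotating hash) may help make the red/orange/green steps concrete. Everything named here is an assumption for illustration: the class and method names are hypothetical, a keyed HMAC-SHA256 stands in for "applying the hash" with a salt, and the 24-hour rotation is only implied by calling rotate() once per window. It is not text proposed for the standard, and whether the result is really unlinkable also depends on the rest of the record (URLs, timestamps and so on), which this sketch ignores.

```python
import hashlib
import hmac
import secrets


class LookupTablePseudonymizer:
    """Example 1 (sketch): unique ID <-> random number via a lookup table."""

    def __init__(self):
        # While this table exists the output is ORANGE: still linkable.
        self._table = {}

    def to_orange(self, unique_id: str) -> str:
        # The same unique ID maps to the same random token.
        if unique_id not in self._table:
            self._table[unique_id] = secrets.token_hex(16)
        return self._table[unique_id]

    def discard_table(self) -> None:
        # Throwing away the unique IDs breaks the link: tokens already
        # issued become GREEN, and new red data gets unrelated tokens.
        self._table = {}


class RotatingHashPseudonymizer:
    """Example 2 (sketch): keyed hash whose salt is thrown away on rotation."""

    def __init__(self):
        self._salt = secrets.token_bytes(32)

    def to_orange(self, unique_id: str) -> str:
        # Red -> orange: apply the hash. HMAC is an assumption here; the
        # original text only says "hash" and "salt".
        return hmac.new(self._salt, unique_id.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    def rotate(self) -> None:
        # Orange -> green: the old salt is overwritten, so hashes from the
        # previous window can no longer be recomputed from new red data.
        self._salt = secrets.token_bytes(32)


if __name__ == "__main__":
    hasher = RotatingHashPseudonymizer()
    in_window = hasher.to_orange("cookie-abc")
    print(in_window == hasher.to_orange("cookie-abc"))  # linkable in window
    hasher.rotate()  # e.g. called once every 24 hours
    print(in_window == hasher.to_orange("cookie-abc"))  # no longer linkable
```

In this sketch the 24-hour ORANGE window corresponds to the lifetime of one salt (or one lookup table); aggregation would have to finish before rotate() or discard_table() is called.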
Rob van Eijk wrote on 2013-03-14 13:01:
> OK, so for the formality of the group process: I propose three states
> of data.
> 
> The reason why this is useful for Comcast/Nielsen, is that data
> aggregation could be done in a 24-hour window in the orange domain,
> and after 24 hours data will be kept in the red domain.
> Being able to express this with DNT compliance will remove (in my
> view) the necessity for an exception for audience science.
> 
> Rob
> 
> 
> 
> Justin Brookman wrote on 2013-03-14 12:09:
>> OK, but as I said before, the standard does not currently envision
>> three states of data.  As written, all data pertaining to a network
>> communication is in scope, unless it is deidentified,* in which case
>> it is out of scope.  You need to propose a third consequence for a new
>> class of data for this to have effect.
>> * Noting that there is still ongoing discussion about what
>> "deidentified" actually means, as evidenced by the recent emails from
>> Ed, Shane, and Dan.
>> Justin Brookman
>> Director, Consumer Privacy
>> Center for Democracy & Technology
>> tel 202.407.8812
>> justin@cdt.org
>> http://www.cdt.org
>> @JustinBrookman
>> @CenDemTech
>> On 3/14/2013 5:39 AM, Rob van Eijk wrote:
>>> 
>>> In Boston Shane and I discussed the process of de-identification by
>>> applying it to my mental model (red, orange and green data). Red data
>>> is raw event-level data (e.g. log files with unique identifiers),
>>> orange is still linkable but de-identified data, green is unlinkable
>>> and therefore anonymous data.
>>> We agreed that in order to move from red to orange, or from orange
>>> to green, one needs to pass the barriers by processing. As seen in
>>> the de-identification workshop there are multiple ways to do that. I
>>> illustrated 2 alternative practices:
>>> 1. One example is based on concatenating a random number to the
>>> unique ID. This results in a lookup table of unique ID <-> random
>>> number.
>>> Getting from orange to green is breaking the link (un-linkability) by
>>> throwing away the unique ID. No new red data can be linked to the
>>> un-linkable data in the green.
>>> 2. The other example is based on rotating hashes. Getting from red
>>> to orange is applying the hash. Getting from orange to green is
>>> breaking the link (un-linkability) by throwing away the salt. No new
>>> red data can be linked to the un-linkable data in the green.
>>> 
>>> 
>>> So I am willing to give up the word unlinkable in the normative 
>>> de-identification text, but in exchange non-normative examples should 
>>> be added.
>>> 
>>> 
>>> <non-normative text>
>>> De-identification can be accomplished by applying a mental model
>>> (red, orange and green data). Red data is raw event-level data (e.g.
>>> log files with unique identifiers), orange is still linkable but
>>> de-identified data, green is unlinkable and therefore anonymous data.
>>> In order to move from red to orange, or from orange to green, one
>>> needs to pass the barriers by processing. There are multiple ways to
>>> do that:
>>> 1. One example is based on concatenating a random number to the
>>> unique ID. This results in a lookup table of unique ID <-> random
>>> number.
>>> Getting from orange to green is breaking the link (un-linkability) by
>>> throwing away the unique ID. No new red data can be linked to the
>>> un-linkable data in the green.
>>> 2. Another example is based on rotating hashes. Getting from red to
>>> orange is applying the hash. Getting from orange to green is breaking
>>> the link (un-linkability) by throwing away the salt. No new red data
>>> can be linked to the un-linkable data in the green.
>>> </non-normative text>
>>> 
>>> Rob
>>> 
>>> Dan Auerbach wrote on 2013-03-13 19:01:
>>>> I also agree that we should just stick with de-identified, just as a
>>>> point of nomenclature. For one, unlike what you propose below, Rob,
>>>> the FTC text actually defines unlinkability in terms of
>>>> de-identification, so I think it would be very confusing if we did the
>>>> opposite here.
>>>> That said, we did NOT agree at the face-to-face that unlinkability
>>>> was a "step beyond de-identified"; we are not at all weakening the
>>>> standard with our word choice. For unlinkability and de-identification
>>>> both, we do NOT propose a holy grail of provably perfect anonymization
>>>> that can't be achieved in practice (or even in theory, really!).
>>>> However, for both we require a significantly higher standard than, for
>>>> example, keeping a pseudonymous data set of browsing history. The
>>>> first non-normative example is intended to make this clear, but I can
>>>> flesh it out if it's not.
>>>> Dan
>>>> On 03/13/2013 10:28 AM, Shane Wiley wrote:
>>>> 
>>>>> Ed,
>>>>> Agreed - reasonably attempting to clear unique identifiers or 
>>>>> information that could lead to unique identification in URLs should 
>>>>> also be included.
>>>>> - Shane
>>>>> FROM: Edward W. Felten [mailto:felten@CS.Princeton.EDU]
>>>>> SENT: Wednesday, March 13, 2013 10:22 AM
>>>>> TO: Justin Brookman
>>>>> CC: <public-tracking@w3.org>
>>>>> SUBJECT: Re: ACTION-371: text defining de-identified data
>>>>> But we should be equally clear that "de-identify" means more than 
>>>>> just removing the most obvious identifiers from the data.
>>>>> On Wed, Mar 13, 2013 at 1:07 PM, Justin Brookman <justin@cdt.org> 
>>>>> wrote:
>>>>> Shane is right that we did choose to use "deidentified" instead of 
>>>>> "unlinkable" at the Cambridge meeting. So I agree we probably 
>>>>> should not use "unlinkable" to define "deidentified" in the 
>>>>> standard. However, I don't see why we need to define "unlinkable" 
>>>>> at all, as it has no operational meaning, and was rejected because 
>>>>> it implied a technological impossibility of relinking, which is not 
>>>>> a standard that can be reasonably achieved.
>>>>> Justin Brookman
>>>>> Director, Consumer Privacy
>>>>> Center for Democracy & Technology
>>>>> tel 202.407.8812 [1]
>>>>> justin@cdt.org
>>>>> http://www.cdt.org [2]
>>>>> @JustinBrookman
>>>>> @CenDemTech
>>>>> On 3/13/2013 11:35 AM, Shane Wiley wrote:
>>>>> Rob,
>>>>> So we're agreed unlinkability requires more processing than 
>>>>> de-identified - good. I would recommend we define de-identified 
>>>>> (nearly done) and unlinkability separately to clearly demonstrate 
>>>>> they are different points within a continuum. We can then focus on 
>>>>> the discussion of retention of data in its de-identified state 
>>>>> prior to moving to the ultimate unlinkable state.
>>>>> - Shane
>>>>> -----Original Message-----
>>>>> From: Rob van Eijk [mailto:rob@blaeu.com]
>>>>> Sent: Wednesday, March 13, 2013 8:28 AM
>>>>> To: Shane Wiley
>>>>> Cc: public-tracking@w3.org
>>>>> Subject: RE: ACTION-371: text defining de-identified data
>>>>> Hi Shane,
>>>>> I hear you and understand your position. But unlinkable and 
>>>>> de-identified are not mutually exclusive. Unlinkable data is a subset
>>>>> of de-identified data (they just go through another step of
>>>>> scrubbing).
>>>>> Adding it to the list is not hurting your position.
>>>>> The key towards the middle ground remains data retention, which 
>>>>> has to be proportionate to the purpose.
>>>>> Rob
>>>>> Shane Wiley wrote on 2013-03-13 16:13:
>>>>> Rob,
>>>>> I thought we had agreed to not mix the "unlinkable" term with
>>>>> "de-identified" here. In our discussions in Boston it appeared there
>>>>> was general agreement that unlinkability is a step beyond
>>>>> de-identified. Once a record has been rendered de-identified, it can
>>>>> later further be made unlinkable (using your definition of unlinkable
>>>>> vs. the one I proposed). This is a significant sticking point for
>>>>> those of us attempting to find middle ground here, so hopefully we can
>>>>> document the details in non-normative text, but I'd ask that we remove
>>>>> mention of unlinkable in the definition of de-identified at this time
>>>>> (or else we've not really moved forward in this discussion in my
>>>>> opinion).
>>>>> - Shane
>>>>> -----Original Message-----
>>>>> From: Rob van Eijk [mailto:rob@blaeu.com]
>>>>> Sent: Wednesday, March 13, 2013 5:57 AM
>>>>> To: public-tracking@w3.org
>>>>> Subject: RE: ACTION-371: text defining de-identified data
>>>>> Dan, Kevin,
>>>>> I would really want the unlinkability in there as well. I propose to
>>>>> add the text: made unlinkable
>>>>> Normative text: Data can be considered sufficiently de-identified to
>>>>> the extent that it has been deleted, made unlinkable, modified,
>>>>> aggregated, anonymized or otherwise manipulated in order to achieve a
>>>>> reasonable level of justified confidence that the data cannot
>>>>> reasonably be used to infer information about, or otherwise be linked
>>>>> to, a particular user, user agent, computer or device.
>>>>> In terms of privacy by design, de-identification through unlinkability
>>>>> is the strongest form of de-identification IMHO.
>>>>> Rob
>>>>> Kevin Kiley wrote on 2013-03-12 19:03:
>>>>> Dan,
>>>>> In case I wasn't being clear in my last post, I (personally) believe
>>>>> that User-agent should *NOT* be removed from the proposed text.
>>>>> I actually don't think it would do any harm to *ADD* the word
>>>>> 'Computer' as well ( which is present in the current FTC definition )
>>>>> so it reads like this…
>>>>> Normative text:
>>>>> Data can be considered sufficiently de-identified to the extent that
>>>>> it has been deleted, modified, aggregated, anonymized or otherwise
>>>>> manipulated in order to achieve a reasonable level of justified
>>>>> confidence that the data cannot reasonably be used to infer
>>>>> information about, or otherwise be linked to, a particular user,
>>>>> user agent, computer or device.
>>>>> I think that covers it pretty well, and *NO* 'clarifying text' is
>>>>> necessary.
>>>>> Just my 2 cents.
>>>>> Kevin Kiley
>>>>> Previous message(s)…
>>>>> Dan,
>>>>> Perhaps you can add text clarifying this perspective or, much like
>>>>> the FTC, suffice with "device" which I believe more than covers what
>>>>> you're looking for here.
>>>>> - Shane
>>>>> From: Dan Auerbach [mailto:dan@eff.org]
>>>>> Sent: Tuesday, March 12, 2013 8:57 AM
>>>>> To: public-tracking@w3.org
>>>>> Subject: Re: ACTION-371: text defining de-identified data
>>>>> Shane and Kevin -- The phrase "user agent" in the text is intended to
>>>>> refer to a particular user agent (not "Chrome 26" but rather "the
>>>>> browser running on Dan's laptop"). I hoped that would be clear from
>>>>> context, but if it's not we can clarify. I may not be able to
>>>>> identify your device per se, but can identify that this is the same
>>>>> browser as I saw before. I think this is the case with using cookies,
>>>>> for example. It seems more accurate to me than lumping it all under
>>>>> "device", and appropriate since the text of our document is elsewhere
>>>>> focused on user agents, unlike the FTC text.
>>>>> Best,
>>>>> Dan
>>>>> On 03/12/2013 12:19 AM, Kevin Kiley wrote:
>>>>> 
>>>>>> Shane Wiley wrote...
>>>>>> I had removed "user agent" in the suggested edit as this could be
>>>>>> something as generic as "Chrome 26".
>>>>> It can also be something VERY specific... and tell you a LOT about
>>>>> the Computer/OS/Device being used.
>>>>> In the case of Mobile... it will pretty much tell you EXACTLY what
>>>>> 'Device' is being used.
>>>>> 
>>>>>> The FTC likewise does not use "user agent" in their definition.
>>>>> That's true... but BOTH definitions (W3C and FTC) currently mention
>>>>> 'Device'... and the FTC reports go to great lengths about how
>>>>> important it is to exclude any knowledge of 'the Device' from the
>>>>> de-identified data ( especially in the case of 'Mobile Devices' ).
>>>>> Kevin Kiley
>>>>> -- Edward W. Felten
>>>>> Professor of Computer Science and Public Affairs
>>>>> Director, Center for Information Technology Policy
>>>>> Princeton University
>>>>> 609-258-5906 http://www.cs.princeton.edu/~felten [3]
>>>> -- Dan Auerbach
>>>> Staff Technologist
>>>> Electronic Frontier Foundation
>>>> dan@eff.org
>>>> 415 436 9333 x134
>>>> 
>>>> Links:
>>>> ------
>>>> [1] tel:202.407.8812
>>>> [2] http://www.cdt.org
>>>> [3] http://www.cs.princeton.edu/%7Efelten
>>> 
>>> 
>>>
Received on Thursday, 14 March 2013 12:09:12 UTC