- From: Rob van Eijk <rob@blaeu.com>
- Date: Fri, 15 Mar 2013 18:47:15 +0100
- To: Dan Auerbach <dan@eff.org>, public-tracking@w3.org, Shane Wiley <wileys@yahoo-inc.com>
- Message-ID: <1a7e924e-cc41-4111-953c-49213dd964f8@email.android.com>
Dan,

Thanks for the thoughtful reply. I understand now that we are on the
same page. But I doubt that Shane is on that same page as well. If I
understand Shane's position correctly, his view on de-identified does
not come close to the green as I would like it to be. I just want to be
absolutely sure that there is no wiggle-room in what it means to reach
de-identified.

@Shane: what is your view, taking into account the reply from Dan?

Rob

Dan Auerbach <dan@eff.org> wrote:

>My view is that we do NOT need to define a third state of data. We have
>green and red now. If a compelling argument is made that an orange state
>is needed, we can revisit, but I think that existing permitted uses plus
>having a small time frame for processing raw event data are strong
>enough protections to not warrant this third state. Second, regarding
>nomenclature, the FTC definition actually defines unlinkability in terms
>of de-identification, so I think it would be very confusing to stray too
>far from that definitional framework.
>
>A couple further replies inline:
>
>On 03/14/2013 04:09 AM, Justin Brookman wrote:
>> OK, but as I said before, the standard does not currently envision
>> three states of data. As written, all data pertaining to a network
>> communication is in scope, unless it is deidentified,* in which case
>> it is out of scope. You need to propose a third consequence for a new
>> class of data for this to have effect.
>>
>> * Noting that there is still ongoing discussion about what
>> "deidentified" actually means, as evidenced by the recent emails from
>> Ed, Shane, and Dan.
>>
>> Justin Brookman
>> Director, Consumer Privacy
>> Center for Democracy & Technology
>> tel 202.407.8812
>> justin@cdt.org
>> http://www.cdt.org
>> @JustinBrookman
>> @CenDemTech
>>
>> On 3/14/2013 5:39 AM, Rob van Eijk wrote:
>>>
>>> In Boston Shane and I discussed the process of de-identification by
>>> applying it to my mental model (red, orange and green data).
>>> Red data is raw event level data (eg log files with unique
>>> identifiers), orange is still linkable but de-identified data,
>>> green is unlinkable and therefore anonymous data.
>>>
>>> We agreed that in order to move from red to orange, or from orange to
>>> green, one needs to pass the barriers by processing. As seen in the
>>> de-identification workshop there are multiple ways to do that. I
>>> illustrated 2 alternative practices:
>>>
>>> 1. One example is based on concatenating a random number to the
>>> unique ID. This results in a lookup table of unique ID <-> random
>>> number.
>>> Getting from orange to green is breaking the link (un-linkability) by
>>> throwing away the unique ID. No new red data can be linked to the
>>> un-linkable data in the green.
>I think the trouble with this model is the assumption that the unique ID
>will be the only means of identifying someone. If you'll allow me to
>stick with the conceptual framework of a table for simplicity (think
>mysql table or bigtable), I think we should get away from the mentality
>that there are "identifiers" -- fields like udids, cookies, IPs, phone
>numbers etc. Instead, it is more accurate to say that *every* field of a
>data set provides some bits of identifying information.
>
>An "orange" data set as you describe might still be super identifying,
>if, for example, it is a wide table with lots of fields. As a concrete
>example, URLs can be very identifying in some cases, as can timestamps.
>Even data that you describe as "green" could still be identifying, if I
>understand you correctly. In many instances, having events linked by a
>random irreversible identifier (e.g. discarded salt) is simply not
>enough to ensure that information can't be reasonably obtained about
>users. In some cases it might be, but it depends a lot on the nature of
>the rest of the data in the table.
>
>>>
>>> 2. The other example is based on rotating hashes.
>>> Getting from red to orange is applying the hash. Getting from orange
>>> to green is breaking the link (un-linkability) by throwing away the
>>> salt. No new red data can be linked to the un-linkable data in the
>>> green.
>>>
>>> So I am willing to give up the word unlinkable in the normative
>>> de-identification text, but in exchange non-normative examples should
>>> be added.
>I think it's a good suggestion to say that the non-normative examples
>should be fleshed out. But I agree that they should suggest a stronger
>version of "green" than I understand from your mental model above (which
>I hope I'm getting right).
>
>>>
>>> <non-normative text>
>>> De-identification can be accomplished by applying a mental model
>>> (red, orange and green data). Red data is raw event level data (eg
>>> log files with unique identifiers), orange is still linkable but
>>> de-identified data, green is unlinkable and therefore anonymous data.
>>>
>>> In order to move from red to orange, or from orange to green, one
>>> needs to pass the barriers by processing. There are multiple ways to
>>> do that:
>>>
>>> 1. One example is based on concatenating a random number to the
>>> unique ID. This results in a lookup table of unique ID <-> random
>>> number.
>>> Getting from orange to green is breaking the link (un-linkability) by
>>> throwing away the unique ID. No new red data can be linked to the
>>> un-linkable data in the green.
>>>
>>> 2. Another example is based on rotating hashes. Getting from red to
>>> orange is applying the hash. Getting from orange to green is breaking
>>> the link (un-linkability) by throwing away the salt. No new red data
>>> can be linked to the un-linkable data in the green.
>>> </non-normative text>
>>>
>>>
>>> Rob
>>>
>>>
>>> Dan Auerbach wrote on 2013-03-13 19:01:
>>>> I also agree that we should just stick with de-identified, just as a
>>>> point of nomenclature.
>>>> For one, unlike what you propose below, Rob, the FTC text actually
>>>> defines unlinkability in terms of de-identification, so I think it
>>>> would be very confusing if we did the opposite here.
>>>>
>>>> That said, we did NOT agree at the face-to-face that unlinkability
>>>> was a "step beyond de-identified"; we are not at all weakening the
>>>> standard with our word choice. For unlinkability and
>>>> de-identification both, we do NOT propose a holy grail of provably
>>>> perfect anonymization that can't be achieved in practice (or even in
>>>> theory, really!). However, for both we require a significantly
>>>> higher standard than, for example, keeping a pseudonymous data set
>>>> of browsing history. The first non-normative example is intended to
>>>> make this clear, but I can flesh it out if it's not.
>>>>
>>>> Dan
>>>>
>>>> On 03/13/2013 10:28 AM, Shane Wiley wrote:
>>>>
>>>>> Ed,
>>>>>
>>>>> Agreed - reasonably attempting to clear unique identifiers or
>>>>> information that could lead to unique identification in URLs should
>>>>> also be included.
>>>>>
>>>>> - Shane
>>>>>
>>>>> FROM: Edward W. Felten [mailto:felten@CS.Princeton.EDU]
>>>>> SENT: Wednesday, March 13, 2013 10:22 AM
>>>>> TO: Justin Brookman
>>>>> CC: <public-tracking@w3.org>
>>>>> SUBJECT: Re: ACTION-371: text defining de-identified data
>>>>>
>>>>> But we should be equally clear that "de-identify" means more than
>>>>> just removing the most obvious identifiers from the data.
>>>>>
>>>>> On Wed, Mar 13, 2013 at 1:07 PM, Justin Brookman <justin@cdt.org>
>>>>> wrote:
>>>>>
>>>>> Shane is right that we did choose to use "deidentified" instead of
>>>>> "unlinkable" at the Cambridge meeting. So I agree we probably
>>>>> should not use "unlinkable" to define "deidentified" in the
>>>>> standard.
>>>>> However, I don't see why we need to define "unlinkable" at all, as
>>>>> it has no operational meaning, and was rejected because it implied
>>>>> a technological impossibility of relinking, which is not a standard
>>>>> that can be reasonably achieved.
>>>>>
>>>>> Justin Brookman
>>>>> Director, Consumer Privacy
>>>>> Center for Democracy & Technology
>>>>> tel 202.407.8812 [1]
>>>>> justin@cdt.org
>>>>> http://www.cdt.org [2]
>>>>> @JustinBrookman
>>>>> @CenDemTech
>>>>>
>>>>> On 3/13/2013 11:35 AM, Shane Wiley wrote:
>>>>>
>>>>> Rob,
>>>>>
>>>>> So we're agreed unlinkability requires more processing than
>>>>> de-identified - good. I would recommend we define de-identified
>>>>> (nearly done) and unlinkability separately to clearly demonstrate
>>>>> they are different points within a continuum. We can then focus on
>>>>> the discussion of retention of data in its de-identified state
>>>>> prior to moving to the ultimate unlinkable state.
>>>>>
>>>>> - Shane
>>>>>
>>>>> -----Original Message-----
>>>>> From: Rob van Eijk [mailto:rob@blaeu.com]
>>>>> Sent: Wednesday, March 13, 2013 8:28 AM
>>>>> To: Shane Wiley
>>>>> Cc: public-tracking@w3.org
>>>>> Subject: RE: ACTION-371: text defining de-identified data
>>>>>
>>>>> Hi Shane,
>>>>>
>>>>> I hear you and understand your position. But unlinkable and
>>>>> de-identified are not mutually exclusive. Unlinkable data is a
>>>>> subset of de-identified data (they just go through another step of
>>>>> scrubbing).
>>>>> Adding it to the list is not hurting your position.
>>>>>
>>>>> The key towards the middle ground remains data retention, which has
>>>>> to be proportionate to the purpose.
>>>>>
>>>>> Rob
>>>>>
>>>>> Shane Wiley wrote on 2013-03-13 16:13:
>>>>>
>>>>> Rob,
>>>>>
>>>>> I thought we had agreed to not mix the "unlinkable" term with
>>>>> "de-identified" here. In our discussions in Boston it appeared
>>>>> there was general agreement that unlinkability is a step beyond
>>>>> de-identified.
>>>>> Once a record has been rendered de-identified, it can later further
>>>>> be made unlinkable (using your definition of unlinkable vs. the one
>>>>> I proposed). This is a significant sticking point for those of us
>>>>> attempting to find middle-ground here so hopefully we can document
>>>>> the details in non-normative text but I'd ask that we remove
>>>>> mention of unlinkable in the definition of de-identified at this
>>>>> time (or else we've not really moved forward in this discussion in
>>>>> my opinion).
>>>>>
>>>>> - Shane
>>>>>
>>>>> -----Original Message-----
>>>>> From: Rob van Eijk [mailto:rob@blaeu.com]
>>>>> Sent: Wednesday, March 13, 2013 5:57 AM
>>>>> To: public-tracking@w3.org
>>>>> Subject: RE: ACTION-371: text defining de-identified data
>>>>>
>>>>> Dan, Kevin,
>>>>>
>>>>> I would really want the unlinkability in there as well. I propose
>>>>> to add the text: made unlinkable
>>>>>
>>>>> Normative text: Data can be considered sufficiently de-identified
>>>>> to the extent that it has been deleted, made unlinkable, modified,
>>>>> aggregated, anonymized or otherwise manipulated in order to achieve
>>>>> a reasonable level of justified confidence that the data cannot
>>>>> reasonably be used to infer information about, or otherwise be
>>>>> linked to, a particular user, user agent, computer or device.
>>>>>
>>>>> In terms of privacy by design, de-identification through
>>>>> unlinkability is the strongest form of de-identification IMHO.
>>>>>
>>>>> Rob
>>>>>
>>>>> Kevin Kiley wrote on 2013-03-12 19:03:
>>>>>
>>>>> Dan,
>>>>>
>>>>> In case I wasn't being clear in my last post, I (personally)
>>>>> believe that User-agent should *NOT* be removed from the proposed
>>>>> text.
>>>>>
>>>>> I actually don't think it would do any harm to *ADD* the word
>>>>> 'Computer' as well ( which is present in the current FTC
>>>>> definition ) so it reads like this…
>>>>>
>>>>> Normative text:
>>>>>
>>>>> Data can be considered sufficiently de-identified to the extent
>>>>> that it has been deleted, modified, aggregated, anonymized or
>>>>> otherwise manipulated in order to achieve a reasonable level of
>>>>> justified confidence that the data cannot reasonably be used to
>>>>> infer information about, or otherwise be linked to, a particular
>>>>> user, user agent, computer or device.
>>>>>
>>>>> I think that covers it pretty well, and *NO* 'clarifying text' is
>>>>> necessary.
>>>>>
>>>>> Just my 2 cents.
>>>>>
>>>>> Kevin Kiley
>>>>>
>>>>> Previous message(s)…
>>>>>
>>>>> Dan,
>>>>>
>>>>> Perhaps you can add text clarifying this perspective or, much like
>>>>> the FTC, suffice with "device" which I believe more than covers
>>>>> what you're looking for here.
>>>>>
>>>>> - Shane
>>>>>
>>>>> From: Dan Auerbach [mailto:dan@eff.org]
>>>>> Sent: Tuesday, March 12, 2013 8:57 AM
>>>>> To: public-tracking@w3.org
>>>>> Subject: Re: ACTION-371: text defining de-identified data
>>>>>
>>>>> Shane and Kevin -- The phrase "user agent" in the text is intended
>>>>> to refer to a particular user agent (not "Chrome 26" but rather
>>>>> "the browser running on Dan's laptop"). I hoped that would be clear
>>>>> from context, but if it's not we can clarify. I may not be able to
>>>>> identify your device per se, but can identify that this is the same
>>>>> browser as I saw before. I think this is the case with using
>>>>> cookies, for example. It seems more accurate to me than lumping it
>>>>> all under "device", and appropriate since the text of our document
>>>>> is elsewhere focused on user agents, unlike the FTC text.
>>>>>
>>>>> Best,
>>>>>
>>>>> Dan
>>>>>
>>>>> On 03/12/2013 12:19 AM, Kevin Kiley wrote:
>>>>>
>>>>>> Shane Wiley wrote...
>>>>>> I had removed "user agent" in the suggested edit as this could be
>>>>>> something as generic as "Chrome 26".
>>>>>
>>>>> It can also be something VERY specific... and tell you a LOT about
>>>>> the Computer/OS/Device being used.
>>>>>
>>>>> In the case of Mobile... it will pretty much tell you EXACTLY what
>>>>> 'Device' is being used.
>>>>>
>>>>>> The FTC likewise does not use "user agent" in their definition.
>>>>>
>>>>> That's true... but BOTH definitions (W3C and FTC) currently mention
>>>>> 'Device'... and the FTC reports go to great lengths about how
>>>>> important it is to exclude any knowledge of 'the Device' from the
>>>>> de-identified data ( especially in the case of 'Mobile Devices' ).
>>>>>
>>>>> Kevin Kiley
>>>>>
>>>>> --
>>>>> Edward W. Felten
>>>>> Professor of Computer Science and Public Affairs
>>>>> Director, Center for Information Technology Policy
>>>>> Princeton University
>>>>> 609-258-5906 http://www.cs.princeton.edu/~felten [3]
>>>>
>>>> --
>>>> Dan Auerbach
>>>> Staff Technologist
>>>> Electronic Frontier Foundation
>>>> dan@eff.org
>>>> 415 436 9333 x134
>>>>
>>>>
>>>> Links:
>>>> ------
>>>> [1] tel:202.407.8812
>>>> [2] http://www.cdt.org
>>>> [3] http://www.cs.princeton.edu/%7Efelten
>>>
>>
>
>--
>Dan Auerbach
>Staff Technologist
>Electronic Frontier Foundation
>dan@eff.org
>415 436 9333 x134
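
For readers following the "rotating hashes" example in the thread, here is a minimal sketch of what it might look like in practice. It is not taken from any of the messages: it assumes HMAC-SHA256 as the keyed hash and a random per-period salt, and the class name `RotatingPseudonymizer` and the sample ID `cookie-123` are hypothetical.

```python
import hashlib
import hmac
import secrets


class RotatingPseudonymizer:
    """Hypothetical helper: maps unique IDs to pseudonyms using a per-period salt."""

    def __init__(self) -> None:
        # Secret salt for the current period (assumed 32 random bytes).
        self._salt = secrets.token_bytes(32)

    def pseudonym(self, unique_id: str) -> str:
        # "Red to orange": replace the raw ID with a keyed hash of it.
        return hmac.new(
            self._salt, unique_id.encode("utf-8"), hashlib.sha256
        ).hexdigest()

    def rotate(self) -> None:
        # "Orange to green": overwrite (throw away) the old salt, so new raw
        # events can no longer be mapped onto pseudonyms from earlier periods.
        self._salt = secrets.token_bytes(32)


p = RotatingPseudonymizer()
a1 = p.pseudonym("cookie-123")
a2 = p.pseudonym("cookie-123")
p.rotate()
b1 = p.pseudonym("cookie-123")

print(a1 == a2)  # within one period the same ID yields the same pseudonym
print(a1 == b1)  # after rotation the pseudonym for the same ID changes
```

This only covers the identifier column. As Dan points out above, the remaining fields (URLs, timestamps and so on) can still carry identifying information after the salt is discarded, so a step like this may be necessary without being sufficient.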
Received on Friday, 15 March 2013 17:47:54 UTC