Re: definition of "unlinkable data" in the Compliance spec from David Wainberg on 2012-09-22 (public-tracking@w3.org from September 2012)

From: David Wainberg <david@networkadvertising.org>
Date: Sat, 22 Sep 2012 15:24:00 -0400
To: Joseph Lorenzo Hall <joe@cdt.org>
CC: Shane Wiley <wileys@yahoo-inc.com>, Ed Felten <ed@felten.com>, "Grimmelmann, James" <James.Grimmelmann@nyls.edu>, "<public-tracking@w3.org>" <public-tracking@w3.org>
Message-ID: <505E1050.10205@networkadvertising.org>
One thing that strikes me as odd about this discussion is the 
possibility we might apply higher standards for information security on 
third party advertising businesses who hold anonymous/pseudonymous 
interest data than we do on other parties who have true PII or other 
more sensitive data. How much effort is it reasonable to require to 
protect against data breach for anonymous/pseudonymous data used for 
online advertising? Or, to put it another way, given the limited value 
of the data, how likely is it to be breached and misused? I agree that 
some effort at protecting it is needed, but as always we need to tune the 
level of security based on the cost and the risk. Absolute security 
seems unwarranted, unreasonable, and infeasible in this case. What Shane 
has been proposing seems very reasonable and feasible for real-world 
application.

On 9/22/12 1:31 PM, Joseph Lorenzo Hall wrote:
> We've seen numerous examples lately of data breaches where salted, 
> hashed data with a small input space have been "unhashed", revealing 
> sensitive information. While you tend to think in terms of IPv6, we 
> need to have some notion of what is good enough (technically or via 
> policy) and when we need to reevaluate this with changes in things 
> that tend to increase linkability (computational power, bankruptcy that 
> might provide incentives for de-silo'ing).
>
> Ed is probably not well described as an advocate, but I get your gist.
>
> best, Joe
>
> -- 
> Joseph Lorenzo Hall
> Senior Staff Technologist
> Center for Democracy & Technology
> https://www.cdt.org/
>
> On Sep 22, 2012, at 12:42, Shane Wiley <wileys@yahoo-inc.com> wrote:
>
>> Ed,
>>
>> Not “easy” if the salt/key is strongly protected and/or 
>> rotated/destroyed on a regular basis.  A dictionary attack requires 
>> either the raw data or access to the salt key – neither of which 
>> should be made easy/possible.  I tend to see the IP Address issue 
>> through the lens of IPv6 these days which further creates barriers to 
>> what you position as “easy to recover”.
>>
>> The advocacy side of the group tends to lean towards absolutist terms 
>> and solutions – the real world isn’t that easy even if it feels that 
>> way in a classroom or a small lab.
>>
>> - Shane
>>
>> *From:*Ed Felten [mailto:ed@felten.com]
>> *Sent:* Saturday, September 22, 2012 5:30 AM
>> *To:* Shane Wiley
>> *Cc:* Grimmelmann, James; <public-tracking@w3.org>
>> *Subject:* Re: definition of "unlinkable data" in the Compliance spec
>>
>> It's easy to recover hashed IP addresses if they're hashed as a whole 
>> (and not per-octet).   A straightforward dictionary attack will work 
>> against all IPv4 addresses.  Even a dumb brute-force search over the 
>> entire 32-bit space is feasible.  IPv6 is a bit more 
>> complicated--some will be recoverable and some won't, depending on 
>> details of address allocation.
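
A minimal sketch of the brute-force search described above, under stated assumptions: the thread names no hash function or encoding, so SHA-256 over the dotted-quad string is assumed, and the salt is assumed to be either absent or known to the attacker. `hash_ip` and `recover_ip` are hypothetical names used only for illustration. The example scans a single /24 so that it finishes instantly.

```python
import hashlib
import ipaddress


def hash_ip(ip: str, salt: bytes = b"") -> str:
    """Hash the whole address string (not per-octet), with an optional salt prefix."""
    return hashlib.sha256(salt + ip.encode("ascii")).hexdigest()


def recover_ip(target_digest: str, network: str, salt: bytes = b""):
    """Hash every address in `network` and return the one that matches, if any."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if hash_ip(candidate, salt) == target_digest:
            return candidate
    return None  # no address in this range (with this salt) produces the digest


# Assumed scenario: the attacker holds a digest and knows (or needs no) salt.
digest = hash_ip("192.0.2.77")
print(recover_ip(digest, "192.0.2.0/24"))
```

Passing "0.0.0.0/0" as `network` runs the same loop over all 2^32 (about 4.3 billion) IPv4 addresses. If the salt is secret and unknown to the attacker, the search as written fails, which is the condition the replies in this thread turn on.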
>>
>> On Fri, Sep 21, 2012 at 1:16 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:
>>
>> Ed,
>>
>> I disagree with the concept of “easy to recover” as I’m not 
>> suggesting hashing the individual octets but rather the entire IP 
>> Address (not a single octet or individualized octet hashing) – 
>> especially as you apply this to IPv6.  With the appropriate level of 
>> access to raw and hashed datasets, the necessary tools, and the 
>> intent, some anonymization schemes can be hacked (dictionary attacks 
>> being the most straightforward).  I don’t believe the goal here is 
>> an absolutist one (aka “complete destruction of identifiers”) and 
>> that is why “commercially reasonable” is the appropriate outcome.
>>
>> - Shane
>>
>> *From:*Ed Felten [mailto:ed@felten.com]
>> *Sent:* Friday, September 21, 2012 10:01 AM
>> *To:* Shane Wiley
>> *Cc:* Grimmelmann, James; <public-tracking@w3.org>
>> *Subject:* Re: definition of "unlinkable data" in the Compliance spec
>>
>> By the way, hashing IP addresses (with or without salting) does not 
>> render them unlinkable.   After hashing, it's easy to recover the 
>> original IP address.  The story is similar for other types of unique 
>> identifiers--there are ways to get to unlinkability, but hashing by 
>> itself won't be enough.
>>
>> On Fri, Sep 21, 2012 at 12:01 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:
>>
>> <Ed - apologies for not getting back to you sooner - I was on 
>> vacation for the past week.>
>>
>> James,
>>
>> I like your approach the best and it was this perspective I was 
>> intending when writing the text that Ed is questioning.
>>
>> The goal is to find the middle-ground between complete destruction of 
>> data and an unlinkable state that still allows for longitudinal 
>> consistency for analytical purposes BUT CANNOT be linked back to a 
>> production system such that the data could be used to modify a single 
>> user's experience.
>>
>> For example, performing a one-way secret hash (salted hash) on 
>> identifiers (Cookie IDs, IP Addresses) and storing the resulting 
>> dataset in a logically/physically separate location from production 
>> data with strict access controls, policies, and employee education 
>> would meet the definition of "unlinkable" I'm aiming for.
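
The "one-way secret hash" described above could be sketched as a keyed hash. This is only an illustration under assumptions: the thread does not name an algorithm, so HMAC-SHA-256 is assumed here, and `new_key` and `pseudonymize` are hypothetical names. It shows the two properties under discussion: the same key yields the same pseudonym for the same identifier (longitudinal consistency), while rotating or destroying the key breaks the link between old and new pseudonyms.

```python
import hashlib
import hmac
import secrets


def new_key() -> bytes:
    """Generate a random 256-bit secret key (assumed to be stored apart from production)."""
    return secrets.token_bytes(32)


def pseudonymize(key: bytes, identifier: str) -> str:
    """Map an identifier (cookie ID, IP address) to a keyed-hash pseudonym."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


key_a = new_key()
first = pseudonymize(key_a, "203.0.113.9")
again = pseudonymize(key_a, "203.0.113.9")
print(first == again)    # same key: records for one identifier stay joinable

key_b = new_key()        # rotation: key_a would be destroyed at this point
rotated = pseudonymize(key_b, "203.0.113.9")
print(first == rotated)  # different key: old and new pseudonyms no longer match
```

Whether this counts as "unlinkable" depends on how well the key is protected: anyone who obtains the key can enumerate candidate identifiers and rebuild the mapping.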
>>
>> - Shane
>>
>>
>> -----Original Message-----
>> From: Grimmelmann, James [mailto:James.Grimmelmann@nyls.edu]
>> Sent: Friday, September 21, 2012 8:14 AM
>> To: Lauren Gelman
>> Cc: Ed Felten; <public-tracking@w3.org>
>> Subject: Re: definition of "unlinkable data" in the Compliance spec
>>
>> I really like Lauren's suggestion.  My only concern is that 
>> "reasonably" and "reasonable" have so many different meanings in 
>> legal settings that it could be ambiguous.  Sometimes an action is 
>> "reasonable" if a person who is ethical and cautious would do it: 
>> it's not reasonable to leave sharp tools lying around in a children's 
>> play area, or to invest a trust fund in marshmallows.  Sometimes it 
>> refers to what a rational non-expert would believe about the subject, 
>> so a court will uphold a jury verdict unless "no reasonable jury" 
>> could have reached the conclusion it did.  Sometimes it's about the 
>> norms and expectations of an industry.  An auction might need to be 
>> conducted in a "commercially reasonable" way, which means for example 
>> giving enough notice that there will be real competitive bidding, but 
>> not spending more than the property is worth.
>>
>> I think this last sense is the most appropriate one in context.  So 
>> perhaps something like "data that cannot be associated with an 
>> identifiable person or user agent through commercially reasonable 
>> means."  That is, the question would be whether a normal business 
>> with normal resources and motivations would consider reidentifying 
>> the data to be feasible.
>>
>> James
>>
>> --------------------------------------------------
>> James Grimmelmann  Professor of Law
>> New York Law School (212) 431-2864
>> 185 West Broadway james.grimmelmann@nyls.edu
>> New York, NY 10013 http://james.grimmelmann.net
>>
>> On Sep 20, 2012, at 7:22 PM, Lauren Gelman <gelman@blurryedge.com> wrote:
>>
>>
>> Unlinkable data is data that cannot reasonably be associated with an 
>> identifiable person or user agent.
>>
>> Lauren Gelman
>> BlurryEdge Strategies
>> 415-627-8512
>>
>> On Sep 18, 2012, at 8:05 AM, Ed Felten wrote:
>>
>> Sorry to repost this, but nobody has answered any of my questions 
>> about Option 1 for the unlinkability definition.
>>
>> Note to proponents of Option 1 (if any): If nobody can explain or 
>> clarify Option 1, that will presumably be used as an argument against 
>> Option 1 when decision time comes.
>>
>> ---------- Forwarded message ----------
>> From: Ed Felten <ed@felten.com>
>> Date: Thu, Sep 13, 2012 at 5:03 PM
>> Subject: definition of "unlinkable data" in the Compliance spec
>> To: "<public-tracking@w3.org>" <public-tracking@w3.org>
>>
>>
>> I have some questions about the Option 1 definition of "Unlinkable 
>> Data", section 3.6.1 in the Compliance spec editor's draft.   The 
>> definition is as follows [fixing typos]:
>>
>> A party renders a dataset unlinkable when it:
>> 1. takes commercially reasonable steps to de-identify data such that 
>> there is confidence that it contains information which could not be 
>> linked to a specific user, user agent, or device in a production 
>> environment [2. and 3. aren't relevant to my questions]
>>
>> I have several questions about what this means.
>> (A) Why does the definition talk about a process of making data 
>> unlinkable, instead of directly defining what it means for data to be 
>> unlinkable?  Some data needs to be processed to make it unlinkable, 
>> but some data is unlinkable from the start.  The definition should 
>> speak to both, even though unlinkable-from-the-start data hasn't gone 
>> through any kind of process.  Suppose FirstCorp collects data X; 
>> SecondCorp collects X+Y but then runs a process that discards Y to 
>> leave it with only X; and ThirdCorp collects X+Y+Z but then minimizes 
>> away Y+Z to end up with X.  Shouldn't these three datasets be treated 
>> the same--because they are the same X--despite having been through 
>> different processes, or no process at all?
>> (B) Why "commercially reasonable" rather than just "reasonable"?  The 
>> term "reasonable" already takes into account all relevant factors. 
>>  Can somebody give an example of something that would qualify as 
>> "commercially reasonable" but not "reasonable", or vice versa?  If 
>> not, "commercially" only makes the definition harder to understand.
>> (C) "there is confidence" seems to raise two questions.  First, who 
>> is it that needs to be confident?  Second, can the confidence be just 
>> an unsupported gut feeling of optimism, or does there need to be some 
>> valid reason for confidence?  Presumably the intent is that the party 
>> holding the data has justified confidence that the data cannot be 
>> linked, but if so it might be better to spell that out.
>> (D) Why "it contains information which could not be linked" rather 
>> than the simpler "it could not be linked"?  Do the extra words add 
>> any meaning?
>> (E) What does "in a production environment" add?  If the goal is to 
>> rule out results demonstrated in a research environment, I doubt this 
>> language would accomplish that goal, because all of the 
>> re-identification research I know of required less than a production 
>> environment.  If the goal is to rule out linking approaches that 
>> aren't at all practical, some other language would probably be better.
>>
>> (I don't have questions about the meaning of Option 2, which 
>> shouldn't be interpreted as a preference for or against Option 2.)
>>
>>
>>
Received on Saturday, 22 September 2012 19:24:29 UTC