Re: definition of "unlinkable data" in the Compliance spec from Ed Felten on 2012-09-22 (public-tracking@w3.org from September 2012)

From: Ed Felten <ed@felten.com>
Date: Sat, 22 Sep 2012 15:53:19 -0400
To: Shane Wiley <wileys@yahoo-inc.com>
Cc: "Grimmelmann, James" <James.Grimmelmann@nyls.edu>, "<public-tracking@w3.org>" <public-tracking@w3.org>
Message-ID: <CANZBoGhgW_3h0r=Ub-e+D0Gh=20RHfkF6t4ogDktWK4ouJw1rQ@mail.gmail.com>
Reversing salted IP hashes requires 9 lines of code.

import copy
import hashlib

def reverseIpHash(salt, target):
   # salt: bytearray; target: SHA-1 digest of salt + 4-byte IPv4 address
   for ip in range(256*256*256*256):  # all 2**32 addresses (xrange on Python 2)
       trialIp = bytearray([(ip>>24)&0xff,(ip>>16)&0xff,(ip>>8)&0xff,ip&0xff])
       trialString = copy.copy(salt)
       trialString.extend(trialIp)
       if hashlib.sha1(trialString).digest()==target:
           return ip
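
A scaled-down variant of the same search can be run in well under a second. This is only a sketch under stated assumptions: the attacker already knows the salt (the premise under dispute in this thread), the hash is SHA-1 over the salt followed by the 4-byte big-endian address, and the search is narrowed to one /16 network so it finishes quickly. The salt value, the sample address, and the helper names salted_ip_hash and reverse_in_network are hypothetical and chosen for illustration.

```python
import hashlib
import ipaddress


def salted_ip_hash(salt: bytes, ip: int) -> bytes:
    """SHA-1 of the salt followed by the 4-byte big-endian IPv4 address."""
    return hashlib.sha1(salt + ip.to_bytes(4, "big")).digest()


def reverse_in_network(salt: bytes, target: bytes, cidr: str):
    """Try every address in the given network; return the match or None."""
    net = ipaddress.ip_network(cidr)
    first = int(net.network_address)
    for ip in range(first, first + net.num_addresses):
        if salted_ip_hash(salt, ip) == target:
            return ipaddress.ip_address(ip)
    return None


# Hypothetical inputs: a made-up salt and an address from a documentation range.
salt = b"example-salt"
secret = int(ipaddress.ip_address("192.0.2.77"))
target = salted_ip_hash(salt, secret)

# Search only 192.0.0.0/16 (65,536 candidates) instead of all of IPv4.
recovered = reverse_in_network(salt, target, "192.0.0.0/16")
print(recovered)  # 192.0.2.77
```

The full search covers 2**32 = 4,294,967,296 candidates rather than 65,536, so it takes correspondingly longer, but the loop is otherwise the same.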


On Sat, Sep 22, 2012 at 12:42 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:

> Ed,
>
> Not “easy” if the salt/key is strongly protected and/or rotated/destroyed
> on a regular basis.  A dictionary attack requires either the raw data or
> access to the salt key – neither of which should be made easy/possible.  I
> tend to see the IP Address issue through the lens of IPv6 these days which
> further creates barriers to what you position as “easy to recover”.
>
> The advocacy side of the group tends to lean towards absolutist terms and
> solutions – the real world isn’t that easy even if it feels that way in a
> classroom or a small lab.
>
> - Shane
>
> *From:* Ed Felten [mailto:ed@felten.com]
> *Sent:* Saturday, September 22, 2012 5:30 AM
> *To:* Shane Wiley
> *Cc:* Grimmelmann, James; <public-tracking@w3.org>
> *Subject:* Re: definition of "unlinkable data" in the Compliance spec
>
> It's easy to recover hashed IP addresses if they're hashed as a whole (and
> not per-octet).  A straightforward dictionary attack will work against
> all IPv4 addresses.  Even a dumb brute-force search over the entire 32-bit
> space is feasible.  IPv6 is a bit more complicated--some will be
> recoverable and some won't, depending on details of address allocation.
>
> On Fri, Sep 21, 2012 at 1:16 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:
>
> Ed,
>
> I disagree with the concept of “easy to recover” as I’m not suggesting
> hashing the individual octets but rather the entire IP Address (not a
> single octet or individualized octet hashing) – especially as you apply
> this to IPv6.  With the appropriate level of access to raw and hashed
> datasets, the necessary tools, and the intent, some anonymization schemes
> can be hacked (dictionary attacks being the most straightforward).  I
> don’t believe the goal here is an absolutist one (aka “complete
> destruction of identifiers”) and that is why “commercially reasonable” is
> the appropriate outcome.
>
> - Shane
>
> *From:* Ed Felten [mailto:ed@felten.com]
> *Sent:* Friday, September 21, 2012 10:01 AM
> *To:* Shane Wiley
> *Cc:* Grimmelmann, James; <public-tracking@w3.org>
> *Subject:* Re: definition of "unlinkable data" in the Compliance spec
>
> By the way, hashing IP addresses (with or without salting) does not render
> them unlinkable.  After hashing, it's easy to recover the original IP
> address.  The story is similar for other types of unique identifiers--there
> are ways to get to unlinkability, but hashing by itself won't be enough.
>
> On Fri, Sep 21, 2012 at 12:01 PM, Shane Wiley <wileys@yahoo-inc.com>
> wrote:
>
> <Ed - apologies for not getting back to you sooner - I was on vacation for
> the past week.>
>
> James,
>
> I like your approach the best and it was this perspective I was intending
> when writing the text that Ed is questioning.
>
> The goal is to find the middle-ground between complete destruction of data
> and an unlinkable state that still allows for longitudinal consistency for
> analytical purposes BUT CANNOT be linked back to a production system such
> that the data could be used to modify a single user's experience.
>
> For example, performing a one-way secret hash (salted hash) on identifiers
> (Cookie IDs, IP Addresses) and storing the resulting dataset in a
> logically/physically separate location from production data with strict
> access controls, policies, and employee education would meet the definition
> of "unlinkable" I'm aiming for.
>
> - Shane
>
>
> -----Original Message-----
> From: Grimmelmann, James [mailto:James.Grimmelmann@nyls.edu]
> Sent: Friday, September 21, 2012 8:14 AM
> To: Lauren Gelman
> Cc: Ed Felten; <public-tracking@w3.org>
> Subject: Re: definition of "unlinkable data" in the Compliance spec
>
> I really like Lauren's suggestion.  My only concern is that "reasonably"
> and "reasonable" have so many different meanings in legal settings that it
> could be ambiguous.  Sometimes an action is "reasonable" if a person who is
> ethical and cautious would do it: it's not reasonable to leave sharp tools
> lying around in a children's play area, or to invest a trust fund in
> marshmallows.  Sometimes it refers to what a rational non-expert would
> believe about the subject, so a court will uphold a jury verdict unless "no
> reasonable jury" could have reached the conclusion it did.  Sometimes it's
> about the norms and expectations of an industry.  An auction might need to
> be conducted in a "commercially reasonable" way, which means for example
> giving enough notice that there will be real competitive bidding, but not
> spending more than the property is worth.
>
> I think this last sense is the most appropriate one in context.  So
> perhaps something like "data that cannot be associated with an identifiable
> person or user agent through commercially reasonable means."  That is, the
> question would be whether a normal business with normal resources and
> motivations would consider reidentifying the data to be feasible.
>
> James
>
> --------------------------------------------------
> James Grimmelmann              Professor of Law
> New York Law School                 (212) 431-2864
> 185 West Broadway       james.grimmelmann@nyls.edu
> New York, NY 10013    http://james.grimmelmann.net
>
> On Sep 20, 2012, at 7:22 PM, Lauren Gelman <gelman@blurryedge.com> wrote:
>
>
> Unlinkable data is data that cannot reasonably be associated with an
> identifiable person or user agent.
>
> Lauren Gelman
> BlurryEdge Strategies
> 415-627-8512
>
> On Sep 18, 2012, at 8:05 AM, Ed Felten wrote:
>
> Sorry to repost this, but nobody has answered any of my questions about
> Option 1 for the unlinkability definition.
>
> Note to proponents of Option 1 (if any): If nobody can explain or clarify
> Option 1, that will presumably be used as an argument against Option 1 when
> decision time comes.
>
> ---------- Forwarded message ----------
> From: Ed Felten <ed@felten.com>
> Date: Thu, Sep 13, 2012 at 5:03 PM
> Subject: definition of "unlinkable data" in the Compliance spec
> To: "<public-tracking@w3.org>" <public-tracking@w3.org>
>
>
> I have some questions about the Option 1 definition of "Unlinkable Data",
> section 3.6.1 in the Compliance spec editor's draft.   The definition is as
> follows [fixing typos]:
>
> A party renders a dataset unlinkable when it:
> 1. takes commercially reasonable steps to de-identify data such that there
> is confidence that it contains information which could not be linked to a
> specific user, user agent, or device in a production environment [2. and 3.
> aren't relevant to my questions]
>
> I have several questions about what this means.
> (A) Why does the definition talk about a process of making data
> unlinkable, instead of directly defining what it means for data to be
> unlinkable?  Some data needs to be processed to make it unlinkable, but
> some data is unlinkable from the start.  The definition should speak to
> both, even though unlinkable-from-the-start data hasn't gone through any
> kind of process.  Suppose FirstCorp collects data X; SecondCorp collects
> X+Y but then runs a process that discards Y to leave it with only X; and
> ThirdCorp collects X+Y+Z but then minimizes away Y+Z to end up with X.
>  Shouldn't these three datasets be treated the same--because they are the
> same X--despite having been through different processes, or no process at
> all?
> (B) Why "commercially reasonable" rather than just "reasonable"?  The term
> "reasonable" already takes into account all relevant factors.  Can somebody
> give an example of something that would qualify as "commercially
> reasonable" but not "reasonable", or vice versa?  If not, "commercially"
> only makes the definition harder to understand.
> (C) "there is confidence" seems to raise two questions.  First, who is it
> that needs to be confident?  Second, can the confidence be just an
> unsupported gut feeling of optimism, or does there need to be some valid
> reason for confidence?  Presumably the intent is that the party holding the
> data has justified confidence that the data cannot be linked, but if so it
> might be better to spell that out.
> (D) Why "it contains information which could not be linked" rather than
> the simpler "it could not be linked"?  Do the extra words add any meaning?
> (E) What does "in a production environment" add?  If the goal is to rule
> out results demonstrated in a research environment, I doubt this language
> would accomplish that goal, because all of the re-identification research I
> know of required less than a production environment.  If the goal is to
> rule out linking approaches that aren't at all practical, some other
> language would probably be better.
>
> (I don't have questions about the meaning of Option 2, which shouldn't be
> interpreted as a preference for or against Option 2.)
>
Received on Saturday, 22 September 2012 19:54:05 UTC