- From: Ed Felten <ed@felten.com>
- Date: Sat, 22 Sep 2012 15:53:19 -0400
- To: Shane Wiley <wileys@yahoo-inc.com>
- Cc: "Grimmelmann, James" <James.Grimmelmann@nyls.edu>, "<public-tracking@w3.org>" <public-tracking@w3.org>
- Message-ID: <CANZBoGhgW_3h0r=Ub-e+D0Gh=20RHfkF6t4ogDktWK4ouJw1rQ@mail.gmail.com>
Reversing salted IP hashes requires 9 lines of code.

import copy     # imports the function relies on (not counted in the 9 lines)
import hashlib

def reverseIpHash(salt, target):
    # salt: bytearray; target: raw SHA-1 digest of salt + 4 address bytes
    trialString = copy.copy(salt)
    trialString.extend(bytearray(4))
    for ip in range(256*256*256*256):
        trialIp = bytearray([(ip>>24)&0xff,(ip>>16)&0xff,(ip>>8)&0xff,ip&0xff])
        trialString = copy.copy(salt)
        trialString.extend(trialIp)
        if hashlib.sha1(trialString).digest()==target:
            return ip

On Sat, Sep 22, 2012 at 12:42 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:

> Ed,
>
> Not “easy” if the salt/key is strongly protected and/or rotated/destroyed
> on a regular basis. A dictionary attack requires either the raw data or
> access to the salt key – neither of which should be made easy/possible. I
> tend to see the IP Address issue through the lens of IPv6 these days which
> further creates barriers to what you position as “easy to recover”.
>
> The advocacy side of the group tends to lean towards absolutist terms and
> solutions – the real world isn’t that easy even if it feels that way in a
> classroom or a small lab.
>
> - Shane
>
> *From:* Ed Felten [mailto:ed@felten.com]
> *Sent:* Saturday, September 22, 2012 5:30 AM
> *To:* Shane Wiley
> *Cc:* Grimmelmann, James; <public-tracking@w3.org>
> *Subject:* Re: definition of "unlinkable data" in the Compliance spec
>
> It's easy to recover hashed IP addresses if they're hashed as a whole (and
> not per-octet). A straightforward dictionary attack will work against
> all IPv4 addresses. Even a dumb brute-force search over the entire 32-bit
> space is feasible.
> IPv6 is a bit more complicated--some will be
> recoverable and some won't, depending on details of address allocation.
>
> On Fri, Sep 21, 2012 at 1:16 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:
>
> Ed,
>
> I disagree with the concept of “easy to recover” as I’m not suggesting
> hashing the individual octets but rather the entire IP Address (not a
> single octet or individualized octet hashing) – especially as you apply
> this to IPv6. With the appropriate level of access to raw and hashed
> datasets, the necessary tools, and the intent, some anonymization schemes
> can be hacked (dictionary attacks being the most straightforward). I
> don’t believe the goal here is an absolutist one (aka “complete
> destruction of identifiers”) and that is why “commercially reasonable” is
> the appropriate outcome.
>
> - Shane
>
> *From:* Ed Felten [mailto:ed@felten.com]
> *Sent:* Friday, September 21, 2012 10:01 AM
> *To:* Shane Wiley
> *Cc:* Grimmelmann, James; <public-tracking@w3.org>
> *Subject:* Re: definition of "unlinkable data" in the Compliance spec
>
> By the way, hashing IP addresses (with or without salting) does not render
> them unlinkable. After hashing, it's easy to recover the original IP
> address. The story is similar for other types of unique identifiers--there
> are ways to get to unlinkability, but hashing by itself won't be enough.
>
> On Fri, Sep 21, 2012 at 12:01 PM, Shane Wiley <wileys@yahoo-inc.com>
> wrote:
>
> <Ed - apologies for not getting back to you sooner - I was on vacation for
> the past week.>
>
> James,
>
> I like your approach the best and it was this perspective I was intending
> when writing the text that Ed is questioning.
>
> The goal is to find the middle-ground between complete destruction of data
> and an unlinkable state that still allows for longitudinal consistency for
> analytical purposes BUT CANNOT be linked back to a production system such
> that the data could be used to modify a single user's experience.
>
> For example, performing a one-way secret hash (salted hash) on identifiers
> (Cookie IDs, IP Addresses) and storing the resulting dataset in a
> logically/physically separate location from production data with strict
> access controls, policies, and employee education would meet the definition
> of "unlinkable" I'm aiming for.
>
> - Shane
>
> -----Original Message-----
> From: Grimmelmann, James [mailto:James.Grimmelmann@nyls.edu]
> Sent: Friday, September 21, 2012 8:14 AM
> To: Lauren Gelman
> Cc: Ed Felten; <public-tracking@w3.org>
> Subject: Re: definition of "unlinkable data" in the Compliance spec
>
> I really like Lauren's suggestion. My only concern is that "reasonably"
> and "reasonable" have so many different meanings in legal settings that it
> could be ambiguous. Sometimes an action is "reasonable" if a person who is
> ethical and cautious would do it: it's not reasonable to leave sharp tools
> lying around in a children's play area, or to invest a trust fund in
> marshmallows. Sometimes it refers to what a rational non-expert would
> believe about the subject, so a court will uphold a jury verdict unless "no
> reasonable jury" could have reached the conclusion it did. Sometimes it's
> about the norms and expectations of an industry. An auction might need to
> be conducted in a "commercially reasonable" way, which means for example
> giving enough notice that there will be real competitive bidding, but not
> spending more than the property is worth.
>
> I think this last sense is the most appropriate one in context.
> So perhaps something like "data that cannot be associated with an
> identifiable person or user agent through commercially reasonable means."
> That is, the question would be whether a normal business with normal
> resources and motivations would consider reidentifying the data to be
> feasible.
>
> James
>
> --------------------------------------------------
> James Grimmelmann              Professor of Law
> New York Law School            (212) 431-2864
> 185 West Broadway              james.grimmelmann@nyls.edu
> New York, NY 10013             http://james.grimmelmann.net
>
> On Sep 20, 2012, at 7:22 PM, Lauren Gelman <gelman@blurryedge.com> wrote:
>
> Unlinkable data is data that cannot reasonably be associated with an
> identifiable person or user agent.
>
> Lauren Gelman
> BlurryEdge Strategies
> 415-627-8512
>
> On Sep 18, 2012, at 8:05 AM, Ed Felten wrote:
>
> Sorry to repost this, but nobody has answered any of my questions about
> Option 1 for the unlinkability definition.
>
> Note to proponents of Option 1 (if any): If nobody can explain or clarify
> Option 1, that will presumably be used as an argument against Option 1 when
> decision time comes.
>
> ---------- Forwarded message ----------
> From: Ed Felten <ed@felten.com>
> Date: Thu, Sep 13, 2012 at 5:03 PM
> Subject: definition of "unlinkable data" in the Compliance spec
> To: "<public-tracking@w3.org>" <public-tracking@w3.org>
>
> I have some questions about the Option 1 definition of "Unlinkable Data",
> section 3.6.1 in the Compliance spec editor's draft. The definition is as
> follows [fixing typos]:
>
> A party renders a dataset unlinkable when it:
> 1. takes commercially reasonable steps to de-identify data such that there
> is confidence that it contains information which could not be linked to a
> specific user, user agent, or device in a production environment [2. and 3.
> aren't relevant to my questions]
>
> I have several questions about what this means.
> (A) Why does the definition talk about a process of making data
> unlinkable, instead of directly defining what it means for data to be
> unlinkable? Some data needs to be processed to make it unlinkable, but
> some data is unlinkable from the start. The definition should speak to
> both, even though unlinkable-from-the-start data hasn't gone through any
> kind of process. Suppose FirstCorp collects data X; SecondCorp collects
> X+Y but then runs a process that discards Y to leave it with only X; and
> ThirdCorp collects X+Y+Z but then minimizes away Y+Z to end up with X.
> Shouldn't these three datasets be treated the same--because they are the
> same X--despite having been through different processes, or no process at
> all?
> (B) Why "commercially reasonable" rather than just "reasonable"? The term
> "reasonable" already takes into account all relevant factors. Can somebody
> give an example of something that would qualify as "commercially
> reasonable" but not "reasonable", or vice versa? If not, "commercially"
> only makes the definition harder to understand.
> (C) "there is confidence" seems to raise two questions. First, who is it
> that needs to be confident? Second, can the confidence be just an
> unsupported gut feeling of optimism, or does there need to be some valid
> reason for confidence? Presumably the intent is that the party holding the
> data has justified confidence that the data cannot be linked, but if so it
> might be better to spell that out.
> (D) Why "it contains information which could not be linked" rather than
> the simpler "it could not be linked"? Do the extra words add any meaning?
> (E) What does "in a production environment" add?
> If the goal is to rule out results demonstrated in a research
> environment, I doubt this language would accomplish that goal, because
> all of the re-identification research I know of required less than a
> production environment. If the goal is to rule out linking approaches
> that aren't at all practical, some other language would probably be
> better.
>
> (I don't have questions about the meaning of Option 2; which shouldn't be
> interpreted as a preference for or against Option 2.)
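
A scaled-down sketch of the brute-force search from the top of this message may help show what the disagreement is about. It is not code from the thread: the helper names `hash_ip` and `reverse_ip_hash`, the salt value, and the test address are hypothetical, and the search is limited to one small network so that it finishes instantly. It assumes the same scheme as the original snippet (SHA-1 over the salt followed by the four address bytes) and, like that snippet, assumes the attacker already knows the salt.

```python
import hashlib
import ipaddress


def hash_ip(salt: bytes, ip: int) -> bytes:
    """SHA-1 over the salt followed by the 4 big-endian address bytes."""
    return hashlib.sha1(salt + ip.to_bytes(4, "big")).digest()


def reverse_ip_hash(salt: bytes, target: bytes, network: str = "0.0.0.0/0"):
    """Try every IPv4 address in `network`; return the match or None."""
    net = ipaddress.ip_network(network)
    first = int(net.network_address)
    for ip in range(first, first + net.num_addresses):
        if hash_ip(salt, ip) == target:
            return str(ipaddress.ip_address(ip))
    return None


# Hypothetical salt and address, chosen only for this demonstration.
salt = b"example-salt"
digest = hash_ip(salt, int(ipaddress.ip_address("192.0.2.77")))

# Searching a /24 means at most 256 trial hashes.
print(reverse_ip_hash(salt, digest, "192.0.2.0/24"))  # prints 192.0.2.77
```

Leaving `network` at its default walks all 2^32 IPv4 addresses, which is the full search the message describes. The sketch only works because the salt is passed in, so it does not address the case raised in the quoted reply where the salt is protected, rotated, or destroyed.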
Received on Saturday, 22 September 2012 19:54:05 UTC