
Re: definition of "unlinkable data" in the Compliance spec

From: Ed Felten <ed@felten.com>
Date: Fri, 21 Sep 2012 13:00:32 -0400
Message-ID: <CANZBoGgfmHuyJhzK4zGdegJrXBUQQ34=qdupKXk8SArPtNaUuA@mail.gmail.com>
To: Shane Wiley <wileys@yahoo-inc.com>
Cc: "Grimmelmann, James" <James.Grimmelmann@nyls.edu>, "<public-tracking@w3.org>" <public-tracking@w3.org>
By the way, hashing IP addresses (with or without salting) does not render
them unlinkable. After hashing, it's easy to recover the original IP
address. The story is similar for other types of unique identifiers--there
are ways to get to unlinkability, but hashing by itself won't be enough.
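
A minimal sketch of why, under stated assumptions: the hash is taken to be
SHA-256 over salt + address, and the salt is assumed to be available to
whoever attempts the recovery (as it would be for the party that applied
it). The function names and the sample address are hypothetical. Because
IPv4 has only 2^32 possible addresses, one can simply hash every candidate
and compare; the example searches a single /16 to keep the run short.

```python
import hashlib
import ipaddress
from typing import Optional


def hash_ip(ip: str, salt: bytes = b"") -> str:
    """Salted SHA-256 of an IP address string (illustrative scheme only)."""
    return hashlib.sha256(salt + ip.encode("ascii")).hexdigest()


def recover_ip(target_hash: str, network: str, salt: bytes = b"") -> Optional[str]:
    """Hash every address in `network` and return the one that matches."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if hash_ip(candidate, salt) == target_hash:
            return candidate
    return None


# Assumed setup: the data holder stored only the salted hash of an address.
salt = b"example-salt"
stored = hash_ip("192.0.2.57", salt)

# Exhaustive search over 65,536 candidates recovers the original address.
recovered = recover_ip(stored, "192.0.0.0/16", salt)
print(recovered)
```

The same loop over the full IPv4 space is about 4.3 billion hash
computations, so restricting the search to a /16 here is only a convenience
and not something the approach depends on.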

On Fri, Sep 21, 2012 at 12:01 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:

> <Ed - apologies for not getting back to you sooner - I was on vacation for
> the past week.>
> James,
> I like your approach the best and it was this perspective I was intending
> when writing the text that Ed is questioning.
> The goal is to find the middle-ground between complete destruction of data
> and an unlinkable state that still allows for longitudinal consistency for
> analytical purposes BUT CANNOT be linked back to a production system such
> that the data could be used to modify a single user's experience.
> For example, performing a one-way secret hash (salted hash) on identifiers
> (Cookie IDs, IP Addresses) and storing the resulting dataset in a
> logically/physically separate location from production data with strict
> access controls, policies, and employee education would meet the definition
> of "unlinkable" I'm aiming for.
> - Shane
> -----Original Message-----
> From: Grimmelmann, James [mailto:James.Grimmelmann@nyls.edu]
> Sent: Friday, September 21, 2012 8:14 AM
> To: Lauren Gelman
> Cc: Ed Felten; <public-tracking@w3.org>
> Subject: Re: definition of "unlinkable data" in the Compliance spec
> I really like Lauren's suggestion.  My only concern is that "reasonably"
> and "reasonable" have so many different meanings in legal settings that it
> could be ambiguous.  Sometimes an action is "reasonable" if a person who is
> ethical and cautious would do it: it's not reasonable to leave sharp tools
> lying around in a children's play area, or to invest a trust fund in
> marshmallows.  Sometimes it refers to what a rational non-expert would
> believe about the subject, so a court will uphold a jury verdict unless "no
> reasonable jury" could have reached the conclusion it did.  Sometimes it's
> about the norms and expectations of an industry.  An auction might need to
> be conducted in a "commercially reasonable" way, which means for example
> giving enough notice that there will be real competitive bidding, but not
> spending more than the property is worth.
> I think this last sense is the most appropriate one in context.  So
> perhaps something like "data that cannot be associated with an identifiable
> person or user agent through commercially reasonable means."  That is, the
> question would be whether a normal business with normal resources and
> motivations would consider reidentifying the data to be feasible.
> James
> --------------------------------------------------
> James Grimmelmann              Professor of Law
> New York Law School                 (212) 431-2864
> 185 West Broadway       james.grimmelmann@nyls.edu
> New York, NY 10013    http://james.grimmelmann.net
> On Sep 20, 2012, at 7:22 PM, Lauren Gelman <gelman@blurryedge.com> wrote:
> Unlinkable data is data that cannot reasonably be associated with an
> identifiable person or user agent.
> Lauren Gelman
> BlurryEdge Strategies
> 415-627-8512
> On Sep 18, 2012, at 8:05 AM, Ed Felten wrote:
> Sorry to repost this, but nobody has answered any of my questions about
> Option 1 for the unlinkability definition.
> Note to proponents of Option 1 (if any): If nobody can explain or clarify
> Option 1, that will presumably be used as an argument against Option 1 when
> decision time comes.
> ---------- Forwarded message ----------
> From: Ed Felten <ed@felten.com>
> Date: Thu, Sep 13, 2012 at 5:03 PM
> Subject: definition of "unlinkable data" in the Compliance spec
> To: "<public-tracking@w3.org>" <public-tracking@w3.org>
> I have some questions about the Option 1 definition of "Unlinkable Data",
> section 3.6.1 in the Compliance spec editor's draft.   The definition is as
> follows [fixing typos]:
> A party renders a dataset unlinkable when it:
> 1. takes commercially reasonable steps to de-identify data such that there
> is confidence that it contains information which could not be linked to a
> specific user, user agent, or device in a production environment [2. and 3.
> aren't relevant to my questions]
> I have several questions about what this means.
> (A) Why does the definition talk about a process of making data
> unlinkable, instead of directly defining what it means for data to be
> unlinkable?  Some data needs to be processed to make it unlinkable, but
> some data is unlinkable from the start.  The definition should speak to
> both, even though unlinkable-from-the-start data hasn't gone through any
> kind of process.  Suppose FirstCorp collects data X; SecondCorp collects
> X+Y but then runs a process that discards Y to leave it with only X; and
> ThirdCorp collects X+Y+Z but then minimizes away Y+Z to end up with X.
> Shouldn't these three datasets be treated the same--because they are the
> same X--despite having been through different processes, or no process at
> all?
> (B) Why "commercially reasonable" rather than just "reasonable"?  The term
> "reasonable" already takes into account all relevant factors.  Can somebody
> give an example of something that would qualify as "commercially
> reasonable" but not "reasonable", or vice versa?  If not, "commercially"
> only makes the definition harder to understand.
> (C) "there is confidence" seems to raise two questions.  First, who is it
> that needs to be confident?  Second, can the confidence be just an
> unsupported gut feeling of optimism, or does there need to be some valid
> reason for confidence?  Presumably the intent is that the party holding the
> data has justified confidence that the data cannot be linked, but if so it
> might be better to spell that out.
> (D) Why "it contains information which could not be linked" rather than
> the simpler "it could not be linked"?  Do the extra words add any meaning?
> (E) What does "in a production environment" add?  If the goal is to rule
> out results demonstrated in a research environment, I doubt this language
> would accomplish that goal, because all of the re-identification research I
> know of required less than a production environment.  If the goal is to
> rule out linking approaches that aren't at all practical, some other
> language would probably be better.
> (I don't have questions about the meaning of Option 2, which shouldn't be
> interpreted as a preference for or against Option 2.)
Received on Friday, 21 September 2012 17:01:17 UTC
