Re: definition of "unlinkable data" in the Compliance spec from Jeffrey Chester on 2012-09-23 (public-tracking@w3.org from September 2012)

From: Jeffrey Chester <jeff@democraticmedia.org>
Date: Sun, 23 Sep 2012 16:01:33 -0400
To: Alan Chapell <achapell@chapellassociates.com>
Cc: Shane Wiley <wileys@yahoo-inc.com>, Ed Felten <ed@felten.com>, "Grimmelmann, James" <James.Grimmelmann@nyls.edu>, "<public-tracking@w3.org>" <public-tracking@w3.org>
Message-id: <4A677205-5F39-4437-AFC1-C781C34EE117@democraticmedia.org>
I appreciate this and agree we need to not talk past each other.  We do need to have the transparency discussion (let's not debate the role of online lead gen in all this now).  As for the harms, let's all agree we disagree.  The disagreement reflects why we had to work on DNT. But I really appreciate your willingness to engage in a discussion of the issues.

Anyway, I hope my colleagues will help with raising additional specifics.  But let's take financial reporting.  And we need to know if these reflect standard body practices and identify the documents.  I would like to start off with a major advertiser use case.  Someone like Pepsi, which works with Yahoo and many others.   I don't expect specifics about Pepsi per se; just what a company expects and requires in terms of financial reporting retention and scope, given its multi-dimensional and long-term campaign objectives.

For such a major Fortune customer, what do they require be recorded, stored and/or shared for financial reporting purposes?  Does Company X require the agency/ad provider to hold specific record(s) demonstrating that an individual/unique user responded to an ad/marketing message, or only in the aggregate?  What are the specific data required for retention?  How long must the agency/ad provider retain such a record and in what form (i.e. does the content of the record change over time, for how long, and in what ways)?  What does the client expect to receive from the ad provider--how detailed a report for their own financial purposes?  Might Company X have its own practices on financial records for ad billing and ad expense purposes which require a specified period, such as 60 days, for such information to be sent for review?  How long does the ad network/provider itself have to retain the specific records, and do they change over time, based on industry practices?

Btw, is there a best practices doc out there which can be sent to the list on financial reporting?

I hope we can begin a dialogue and this is just a start.  

Thanks,  

Jeff
 


Jeffrey Chester
Center for Digital Democracy
1621 Connecticut Ave, NW, Suite 550
Washington, DC 20009
www.democraticmedia.org
www.digitalads.org
202-986-2220

On Sep 23, 2012, at 3:24 PM, Alan Chapell wrote:

> Hi Jeff - 
> 
> I hope I'm not perceived as piling on here – but you asked me a very similar question about transparency into industry practices. I and others – including Shane – have provided this group with a great deal of insight into industry practices. But when asked to offer examples of 'harm', you offer "lead generation services" and the assertion that these harms "are evident [to you] and others." 
> 
> The problem I'm having with your response is: a) lead generation products and services have little if anything to do with the DNT debate, and b) stating that harms are "evident" is, at best, non-responsive.
> 
> So I hope you'll understand that when you request "actual use cases for financial logging (and the other uses) with examples of the specific requirements of actual major advertisers and others", I'm going to ask for you to be clear in the rationale behind your request. If nothing else, that added transparency will help me ensure that my response is tailored to your question so we can stop talking past each other.
> 
> Thanks.
> 
> Alan
> 
> 
> From: Jeffrey Chester <jeff@democraticmedia.org>
> Date: Sunday, September 23, 2012 3:10 PM
> To: Shane Wiley <wileys@yahoo-inc.com>
> Cc: Ed Felten <ed@felten.com>, "Grimmelmann, James" <James.Grimmelmann@nyls.edu>, "<public-tracking@w3.org>" <public-tracking@w3.org>
> Subject: Re: definition of "unlinkable data" in the Compliance spec
> Resent-From: <public-tracking@w3.org>
> Resent-Date: Sun, 23 Sep 2012 19:11:49 +0000
> 
> Shane:
> 
> I didn't send it as a press quote, but as an example of why we need meaningful anonymization and unlinkability, esp. responding to your harm question.  I hope you are committed to a meaningful discussion on these issues at the W3C.  If Yahoo isn't willing to discuss the harms and anon. issues, I hope others in the group will.
> 
> Jeff 
> 
> 
> 
> 
> Jeffrey Chester
> Center for Digital Democracy
> 1621 Connecticut Ave, NW, Suite 550
> Washington, DC 20009
> www.democraticmedia.org
> www.digitalads.org
> 202-986-2220
> 
> On Sep 23, 2012, at 2:59 PM, Shane Wiley wrote:
> 
>> Jeff,
>>  
>> We’re discussing anonymization and “unlinkable data” so I’m struggling to see how your response and press quotes have anything to do with this topic.  Industry has already agreed that all forms of cross-site profiling for ad targeting will cease with DNT (much like today’s current opt-out regimes but with added persistence and accessibility).  When you have more information on harms stemming from anonymization and “unlinkable data” issues please let me know.
>>  
>> Thank you,
>> Shane
>>  
>> From: Jeffrey Chester [mailto:jeff@democraticmedia.org] 
>> Sent: Sunday, September 23, 2012 10:43 AM
>> To: Shane Wiley
>> Cc: Ed Felten; Grimmelmann, James; <public-tracking@w3.org>
>> Subject: Re: definition of "unlinkable data" in the Compliance spec
>>  
>> Shane.
>>  
>> I disagree with your claim that there is a "real divide" between work developed by the computer science scholars and online marketing practices.  I would find it odd for many high-tech companies to say this, esp. since they have major research initiatives in this area and hire so many scholars.
>>  
>> We need to have real use cases where we compare various privacy technology approaches to the issues outstanding here.
>>  
>> As for 3rd party data.  I leave it to other colleagues to tell you about subpoenas from gov't and other legal entities.  But the privacy harms to the user from the techniques used for some categories of digital targeting are evident to me and others.  The targeting of sub-prime mortgage loans online during the boom, btw, let alone current lead-generation techniques tied to consumer financial services, is one example of harms for me.  Although just a small example, this article from the NYT was based, in part, on my research:  http://www.nytimes.com/2012/08/19/business/electronic-scores-rank-consumers-by-potential-value.html?pagewanted=all
>>  
>>  
>> I look forward to working out these issues, and believe we can (combining both real-world determination and scholarly high-mindedness!) 
>>  
>> Jeff
>>  
>>  
>>  
>>  
>> On Sep 23, 2012, at 1:23 PM, Shane Wiley wrote:
>> 
>> 
>> Jeff,
>>  
>> The assertions are fair but were not intended to attack anyone personally and simply call out the very real divide between classroom scenarios and real-world ones.
>>  
>> As for “data”, could you please first provide the data of situations where 3rd party ad network data has been requested via subpoena and/or breached in a manner that there was some harm to a user?  This would allow us to have a more balanced conversation as to the appropriate measures to anonymize data in proportion to the real-world threat it exposes consumers to.
>>  
>> Thank you,
>> Shane
>>  
>> From: Jeffrey Chester [mailto:jeff@democraticmedia.org] 
>> Sent: Sunday, September 23, 2012 10:19 AM
>> To: Shane Wiley
>> Cc: Ed Felten; Grimmelmann, James; <public-tracking@w3.org>
>> Subject: Re: definition of "unlinkable data" in the Compliance spec
>>  
>> Can I suggest we refrain from making broad (and unfounded) assertions about what is "a highly simplified response...for a classroom" versus so-called real world.  I am sure you didn't mean it in a personal way, but it could be interpreted as dismissive of one of the leading experts in the field.
>>  
>> We simply haven't had the data presented to us from the online marketing industry about their actual practices so as a collective group within the W3C we can make informed decisions.  I would hope we would all recognize that the "truth" so to speak lies somewhere in between the artificial rhetorical poles we have set up here.  Amsterdam, and the coming weeks after, is a place to finally ensure that information is placed on the table so both Internet users and online ad companies can have an informed discussion.
>>  
>> Jeff
>>  
>> On Sep 23, 2012, at 1:05 PM, Shane Wiley wrote:
>> 
>> 
>> 
>> Ed,
>>  
>> I believe your approach makes inaccurate assumptions (IPv4 only) and requires the salt up-front which wouldn’t occur in the real-world.  This also fails to consider more advanced approaches to keyed one-way hashing/salting such as multi-permutator passes which is where most corporations are at today.
>>  
>> Again, a highly simplified response which is absolutely appropriate for a classroom or small lab setting but completely misses the mark in the real-world.
>>  
>> - Shane
>>  
>> From: Ed Felten [mailto:ed@felten.com] 
>> Sent: Saturday, September 22, 2012 12:53 PM
>> To: Shane Wiley
>> Cc: Grimmelmann, James; <public-tracking@w3.org>
>> Subject: Re: definition of "unlinkable data" in the Compliance spec
>>  
>> Reversing salted IP hashes requires 9 lines of code.
>>  
>> import copy
>> import hashlib
>>  
>> def reverseIpHash(salt, target):
>>    # salt is a bytearray; target is the raw SHA-1 digest of salt + 4 address bytes.
>>    for ip in range(256*256*256*256):
>>        trialIp = bytearray([(ip>>24)&0xff,(ip>>16)&0xff,(ip>>8)&0xff,ip&0xff])
>>        trialString = copy.copy(salt)
>>        trialString.extend(trialIp)
>>        if hashlib.sha1(trialString).digest()==target:
>>            return ip
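
The function quoted above can be exercised end to end on a reduced search space. What follows is a minimal sketch, not part of the original message: the helper names `hash_ip` and `reverse_ip_hash`, the example salt, the example address, and the /16 search range are all assumptions chosen so the run finishes quickly. Searching the full IPv4 space is the same loop over 2^32 values.

```python
import hashlib
import ipaddress

def hash_ip(salt: bytes, ip: int) -> bytes:
    """SHA-1 over the salt followed by the 4 big-endian address bytes."""
    return hashlib.sha1(salt + ip.to_bytes(4, "big")).digest()

def reverse_ip_hash(salt: bytes, target: bytes, network: str = "0.0.0.0/0"):
    """Try every address in `network`; return the one whose salted hash matches."""
    net = ipaddress.ip_network(network)
    start = int(net.network_address)
    for ip in range(start, start + net.num_addresses):
        if hash_ip(salt, ip) == target:
            return ip
    return None  # no address in the searched range matches

# Hypothetical example values (assumptions, not taken from the thread).
salt = b"example-salt"
secret_ip = int(ipaddress.ip_address("192.0.2.77"))
target = hash_ip(salt, secret_ip)

# Search only 65,536 candidates so the demonstration is fast.
recovered = reverse_ip_hash(salt, target, "192.0.0.0/16")
print(ipaddress.ip_address(recovered))
```

Note that the search needs the salt as an input, which is the condition disputed elsewhere in this thread.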
>>  
>>  
>> On Sat, Sep 22, 2012 at 12:42 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:
>> Ed,
>>  
>> Not “easy” if the salt/key is strongly protected and/or rotated/destroyed on a regular basis.  A dictionary attack requires either the raw data or access to the salt key – neither of which should be made easy/possible.  I tend to see the IP Address issue through the lens of IPv6 these days which further creates barriers to what you position as “easy to recover”. 
>>  
>> The advocacy side of the group tends to lean towards absolutist terms and solutions – the real world isn’t that easy even if it feels that way in a classroom or a small lab.
>>  
>> - Shane
>>  
>> From: Ed Felten [mailto:ed@felten.com] 
>> Sent: Saturday, September 22, 2012 5:30 AM
>> 
>> To: Shane Wiley
>> Cc: Grimmelmann, James; <public-tracking@w3.org>
>> Subject: Re: definition of "unlinkable data" in the Compliance spec
>>  
>> It's easy to recover hashed IP addresses if they're hashed as a whole (and not per-octet).   A straightforward dictionary attack will work against all IPv4 addresses.  Even a dumb brute-force search over the entire 32-bit space is feasible.  IPv6 is a bit more complicated--some will be recoverable and some won't, depending on details of address allocation.
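
One rough way to gauge the feasibility claim above is to time a sample of hashes and extrapolate to 2^32 candidates. The sketch below is only an estimate under stated assumptions: SHA-1 as the hash, a known salt, a single thread, and a hypothetical function name `estimate_full_search_seconds`. The result depends entirely on the machine it runs on, so no figure is given here.

```python
import hashlib
import time

IPV4_SPACE = 2 ** 32  # number of possible IPv4 addresses

def estimate_full_search_seconds(sample: int = 200_000,
                                 salt: bytes = b"example-salt") -> float:
    """Time `sample` salted SHA-1 hashes and scale to the full IPv4 space."""
    start = time.perf_counter()
    for ip in range(sample):
        hashlib.sha1(salt + ip.to_bytes(4, "big")).digest()
    elapsed = time.perf_counter() - start
    return elapsed / sample * IPV4_SPACE

seconds = estimate_full_search_seconds()
print(f"Estimated worst case: about {seconds / 3600:.1f} hours, single-threaded")
```

The work also splits cleanly across cores or machines, since each candidate address is tested independently.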
>>  
>> On Fri, Sep 21, 2012 at 1:16 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:
>> Ed,
>>  
>> I disagree with the concept of “easy to recover” as I’m not suggesting hashing the individual octets but rather the entire IP Address (not a single octet or individualized octet hashing) – especially as you apply this to IPv6.  With the appropriate level of access to raw and hashed datasets, the necessary tools, and the intent, some anonymization schemes can be hacked (dictionary attacks being the most straightforward).  I don’t believe the goal here is an absolutist one (aka “complete destruction of identifiers”) and that is why “commercially reasonable” is the appropriate outcome.
>>  
>> - Shane
>>  
>> From: Ed Felten [mailto:ed@felten.com] 
>> Sent: Friday, September 21, 2012 10:01 AM
>> To: Shane Wiley
>> Cc: Grimmelmann, James; <public-tracking@w3.org>
>> 
>> Subject: Re: definition of "unlinkable data" in the Compliance spec
>>  
>> By the way, hashing IP addresses (with or without salting) does not render them unlinkable.   After hashing, it's easy to recover the original IP address.  The story is similar for other types of unique identifiers--there are ways to get to unlinkability, but hashing by itself won't be enough.
>>  
>> On Fri, Sep 21, 2012 at 12:01 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:
>> <Ed - apologies for not getting back to you sooner - I was on vacation for the past week.>
>> 
>> James,
>> 
>> I like your approach the best and it was this perspective I was intending when writing the text that Ed is questioning.
>> 
>> The goal is to find the middle-ground between complete destruction of data and an unlinkable state that still allows for longitudinal consistency for analytical purposes BUT CANNOT be linked back to a production system such that the data could be used to modify a single user's experience.
>> 
>> For example, performing a one-way secret hash (salted hash) on identifiers (Cookie IDs, IP Addresses) and storing the resulting dataset in a logically/physically separate location from production data with strict access controls, policies, and employee education would meet the definition of "unlinkable" I'm aiming for.
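
The keyed-hash step described above might look roughly like the following. This is a sketch under assumptions, not a description of any company's system: HMAC-SHA256 stands in for the "one-way secret hash", and the names `pseudonymize`, `key`, and the sample cookie IDs are hypothetical. It shows only the longitudinal-consistency property (the same identifier always maps to the same token under one key); the separate storage, access controls, and policies mentioned above are outside what code like this can show.

```python
import hashlib
import hmac
import secrets

def pseudonymize(key: bytes, identifier: str) -> str:
    """Map an identifier to a stable token using a keyed one-way hash."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# The secret key would live apart from the analytics dataset (assumption).
key = secrets.token_bytes(32)

# Hypothetical raw log rows: (cookie ID, event).
records = [
    ("cookie-abc123", "page_view"),
    ("cookie-abc123", "ad_click"),
    ("cookie-xyz789", "page_view"),
]

# Replace each identifier with its token before storing for analysis.
analytics_rows = [(pseudonymize(key, cookie_id), event)
                  for cookie_id, event in records]

for token, event in analytics_rows:
    print(token[:12], event)
```

Rows from the same cookie stay groupable, while a dataset hashed under a different or destroyed key no longer lines up with this one. Anyone holding the key can still recompute tokens for guessed identifiers, which matters most for small identifier spaces such as IPv4 addresses.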
>> 
>> - Shane
>> 
>> -----Original Message-----
>> From: Grimmelmann, James [mailto:James.Grimmelmann@nyls.edu]
>> Sent: Friday, September 21, 2012 8:14 AM
>> To: Lauren Gelman
>> Cc: Ed Felten; <public-tracking@w3.org>
>> Subject: Re: definition of "unlinkable data" in the Compliance spec
>> 
>> I really like Lauren's suggestion.  My only concern is that "reasonably" and "reasonable" have so many different meanings in legal settings that it could be ambiguous.  Sometimes an action is "reasonable" if a person who is ethical and cautious would do it: it's not reasonable to leave sharp tools lying around in a children's play area, or to invest a trust fund in marshmallows.  Sometimes it refers to what a rational non-expert would believe about the subject, so a court will uphold a jury verdict unless "no reasonable jury" could have reached the conclusion it did.  Sometimes it's about the norms and expectations of an industry.  An auction might need to be conducted in a "commercially reasonable" way, which means for example giving enough notice that there will be real competitive bidding, but not spending more than the property is worth.
>> 
>> I think this last sense is the most appropriate one in context.  So perhaps something like "data that cannot be associated with an identifiable person or user agent through commercially reasonable means."  That is, the question would be whether a normal business with normal resources and motivations would consider reidentifying the data to be feasible.
>> 
>> James
>> 
>> --------------------------------------------------
>> James Grimmelmann              Professor of Law
>> New York Law School                 (212) 431-2864
>> 185 West Broadway       james.grimmelmann@nyls.edu<mailto:james.grimmelmann@nyls.edu>
>> New York, NY 10013    http://james.grimmelmann.net
>> 
>> On Sep 20, 2012, at 7:22 PM, Lauren Gelman <gelman@blurryedge.com<mailto:gelman@blurryedge.com>> wrote:
>> 
>> 
>> Unlinkable data is data that cannot reasonably be associated with an identifiable person or user agent.
>> 
>> Lauren Gelman
>> BlurryEdge Strategies
>> 415-627-8512
>> 
>> On Sep 18, 2012, at 8:05 AM, Ed Felten wrote:
>> 
>> Sorry to repost this, but nobody has answered any of my questions about Option 1 for the unlinkability definition.
>> 
>> Note to proponents of Option 1 (if any): If nobody can explain or clarify Option 1, that will presumably be used as an argument against Option 1 when decision time comes.
>> 
>> ---------- Forwarded message ----------
>> From: Ed Felten <ed@felten.com<mailto:ed@felten.com>>
>> Date: Thu, Sep 13, 2012 at 5:03 PM
>> Subject: definition of "unlinkable data" in the Compliance spec
>> To: "<public-tracking@w3.org<mailto:public-tracking@w3.org>>" <public-tracking@w3.org<mailto:public-tracking@w3.org>>
>> 
>> 
>> I have some questions about the Option 1 definition of "Unlinkable Data", section 3.6.1 in the Compliance spec editor's draft.   The definition is as follows [fixing typos]:
>> 
>> A party renders a dataset unlinkable when it:
>> 1. takes commercially reasonable steps to de-identify data such that there is confidence that it contains information which could not be linked to a specific user, user agent, or device in a production environment [2. and 3. aren't relevant to my questions]
>> 
>> I have several questions about what this means.
>> (A) Why does the definition talk about a process of making data unlinkable, instead of directly defining what it means for data to be unlinkable?  Some data needs to be processed to make it unlinkable, but some data is unlinkable from the start.  The definition should speak to both, even though unlinkable-from-the-start data hasn't gone through any kind of process.  Suppose FirstCorp collects data X; SecondCorp collects X+Y but then runs a process that discards Y to leave it with only X; and ThirdCorp collects X+Y+Z but then minimizes away Y+Z to end up with X.  Shouldn't these three datasets be treated the same--because they are the same X--despite having been through different processes, or no process at all?
>> (B) Why "commercially reasonable" rather than just "reasonable"?  The term "reasonable" already takes into account all relevant factors.  Can somebody give an example of something that would qualify as "commercially reasonable" but not "reasonable", or vice versa?  If not, "commercially" only makes the definition harder to understand.
>> (C) "there is confidence" seems to raise two questions.  First, who is it that needs to be confident?  Second, can the confidence be just an unsupported gut feeling of optimism, or does there need to be some valid reason for confidence?  Presumably the intent is that the party holding the data has justified confidence that the data cannot be linked, but if so it might be better to spell that out.
>> (D) Why "it contains information which could not be linked" rather than the simpler "it could not be linked"?  Do the extra words add any meaning?
>> (E) What does "in a production environment" add?  If the goal is to rule out results demonstrated in a research environment, I doubt this language would accomplish that goal, because all of the re-identification research I know of required less than a production environment.  If the goal is to rule out linking approaches that aren't at all practical, some other language would probably be better.
>> 
>> (I don't have questions about the meaning of Option 2; which shouldn't be interpreted as a preference for or against Option 2.)
>> 
Received on Sunday, 23 September 2012 20:02:31 UTC