Re: my questions from Dan Auerbach on 2012-10-29 (public-tracking@w3.org from October 2012)

From: Dan Auerbach <dan@eff.org>
Date: Mon, 29 Oct 2012 16:10:44 -0700
To: public-tracking@w3.org
Message-ID: <508F0CF4.2040202@eff.org>
Shane,

I think there might be some confusion about the burden of proof here.
Consider the question: what is the frequency with which large
pseudonymous data sets contain pseudonyms that can be reidentified? So
far the empirical answer to that question is 100% -- we both agree on
the AOL data set and it's the only data set we've considered. If you are
suggesting that that risk is low, the burden of proof is on you to
provide more data sets.

Now, there is a second question of whether only a handful of pseudonyms
being re-identified is good enough for privacy (say, under 1%). Are you
suggesting that this is the case? That so long as only a few individuals
here and there can be identified (say, within a year of the data
becoming public, since we don't know the long-term repercussions),
that's good enough for data to still be considered "unlinkable" and
to meet a bar of user privacy? If so, that's quite a different perspective
on privacy than I'm used to, and so I'd ask that you make that belief
explicit, so that we can talk through the two points of view.

Dan

On 10/29/2012 02:27 PM, Shane Wiley wrote:
>
> Dan,
>
>  
>
> I could also ask you to provide the opposite – please provide examples
> where anonymized data became public through a breach and individuals
> were then re-identified from the dataset (AOL =
> 600K records, 2-3 re-identified).  As I’m not aware of other real
> examples in this area, I’ll stand by my proposal and stated position
> on the frequency of this risk.
>
>  
>
> - Shane
>
>  
>
> *From:*Dan Auerbach [mailto:dan@eff.org]
> *Sent:* Monday, October 29, 2012 4:41 PM
> *To:* public-tracking@w3.org
> *Subject:* Re: my questions
>
>  
>
> Shane,
>
> Thanks for outlining in a bit more detail what you have in mind. I am
> optimistic that we can find the right approach here. As I've said, I
> agree with David that we shouldn't be too technically specific in the
> normative text, but this discussion still helps us shape the ballpark
> of what's reasonable.
>
> On 10/23/2012 10:07 PM, Shane Wiley wrote:
>
>     Dan,
>
>      
>
>     While we disagree on scope, I believe we’re narrowing in on a path
>     to closure on this language and issue.
>
>      
>
>     If the core “harm” in this case is having a “unique identifier” +
>     “whole URL” then the anonymization approach should address both
>     elements.  As I’ve stated earlier (and you’ve already said you
>     disagree), I believe a technical process (one-way secret hash)
>     plus access controls, stated policies, employee education, and
>     process requirements is enough to meet the requirements of
>     anonymization/unlinkability (use centric – would solve breach risk). 
>
>      
>
>     There are situations where a URL can provide clues - when combined
>     with a consistent identifier across events – that may allow query
>     string details to be leveraged to provide opportunities for
>     reverse engineering identity at an incredibly small rate (the AOL
>     incident).
>
> I don't think "at an incredibly small rate" is fair. How many large
> data breaches have there been that have been public? How many have
> contained data that could be re-identified? The first question
> establishes the denominator of the fraction in question, and so to
> credibly claim this is a small rate, you should provide lots of
> examples of public data breaches that contain HTTP data, query data,
> etc. where no-one has been able to re-identify people, or devices.
>
>
>
>   Organizations may take steps to strip common elements from query
> strings such as username, name, ID, GUID, password, address, phone,
> etc. (rarely passed in the URL but it can still happen).  While it’ll
> be difficult to build a perfect filtering system, a solid approach
> here in combination with unique identifier anonymization (along with
> access controls, policies, education, and process) will develop a fair
> “unlinkability” solution.
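
A minimal sketch of the query-string filtering step described above, assuming a hand-maintained blocklist of parameter names. The list `BLOCKED_KEYS` and the function name `strip_query` are hypothetical and only illustrate the idea; nothing in this thread specifies an implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical blocklist, taken from the parameter names mentioned in the
# message above; a real deployment would have to maintain its own list.
BLOCKED_KEYS = {"username", "name", "id", "guid", "password", "address", "phone"}


def strip_query(url, blocked=BLOCKED_KEYS):
    """Return url with every query parameter whose name is in `blocked` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in blocked
    ]
    # Rebuild the URL with only the parameters that survived the filter.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A filter like this only catches parameter names someone thought to list, which is the weakness of a purely manual approach.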
>
> I agree a perfect system is tricky. It sounds like the process you
> outline is to manually build a list of problem tokens, knowing where
> they will appear in the data set, and strip them. I think there is a
> danger of PII sneaking in given the manual nature of this solution.
> Let's push this a bit further: we have lots of information-theoretic
> tools at our disposal to measure the entropy of tokens in the data. To
> make sure that our manual list didn't miss anything -- zip codes, say
> -- we could automate a process in which if a combination of tokens
> associated with a request appears infrequently enough in the data set,
> we should either discard the data or render it low entropy so that it
> is indistinguishable from other sets of tokens. This will guarantee
> that we don't miss anything, but will still allow for common sets of
> tokens (Firefox UA, example.com/popularurl, Friday, Sports Vertical)
> to be unchanged, and so you'll be able to know how many people
> visiting example.com/popularurl had a sports vertical vs a medical
> vertical. But you'd never be surprised by accidentally saving PII or
> other high-entropy data.
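
A minimal sketch of the automated check described above, assuming each request has already been reduced to a tuple of tokens and that a threshold `k` has been chosen. The function name `suppress_rare`, the tuple layout, and the threshold are assumptions for illustration. The sketch takes the "discard" option; rendering rare records low-entropy instead would need a generalization step that is not shown.

```python
from collections import Counter


def suppress_rare(records, k):
    """Keep only records whose exact token combination occurs at least k times.

    `records` is a list of hashable token tuples, for example
    (user_agent, url, weekday, vertical). Rare combinations are dropped.
    """
    counts = Counter(records)
    return [record for record in records if counts[record] >= k]


# Hypothetical data: one common combination and one rare one.
common = ("Firefox", "example.com/popularurl", "Friday", "Sports")
rare = ("Firefox", "example.com/profile?zip=94110", "Friday", "Medical")
filtered = suppress_rare([common, common, common, rare], k=2)
```

Here `filtered` contains only the three copies of the common combination, so aggregate counts per vertical survive while the rare, potentially identifying record does not.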
>
>
>
>  
>
> - Shane
>
>  
>
> *From:*Dan Auerbach [mailto:dan@eff.org]
> *Sent:* Tuesday, October 23, 2012 3:29 PM
> *To:* public-tracking@w3.org <mailto:public-tracking@w3.org>
> *Subject:* Re: my questions
>
>  
>
> Hi Shane,
>
> I think this is an incredibly important section to get right, and it
> is absolutely within scope to make sure that we come to an
> understanding of what companies will be doing in the real world to
> anonymize data. Merely agreeing on some language without this deeper
> understanding just kicks the can down the road and leaves everyone
> wondering about what is and what isn't acceptable. This isn't good for
> companies who are worried about regulatory enforcement, isn't good for
> users worried about privacy, and isn't good for regulators who are
> potentially trying to sort out what constitutes an acceptable
> implementation. I second Jonathan's point that we should be actually
> having this debate and not talking around it or distracting through
> small quibbles about e.g. SRPs in particular.
>
> I think you and I might be in agreement about how we should structure
> the document: let's have normative text that does not discuss specific
> implementations, alongside non-normative examples that do offer some
> guidance. One such non-normative example could be 1024-unlinkability,
> and another could be a negative example about how hashing IP addresses
> and cookies does NOT provide sufficient protection. I think these
> non-normative examples ought to give shape to the normative text, and
> we should have one or two more.
>
> Keeping in mind that we agree that normative text should not be overly
> constrained with implementation details, I still think it behooves you
> to provide a more detailed sketch about what you have in mind. For
> one, I'm sure everyone here is very curious about how this will work
> in practice, and users should have confidence that folks abiding by
> the DNT spec have their head on their shoulders when it comes to
> properly anonymizing data.
>
> One final note about the normative text: the FTC text [1] seems very
> similar and to be the most clear to me -- it is not far from the DAA
> language, and so I'd imagine industry folks wouldn't be opposed to
> just inserting that text instead. It seems much simpler and cleaner
> than what is in place now; do we need to reinvent the wheel here?
>
> Dan
>
> [1] http://ftc.gov/os/2012/03/120326privacyreport.pdf:
>
> Data is not “reasonably linkable” to the extent that a company:  
> (1) takes reasonable measures to ensure that the data is
> de-identified; (2) publicly commits not to try to reidentify the data;
> and (3) contractually prohibits downstream recipients from trying to
> re-identify the data.
>
> On 10/23/2012 07:44 AM, Shane Wiley wrote:
>
>     Jonathan,
>
>
>     Continued personal attacks from you aside (hopefully the co-chairs
>     and W3C staff will address this), I’d like to address the
>     substance of the issues under discussion.
>
>      
>
>     *Question:*  Should the DNT discussion also address anonymization
>     and unlinkability? 
>
>     *Answer:*  The Working Group appears to agree these are important
>     concepts so we’re referring to them to highlight that once a
>     dataset has reached the “unlinkable” or “anonymous” state, then it
>     is no longer in the scope of DNT.  While we appear to agree on
>     this point at a high-level, we clearly disagree on the details.  I
>     would suggest that this is not the correct forum to take the deep
>     anonymization dive to develop a highly prescriptive outcome.  We
>     are too far apart on the details and we can simply refer to
>     anonymization and unlinkability without defining these in deep
>     detail.  This allows the debate to continue outside the confines
>     of DNT (where it will take considerable time to find more common
>     ground but I believe it’s achievable with more focused discussion
>     such as Dan suggested).  My goal is to develop a standard that is
>     implementable and addresses those issues closest to the genesis of
>     the DNT debate  -- but don’t try to solve all online privacy
>     issues in a single pass (however attractive that notion is).
>
>      
>
>     I would recommend we speak to many of the issues you and Dan have
>     referred to as non-normative text to highlight areas where
>     aggressive approaches clearly meet “anonymization” or
>     “Unlinkability” (k-anonymity, URL filtering, super campaign
>     structures, client-side storage, etc.) but not go so far as to
>     declare these as the only possible approaches in normative text.
>
>      
>
>     I doubt you’ll take this opportunity to meet in the middle (your
>     reference to a “retort”) but I’m hopeful the Working Group sees
>     this as a clear path forward to conclusion.   
>
>      
>
>     - Shane
>
>      
>
>     *From:*Jonathan Mayer [mailto:jmayer@stanford.edu]
>     *Sent:* Monday, October 22, 2012 10:24 PM
>     *To:* Shane Wiley
>     *Cc:* Dan Auerbach; public-tracking@w3.org
>     <mailto:public-tracking@w3.org>
>     *Subject:* Re: my questions
>
>      
>
>     Shane,
>
>      
>
>     For want of a better metaphor: you are the climate change skeptic
>     of computer privacy.  Against an overwhelming consensus in the
>     scientific community, you persist in claiming that
>     re-pseudonymization significantly mitigates privacy risks.  I am
>     not aware of a single serious researcher who shares your radical view.
>
>      
>
>     Dan correctly challenged you.  He observed that a pseudonymous
>     browsing history is often identified or identifiable.  It is not
>     "anonymous" or "unlinkable" as those terms are ordinarily used.
>      One way in which a pseudonymous browsing history may be
>     identified or identifiable is information leakage from a
>     first-party website's content.  Dan provided a helpful analogy to
>     the AOL search results debacle, which showed how easily a
>     semantically rich pseudonymous dataset can be re-identified.  (So
>     easy to do, even the New York Times can do it!)
>
>      
>
>     Like any good denialist, instead of earnestly engaging with your
>     critic, you straw manned his claim.  You emphasized the rarity of
>     search result pages leaking information to third-party
>     websites—which was far from Dan's central concern.  (You're wrong
>     on this too, by the way.  Research by Krishnamurthy and Wills
>     showed that search queries frequently leak to third parties.)  I
>     refocused on the relevant issues, which are 1) whether
>     pseudonymous browsing histories are identified or identifiable
>     (they are), and 2) whether identifying information leaks from
>     first-party websites to third-party websites (it does).
>
>      
>
>     If the pattern holds, you'll send a response of nonsensical bluster.
>      Don't expect a retort.  Unlike some of the more patient members
>     of the group, I long ago ceased pretending you're negotiating in
>     good faith.
>
>      
>
>     Jonathan
>
>      
>
>     On Monday, October 22, 2012 at 5:32 PM, Shane Wiley wrote:
>
>         Jonathan,
>
>
>         We were speaking to SRPs but your research doesn’t appear to
>         call this out.  Can you please show where in your research it
>         was “incredibly common” for SRPs to have 3rd-party tags?  It
>         was in that context that I made my comment, so it would be
>         helpful if you could respond in the same context and not use
>         my words more broadly.
>
>          
>
>         Thank you,
>
>         Shane
>
>          
>
>         *From:*Jonathan Mayer [mailto:jmayer@stanford.edu]
>         *Sent:* Monday, October 22, 2012 3:57 PM
>         *To:* Shane Wiley
>         *Cc:* Dan Auerbach; public-tracking@w3.org
>         <mailto:public-tracking@w3.org>
>         *Subject:* Re: my questions
>
>          
>
>         Last fall I conducted an empirical measurement of identifying
>         information leakage from first-party websites to third-party
>         websites.  It was not "incredibly rare in the real-world," but
>         rather, incredibly common in the real world.
>         See https://cyberlaw.stanford.edu/node/6740.  Other
>         researchers have attained similar results.
>
>          
>
>         For a higher-level discussion of the myriad ways in which
>         pseudonymous tracking data is identified or identifiable, I
>         highly recommend Arvind Narayanan's piece "There is no such
>         thing as anonymous online tracking."
>          See https://cyberlaw.stanford.edu/node/6701.
>
>          
>
>         In short: there is overwhelming evidence that a pseudonymous
>         browsing history is not, within any plain meaning, "anonymous"
>         or "unlinkable."
>
>          
>
>         Jonathan
>
>          
>
>         On Monday, October 22, 2012 at 2:47 PM, Shane Wiley wrote:
>
>             Dan,
>
>              
>
>             I believe URL filtering will be an equally important
>             element of an anonymization approach.  For a 3rd party to
>             receive search queries in a URL as you suggest, they would
>             need to be on the SRP (Search Results Page) which is
>             incredibly rare in the real-world but there are still web
>             sites out there that, against industry best practice, pass
>             user details in the query string and those need to be
>             discovered and removed.
>
>              
>
>             - Shane
>
>              
>
>             *From:*Dan Auerbach [mailto:dan@eff.org]
>             *Sent:* Monday, October 22, 2012 1:57 PM
>             *To:* public-tracking@w3.org <mailto:public-tracking@w3.org>
>             *Subject:* Re: my questions
>
>              
>
>             You cannot be serious about anonymizing data if you are
>             keeping full URLs (and the information derived from them),
>             along with the ability to associate those URLs as coming
>             from the same user, e.g. via a hash of a cookie. URLs are
>             well-known to have search terms in them, for starters. By
>             your proposal, the leaked AOL search query data set from
>             2006 -- a data set that has been used to link user 4417749
>             to Thelma Arnold -- should be considered data that has
>             been rendered "unlinkable". (see e.g.
>             http://www.nytimes.com/2006/08/09/technology/09aol.html?pagewanted=all&_r=0)
>
>             If the industry is interested in "weakly anonymized" data,
>             then, first of all, let's call it by a name like that. But
>             no matter the name, merely hashing cookies and IP
>             addresses does not represent a good-faith effort to
>             anonymize data. I've written before that I'd be very
>             interested to work with folks on this issue, and am happy
>             to have a detailed discussion based around hypotheticals.
>             We can discuss logging pipelines, and how to properly
>             segment and anonymize raw logs without losing the ability
>             to do the things you would like to do. I'm sympathetic to
>             the fact that this might take time to implement and would
>             present a cost to companies, but I think taking the effort
>             to do this properly is a reasonable thing to ask of
>             companies that are serious about respecting DNT. Moreover,
>             for companies that choose to simply delete the data
>             entirely after a small retention period, this issue won't
>             come up at all.
>
>             Shane, would you be willing to form a small working group
>             so that we can talk through this? I'm happy to discuss
>             on-list as well, but so far I feel like my attempts to
>             engage in this debate have been brushed aside -- I don't
>             think that's the best way forward.
>
>             Dan
>
>             On 10/19/2012 03:35 PM, Shane Wiley wrote:
>
>                 Vincent,
>
>                  
>
>                 This would definitely be an option for companies but would destroy more value in the data than anonymizing the cookie ID and IP Address individually.  Another option would be to drop the IP Address altogether and only retain the high-level associated user agent details (browser type/version, OS) and resulting geo data (country, state, city).  And then only anonymize the cookie ID (if those are the only two unique identifiers in the record).  As the cookie ID will not be the same across devices or user accounts on the same system, it still has the isolation you're seeking but provides better longitudinal consistency for cross session reporting.  The important detail is that the resulting ID not match anything in production and that systems/policies/processes bar employees from ever using this data outside of the reporting sandbox.
>
>                  
>
>                 - Shane
>
>                  
>
>                 -----Original Message-----
>
>                 From: TOUBIANA, VINCENT (VINCENT) [mailto:Vincent.Toubiana@alcatel-lucent.com] 
>
>                 Sent: Friday, October 19, 2012 5:20 AM
>
>                 To: Shane Wiley; Lauren Gelman
>
>                 Cc: Ed Felten; public-tracking@w3.org <mailto:public-tracking@w3.org>
>
>                 Subject: RE: my questions
>
>                  
>
>                 Shane,
>
>                  
>
>                 In Amsterdam you detailed a sanitizing process that would meet your definition of unlinkability: hashing cookies and IP addresses. Would it be OK to first concatenate the IP address with the cookies and then hash the result instead? 
>
>                 That would at least enable session unlinkability for individuals who use different browsing profiles and/or browsers.
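
A minimal sketch of the difference between the two schemes, assuming the "one-way secret hash" is a keyed hash. HMAC-SHA256, the key `SECRET`, and the function names are assumptions for illustration; the thread does not name a specific construction.

```python
import hashlib
import hmac

SECRET = b"example-secret-key"  # hypothetical key, for illustration only


def pseudonym(value, key=SECRET):
    """Keyed one-way hash of a string identifier."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def hash_separately(cookie, ip):
    """Hash the cookie ID and the IP address as two independent fields."""
    return pseudonym(cookie), pseudonym(ip)


def hash_combined(cookie, ip):
    """Concatenate IP and cookie first, then hash once."""
    return pseudonym(ip + "|" + cookie)


# Two browsers (different cookies) behind the same IP address.
ip = "192.0.2.10"
a_cookie_hash, a_ip_hash = hash_separately("cookie-A", ip)
b_cookie_hash, b_ip_hash = hash_separately("cookie-B", ip)
```

With separate hashing, `a_ip_hash` equals `b_ip_hash`, so records from the two browsers can still be joined on the hashed IP. With `hash_combined`, the two browsers get unrelated pseudonyms and no shared field remains.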
>
>                  
>
>                 Thank you,
>
>                  
>
>                 Vincent
>
>                  
>
>                 ________________________________________
>
>                 From: Shane Wiley [wileys@yahoo-inc.com <mailto:wileys@yahoo-inc.com>]
>
>                 Sent: Thursday, October 18, 2012 6:51 PM
>
>                 To: Lauren Gelman
>
>                 Cc: Ed Felten; public-tracking@w3.org <mailto:public-tracking@w3.org>
>
>                 Subject: RE: my questions
>
>                  
>
>                 Correct - unlinkable data is outside the scope of the spec.
>
>                  
>
>                 - Shane
>
>                  
>
>                 From: Lauren Gelman [mailto:gelman@blurryedge.com]
>
>                 Sent: Thursday, October 18, 2012 9:47 AM
>
>                 To: Shane Wiley
>
>                 Cc: Ed Felten; public-tracking@w3.org <mailto:public-tracking@w3.org>
>
>                 Subject: Re: my questions
>
>                  
>
>                  
>
>                 Isn't unlinkable from the start data not covered by the spec?
>
>                  
>
>                 Lauren Gelman
>
>                 BlurryEdge Strategies
>
>                 415-627-8512
>
>                  
>
>                 On Oct 16, 2012, at 11:18 PM, Shane Wiley wrote:
>
>                  
>
>                  
>
>                 Ed,
>
>                  
>
>                 Here are the direct responses to your earlier questions on Unlinkability:
>
>                  
>
>                 (A) Why does the definition talk about a process of making data unlinkable, instead of directly defining what it means for data to be unlinkable?  Some data needs to be processed to make it unlinkable, but some data is unlinkable from the start.  The definition should speak to both, even though unlinkable-from-the-start data hasn't gone through any kind of process.  Suppose FirstCorp collects data X; SecondCorp collects X+Y but then runs a process that discards Y to leave it with only X; and ThirdCorp collects X+Y+Z but then minimizes away Y+Z to end up with X.  Shouldn't these three datasets be treated the same--because they are the same X--despite having been through different processes, or no process at all?
>
>                  
>
>                  
>
>                 [I believe the definition is subsumed in process (breaking the link with production systems) and is already called out.]
>
>                  
>
>                 (B) Why "commercially reasonable" rather than just "reasonable"?  The term "reasonable" already takes into account all relevant factors.  Can somebody give an example of something that would qualify as "commercially reasonable" but not "reasonable", or vice versa?  If not, "commercially" only makes the definition harder to understand.
>
>                  
>
>                 [Commercially reasonable takes into account more considerations of what is reasonable to "any person" and what would be reasonable to consider a company to be able to perform.  As this is fairly standard language in contracts it feels appropriate to use this here as well.]
>
>                  
>
>                 (C) "there is confidence" seems to raise two questions.  First, who is it that needs to be confident?  Second, can the confidence be just an unsupported gut feeling of optimism, or does there need to be some valid reason for confidence?  Presumably the intent is that the party holding the data has justified confidence that the data cannot be linked, but if so it might be better to spell that out.
>
>                  
>
>                 [Confidence - the company representing that they have achieved unlinkability.  I'm okay with adding some degree of diligence be required here versus "unsupported gut feeling of optimism".]
>
>                  
>
>                 (D) Why "it contains information which could not be linked" rather than the simpler "it could not be linked"?  Do the extra words add any meaning?
>
>                  
>
>                 [I believe both options work but the "contains information" highlights issues like URL details better than the simpler form you've offered.]
>
>                  
>
>                 (E) What does "in a production environment" add?  If the goal is to rule out results demonstrated in a research environment, I doubt this language would accomplish that goal, because all of the re-identification research I know of required less than a production environment.  If the goal is to rule out linking approaches that aren't at all practical, some other language would probably be better.
>
>                  
>
>                  
>
>                 [The goal is to prohibit production use of retained data.  This is of course a "use based" approach to solving the issue here versus a "collection based" approach.  My hope is that this approach finds the sweet spot between proportionally reducing consumer privacy risks and at the same time allowing the data to be used for anonymous/aggregated reporting/analytics/research.  Anonymization/aggregation approaches discussed to date such as K-Anonymity destroy a considerable amount of value in data - as well as arbitrarily force non-DNT data to also be funneled into these approaches for consistency in analytics.]
>
>                  
>
>                 - Shane
>
>                  
>
>                 From: Ed Felten [mailto:ed@felten.com]
>
>                 Sent: Wednesday, October 03, 2012 8:20 AM
>
>                 To: Shane Wiley
>
>                 Subject: re: my questions
>
>                  
>
>                 Sorry, I don't see a reply from you that addresses my questions specifically.   You did state the general goal of your proposal, but I don't think you addressed my specific questions.   If I missed something in my quick review of your messages in the thread--which is quite possible--please point me to the right place.
>
>                  
>
>                  
>
>              
>
>             -- 
>
>             Dan Auerbach
>
>             Staff Technologist
>
>             Electronic Frontier Foundation
>
>             dan@eff.org <mailto:dan@eff.org>
>
>             415 436 9333 x134
>
>          
>
>      
>


-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
Received on Monday, 29 October 2012 23:11:13 UTC