Re: my questions from Dan Auerbach on 2012-10-22 (public-tracking@w3.org from October 2012)

From: Dan Auerbach <dan@eff.org>
Date: Mon, 22 Oct 2012 15:48:29 -0700
To: public-tracking@w3.org
Message-ID: <5085CD3D.8020507@eff.org>
Great -- are you planning to write up a proposal for how URL filtering
ought to work? Keeping only the domain would be a start, though I'm
still not convinced this would provide sufficient anonymity. If you
think it should be standard practice to keep more than the domain, we
should get into the nitty gritty of what you'd like to keep.
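
As a rough starting point for that discussion, here is a minimal sketch
of what "keeping only the domain" could look like, in Python using only
the standard library. The function name `keep_domain_only` and the
example URL are hypothetical, and nothing here reflects any company's
actual logging pipeline:

```python
from urllib.parse import urlsplit


def keep_domain_only(url: str) -> str:
    """Return only the host part of a logged URL.

    The path, query string, and fragment are dropped, since those are
    the parts most likely to carry search terms or user details.
    Returns an empty string if no host can be parsed.
    """
    host = urlsplit(url).hostname  # lower-cased host, or None if absent
    return host or ""


# Hypothetical log entry containing a search query in the query string.
logged_url = "http://www.example.com/search?q=cheap+flights&page=2"
filtered = keep_domain_only(logged_url)
print(filtered)
```

Even this leaves the open question raised above: a sequence of domains
tied to the same hashed identifier may still be distinctive enough to
single someone out.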

In addition to URLs and the parameters sent along with HTTP requests,
another tricky area I'd like to highlight is data gathered via
JavaScript. It'd be great to know what the plans are for anonymizing
this data.

On 10/22/2012 02:47 PM, Shane Wiley wrote:
>
> Dan,
>
>  
>
> I believe URL filtering will be an equally important element of an
> anonymization approach.  For a 3rd party to receive search queries in
> a URL as you suggest, they would need to be on the SRP (Search Results
> Page), which is incredibly rare in the real world, but there are still
> web sites out there that, against industry best practice, pass user
> details in the query string, and those need to be discovered and removed.
>

I think this is a disjunctive statement: they need to be on the SRP, OR
have query information passed to them via URL parameters (or POST data)
by actors operating "against industry best practice". It's also
important to keep in mind that even obfuscated or encrypted personal
information collected via GET URL params, POST data, or JavaScript
represents linkable data to anyone able to decrypt it. Also: I don't
know the current figure for average words-per-query across search
engines, but it's clear that hashing query terms will not be effective
given that a significant percentage of queries could be brute-forced.
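
To make the brute-force point concrete, here is a small sketch in
Python. Everything in it is an assumption for illustration: the choice
of unsalted SHA-256, the tiny vocabulary, and the sample queries are all
made up, and a real attacker would use a far larger word list:

```python
import hashlib
from itertools import permutations


def hash_query(query: str) -> str:
    """Hash a query string with unsalted SHA-256 (an assumed scheme)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def brute_force(hashed_queries, vocabulary, max_words=2):
    """Recover hashed queries by hashing every short word sequence.

    Builds a lookup table of digest -> plaintext for all orderings of up
    to `max_words` distinct vocabulary words, then looks up each logged
    digest in that table.
    """
    table = {}
    for n in range(1, max_words + 1):
        for words in permutations(vocabulary, n):
            candidate = " ".join(words)
            table[hash_query(candidate)] = candidate
    return {d: table[d] for d in hashed_queries if d in table}


# Hypothetical vocabulary and "anonymized" log of hashed queries.
vocabulary = ["cheap", "flights", "weather", "seattle"]
logged = [
    hash_query("cheap flights"),
    hash_query("weather seattle"),
    hash_query("a much longer and more unusual query"),
]

recovered = brute_force(logged, vocabulary)
for digest, plaintext in recovered.items():
    print(digest[:12], "->", plaintext)
```

Short queries built from common words fall out immediately; only the
long, unusual query survives, and those are exactly the ones most likely
to be identifying on their own.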

>  
>
> - Shane
>
>  
>
> *From:*Dan Auerbach [mailto:dan@eff.org]
> *Sent:* Monday, October 22, 2012 1:57 PM
> *To:* public-tracking@w3.org
> *Subject:* Re: my questions
>
>  
>
> You cannot be serious about anonymizing data if you are keeping full
> URLs (and the information derived from them), along with the ability
> to associate those URLs as coming from the same user, e.g. via a hash
> of a cookie. URLs are well-known to have search terms in them, for
> starters. By your proposal, the leaked AOL search query data set from
> 2006 -- a data set that has been used to link user 4417749 to Thelma
> Arnold -- should be considered data that has been rendered
> "unlinkable". (see e.g.
> http://www.nytimes.com/2006/08/09/technology/09aol.html?pagewanted=all&_r=0)
>
> If the industry is interested in "weakly anonymized" data, then, first
> of all, let's call it by a name like that. But no matter the name,
> merely hashing cookies and IP addresses does not represent a
> good-faith effort to anonymize data. I've written before that I'd be
> very interested to work with folks on this issue, and am happy to have
> a detailed discussion based around hypotheticals. We can discuss
> logging pipelines, and how to properly segment and anonymize raw logs
> without losing the ability to do the things you would like to do. I'm
> sympathetic to the fact that this might take time to implement and
> would present a cost to companies, but I think taking the effort to do
> this properly is a reasonable thing to ask of companies that are
> serious about respecting DNT. Moreover, for companies that choose to
> simply delete the data entirely after a small retention period, this
> issue won't come up at all.
>
> Shane, would you be willing to form a small working group so that we
> can talk through this? I'm happy to discuss on-list as well, but so
> far I feel like my attempts to engage in this debate have been brushed
> aside -- I don't think that's the best way forward.
>
> Dan
>
> On 10/19/2012 03:35 PM, Shane Wiley wrote:
>
>     Vincent,
>
>      
>
>     This would definitely be an option for companies but would destroy more value in the data than anonymizing the cookie ID and IP Address individually.  Another option would be to drop the IP Address altogether and only retain the high-level associated user agent details (browser type/version, OS) and resulting geo data (country, state, city).  And then only anonymize the cookie ID (if those are the only two unique identifiers in the record).  As the cookie ID will not be the same across devices or user accounts on the same system, it still has the isolation you're seeking but provides better longitudinal consistency for cross session reporting.  The important detail is that the resulting ID not match anything in production and that systems/policies/processes bar employees from ever using this data outside of the reporting sandbox.
>
>      
>
>     - Shane
>
>      
>
>     -----Original Message-----
>
>     From: TOUBIANA, VINCENT (VINCENT) [mailto:Vincent.Toubiana@alcatel-lucent.com] 
>
>     Sent: Friday, October 19, 2012 5:20 AM
>
>     To: Shane Wiley; Lauren Gelman
>
>     Cc: Ed Felten; public-tracking@w3.org <mailto:public-tracking@w3.org>
>
>     Subject: RE: my questions
>
>      
>
>     Shane,
>
>      
>
>     In Amsterdam you detailed a sanitizing process that would meet your definition of unlinkability: hashing cookies and IP addresses. Would it be OK to first concatenate the IP address with the cookies and then hash the result instead?
>
>     That would at least enable session unlinkability for individuals who use different browsing profiles and/or browsers.
>
>      
>
>     Thank you,
>
>      
>
>     Vincent
>
>      
>
>     ________________________________________
>
>     From: Shane Wiley [wileys@yahoo-inc.com <mailto:wileys@yahoo-inc.com>]
>
>     Sent: Thursday, October 18, 2012 6:51 PM
>
>     To: Lauren Gelman
>
>     Cc: Ed Felten; public-tracking@w3.org <mailto:public-tracking@w3.org>
>
>     Subject: RE: my questions
>
>      
>
>     Correct - unlinkable data is outside the scope of the spec.
>
>      
>
>     - Shane
>
>      
>
>     From: Lauren Gelman [mailto:gelman@blurryedge.com]
>
>     Sent: Thursday, October 18, 2012 9:47 AM
>
>     To: Shane Wiley
>
>     Cc: Ed Felten; public-tracking@w3.org <mailto:public-tracking@w3.org>
>
>     Subject: Re: my questions
>
>      
>
>      
>
>     Isn't unlinkable from the start data not covered by the spec?
>
>      
>
>     Lauren Gelman
>
>     BlurryEdge Strategies
>
>     415-627-8512
>
>      
>
>     On Oct 16, 2012, at 11:18 PM, Shane Wiley wrote:
>
>      
>
>      
>
>     Ed,
>
>      
>
>     Here are the direct responses to your earlier questions on Unlinkability:
>
>      
>
>     (A) Why does the definition talk about a process of making data unlinkable, instead of directly defining what it means for data to be unlinkable?  Some data needs to be processed to make it unlinkable, but some data is unlinkable from the start.  The definition should speak to both, even though unlinkable-from-the-start data hasn't gone through any kind of process.  Suppose FirstCorp collects data X; SecondCorp collects X+Y but then runs a process that discards Y to leave it with only X; and ThirdCorp collects X+Y+Z but then minimizes away Y+Z to end up with X.  Shouldn't these three datasets be treated the same--because they are the same X--despite having been through different processes, or no process at all?
>
>      
>
>      
>
>     [I believe the definition is subsumed in process (breaking the link with production systems) and is already called out.]
>
>      
>
>     (B) Why "commercially reasonable" rather than just "reasonable"?  The term "reasonable" already takes into account all relevant factors.  Can somebody give an example of something that would qualify as "commercially reasonable" but not "reasonable", or vice versa?  If not, "commercially" only makes the definition harder to understand.
>
>      
>
>     [Commercially reasonable takes into account more considerations of what is reasonable to "any person" and what would be reasonable to consider a company to be able to perform.  As this is fairly standard language in contracts it feels appropriate to use this here as well.]
>
>      
>
>     (C) "there is confidence" seems to raise two questions.  First, who is it that needs to be confident?  Second, can the confidence be just an unsupported gut feeling of optimism, or does there need to be some valid reason for confidence?  Presumably the intent is that the party holding the data has justified confidence that the data cannot be linked, but if so it might be better to spell that out.
>
>      
>
>     [Confidence - the company representing that they have achieved unlinkability.  I'm okay with requiring some degree of diligence here versus "unsupported gut feeling of optimism".]
>
>      
>
>     (D) Why "it contains information which could not be linked" rather than the simpler "it could not be linked"?  Do the extra words add any meaning?
>
>      
>
>     [I believe both options work but the "contains information" highlights issues like URL details better than the simpler form you've offered.]
>
>      
>
>     (E) What does "in a production environment" add?  If the goal is to rule out results demonstrated in a research environment, I doubt this language would accomplish that goal, because all of the re-identification research I know of required less than a production environment.  If the goal is to rule out linking approaches that aren't at all practical, some other language would probably be better.
>
>      
>
>      
>
>     [The goal is to prohibit production use of retained data.  This is of course a "use based" approach to solving the issue here versus a "collection based" approach.  My hope is that this approach finds the sweet spot between proportionally reducing consumer privacy risks and at the same time allowing the data to be used for anonymous/aggregated reporting/analytics/research.  Anonymization/aggregation approaches discussed to date such as K-Anonymity destroy a considerable amount of value in data - as well as arbitrarily force non-DNT data to also be funneled into these approaches for consistency in analytics.]
>
>      
>
>     - Shane
>
>      
>
>     From: Ed Felten [mailto:ed@felten.com]
>
>     Sent: Wednesday, October 03, 2012 8:20 AM
>
>     To: Shane Wiley
>
>     Subject: re: my questions
>
>      
>
>     Sorry, I don't see a reply from you that addresses my questions specifically.   You did say what the general goal of your proposal is, but I don't think you addressed my specific questions.   If I missed something in my quick review of your messages in the thread--which is quite possible--please point me to the right place.
>
>      
>
>      
>
>
>
>
> -- 
> Dan Auerbach
> Staff Technologist
> Electronic Frontier Foundation
> dan@eff.org <mailto:dan@eff.org>
> 415 436 9333 x134


-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
Received on Monday, 22 October 2012 22:48:55 UTC