RE: de-identification text for Wednesday's call from Justin Brookman on 2013-04-03 (public-tracking@w3.org from April 2013)

From: Justin Brookman <jbrookman@cdt.org>
Date: Tue, 02 Apr 2013 22:33:26 -0400
To: public-tracking@w3.org
Message-ID: <20130403023326.ed1df338@mail.maclaboratory.net>
If this is all you're doing with it, maybe there's a path forward.

For reporting, we have already discussed a dedicated permitted use for that (currently financial logging and auditing in the text).  During the call last Wednesday, I expressed concern about the extensive time periods that companies wanted to keep data for this purpose (Shane framed the debate as a couple years vs several years).  Perhaps one way to ease advocates' concern about this permitted use would be to require that data collected for this purpose be lightly de-identified pursuant to Shane's standard (hashing plus internal access controls).


As for modeling, I think you should be able to do that with extensive longitudinal data sets under the more robust approach.  While the idea of aggregate reporting as a dependent permitted use was not adopted in Bellevue, I don't see why you couldn't strongly deidentify (hash-and-throw-away-the-key) an existing data set kept for a separate permitted use and use the resulting data set for modeling or other research purposes.  That is, if you have a data set on users going back six months for security purposes, you could deidentify that data pursuant to Dan's standard and still have six months' worth of data tied to unique (but deidentified) users for modeling purposes.  At that point, the data would be wholly outside the scope of DNT.  You still might need to strip down the URLs to make sure they don't contain identifying information and otherwise ensure that it's reasonable to believe that the string of URLs couldn't be tied to a person, but the standard would not need to be prescriptive on how that is achieved.  As noted in Bellevue, this would still provide a perverse incentive for companies to overstate the need to retain data for a permitted use, but I'm not sure there's a way around that.
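
As a rough illustration of the hash-and-throw-away-the-key idea described above, here is a minimal Python sketch. It is not drawn from any proposal text: the `deidentify` and `strip_url` helpers, the record layout, and the choice of HMAC-SHA256 are all assumptions made for the example, and dropping the query string is only one possible way to strip a URL, not a sufficient one.

```python
import hashlib
import hmac
import secrets
from urllib.parse import urlsplit


def strip_url(url):
    """Drop the query string and fragment (an assumed, simplistic cleanup)."""
    return urlsplit(url)._replace(query="", fragment="").geturl()


def deidentify(records):
    """Replace each user ID with a keyed hash, then discard the key.

    Records for the same user stay linked to each other, but without
    the key the pseudonyms cannot be recomputed from the original IDs.
    """
    key = secrets.token_bytes(32)  # random key, never stored or returned
    out = []
    for user_id, url in records:
        pseudonym = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
        out.append((pseudonym, strip_url(url)))
    del key  # drops the reference only; real key destruction needs more care
    return out


# Hypothetical log kept for a separate permitted use.
log = [
    ("cookie-abc", "https://example.com/a?email=someone@example.org"),
    ("cookie-xyz", "https://example.com/b"),
    ("cookie-abc", "https://example.com/c#section"),
]
result = deidentify(log)
```

In this sketch the first and third rows of `result` share a pseudonym, so longitudinal modeling remains possible, while a second call to `deidentify` would produce unrelated pseudonyms because a new key is generated each time.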

From: Shane Wiley [mailto:wileys@yahoo-inc.com]
To: John Simpson [mailto:john@consumerwatchdog.org]
Cc: Dan Auerbach [mailto:dan@eff.org], public-tracking@w3.org [mailto:public-tracking@w3.org]
Sent: Tue, 02 Apr 2013 16:15:25 -0500
Subject: RE: de-identification text for Wednesday's call

Reporting and modeling purposes.  

   

- Shane  

From: John Simpson [mailto:john@consumerwatchdog.org]  
  Sent: Tuesday, April 02, 2013 12:10 PM
  To: Shane Wiley
  Cc: Dan Auerbach; public-tracking@w3.org
  Subject: Re: de-identification text for Wednesday's call      

   

Then what are you retaining and using the data for?
  
On Apr 2, 2013, at 11:03 AM, Shane Wiley <wileys@yahoo-inc.com> wrote:    

Dan,    
  

     
  

Once the one-way hash is applied (and the other elements of the record appropriately cleansed), the data is moved to a system that is not allowed to be accessed externally.  It's these operational and administrative controls that are essential to ensure de-identified data is not re-identified at some later time.  I believe you’re looking only at the technical merits, which is only a small portion of the overall solution.
  

     
  

- Shane    
  

     
  
  
  

From: Dan Auerbach [mailto:dan@eff.org]
  Sent: Tuesday, April 02, 2013 10:59 AM
  To: public-tracking@w3.org
  Subject: Re: de-identification text for Wednesday's call        
  

     
  
  

On 04/02/2013 08:50 AM, Shane Wiley wrote:        
  

once the one-way hash function has been applied the data is never again able to be accessed in real-time to modify the user’s experience.      
  

I think I'm confused, can you explain this more? How is this possible? If you are just hashing a cookie string, your web server receives a request that includes a cookie string, you hash that cookie string (which is an incredibly fast operation), match the hashed cookie against the stored data, and return personalized results.
  
  Or are you salting the hash differently for every request, or combining the cookie with an ephemeral piece of data (the timestamp) before hashing and then throwing away the timestamp?
  
  Thanks for clarifying, apologies if I'm just being dense.
  
  Dan
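
The two cases in Dan's question can be sketched in Python as follows. This is only an illustration under stated assumptions: the `stable_hash` and `per_request_hash` functions, the `profiles` store, and the use of SHA-256 are hypothetical and do not describe any company's actual system.

```python
import hashlib
import secrets


def stable_hash(cookie):
    """Plain unsalted hash: the same cookie always gives the same digest."""
    return hashlib.sha256(cookie.encode("utf-8")).hexdigest()


def per_request_hash(cookie):
    """Hash with a fresh random salt that is discarded after use."""
    salt = secrets.token_bytes(16)  # ephemeral, never stored
    return hashlib.sha256(salt + cookie.encode("utf-8")).hexdigest()


# Hypothetical stored profile keyed by the hashed cookie.
profiles = {stable_hash("cookie-abc"): ["sports", "travel"]}

incoming = "cookie-abc"

# Case 1: the server re-hashes the incoming cookie and finds the profile,
# so real-time personalization is still possible.
found_stable = profiles.get(stable_hash(incoming))

# Case 2: a per-request salt yields an unrelated digest, so the lookup fails.
found_salted = profiles.get(per_request_hash(incoming))
```

Here `found_stable` is the stored profile and `found_salted` is `None`. Note that the per-request variant also prevents linking two records from the same user, so it would not by itself support longitudinal data sets.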
  
  
  
      

--   

Dan Auerbach  

Staff Technologist  

Electronic Frontier Foundation  

dan@eff.org  

415 436 9333 x134
Received on Wednesday, 3 April 2013 02:33:56 UTC