Re: de-identification text for Wednesday's call from Justin Brookman on 2013-04-03 (public-tracking@w3.org from April 2013)

From: Justin Brookman <jbrookman@cdt.org>
Date: Wed, 03 Apr 2013 11:17:59 -0400
To: Shane Wiley <wileys@yahoo-inc.com>
Cc: "public-tracking@w3.org " <public-tracking@w3.org>
Message-ID: <b2uor9tvpaka0gluhu2xcp5x.1365002092364@email.android.com>
I could live with a three-state model, but would shift what is allowed in each bucket.  For raw log data (red), you can retain for security and debugging (and I guess frequency capping if you're not doing double-keyed cookies).  For hashed-plus-access-controls-on-rotating-key (yellow), you can retain for audit and reporting.  For fully deidentified (key gone, green), you can do analytics.  If you have to retain at a more specific level for a particular permitted use, you don't need to create a new data set for another purpose.  So if you need the data for security, you don't need to create a separate deidentified dataset for your analytics.
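
A rough sketch of the three buckets, not proposed spec text. It assumes HMAC-SHA256 as the keyed hash; the field names, sample values, and the keyed_hash/deidentify helpers are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
import secrets


def keyed_hash(key: bytes, value: str) -> str:
    """Replace one identifier with its HMAC-SHA256 digest under a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict, key: bytes) -> dict:
    """Hash every field of a log record under the given key."""
    return {field: keyed_hash(key, value) for field, value in record.items()}


# Red: raw log record, retained only for security and debugging.
red = {"cookie_id": "abc123", "ip": "192.0.2.10"}

# Yellow: hashed under a rotating key that stays behind access controls.
# Records hashed under the same key remain linkable to each other.
rotating_key = secrets.token_bytes(32)
yellow = deidentify(red, rotating_key)

# Green: hashed again under a one-time key that is then thrown away,
# so the mapping cannot be recomputed from a fresh cookie or IP address.
one_time_key = secrets.token_bytes(32)
green = deidentify(yellow, one_time_key)
del one_time_key
```

While the rotating key exists, anyone holding it can recompute the yellow value for a known cookie, which is why that bucket leans on access controls rather than on the hash alone.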

Sent from mobile, please excuse curtness and typos

Shane Wiley <wileys@yahoo-inc.com> wrote:

>
>Justin,
>
> 
>
>The needs for reporting and modeling are fairly similar (in fact you could consider it all “analytics”) so differentiated approaches wouldn’t be very helpful here.  Chunking data as you suggest creates arbitrary gaps in data consistency so I would recommend a different approach – a DNT Data Lifecycle, for simplicity’s sake, with 3 stages:
>
> 
>
>Data comes in with DNT -> for non-Permitted Uses data is immediately de-identified (unique IDs, URL, IP Address) with a keyed/secret hash -> data remains in this de-identified state for some period of time (transparency required) -> data is de-identified again with a different keyed/secret hash (key is rotated/destroyed at some regular interval).  
>
> 
>
>This process creates a period of consistently de-identified data and then moves to more of an unlinkable state (I know we’ve agreed to not use that term so I introduce it here only to give a name to the final resting state in the above proposed approach).  
>
> 
>
>While I agree on data reuse where possible in an operational infrastructure, rather than multiplying storage costs, if two data stores are to exist (versus a single one) I would recommend the following structure (of course there could be more than two for more specific use cases):  
>
> 
>
>Incoming Data w/DNT:
>
>--->  Raw Data Store (Security, Debugging, Frequency Capping, some forms of Financial/Audit)
>
>                --->  De-Identified Data Store (Analytics, some forms of Financial/Audit)
>
> 
>
>With the Raw Data Store having a different (shorter) data retention period than the De-Identified Data Store.  Once data has been de-identified (per Dan’s proposed definition), I consider it out of scope of DNT.  Do you agree?
>
> 
>
>Thank you,
>
>- Shane 
>
> 
>
>From: Justin Brookman [mailto:jbrookman@cdt.org] 
>Sent: Tuesday, April 02, 2013 7:33 PM
>To: public-tracking@w3.org
>Subject: RE: de-identification text for Wednesday's call
>
> 
>
>If this is all you're doing with it, maybe there's a path forward.
>
>For reporting, we have already discussed a dedicated permitted use for that (currently financial logging and auditing in the text).  During the call last Wednesday, I expressed concern about the extensive time periods that companies wanted to keep data for this purpose (Shane framed the debate as a couple years vs several years).  Perhaps one way to ease advocates' concern about this permitted use would be to require that data collected for this purpose be lightly de-identified pursuant to Shane's standard (hashing plus internal access controls).
>
> 
>
>As for modeling, I think you should be able to do that with extensive longitudinal data sets under the more robust approach.  While the idea of aggregate reporting as a dependent permitted use was not adopted in Bellevue, I don't see why you couldn't strongly deidentify (hash-and-throw-away-the-key) an existing data set kept for a separate permitted use and use the resulting data set for modeling or other research purposes.  That is, if you have a data set on users going back 6 months for security purposes, you could deidentify that data pursuant to Dan's standard and still have six months' worth of data tied to unique (but deidentified) users for modeling purposes.  At that point, the data would be wholly outside the scope of DNT.  You still might need to strip down the urls to make sure they don't contain identifying information and otherwise ensure that it's reasonable to believe that the string of urls couldn't be tied to a person, but the standard would not need to be prescriptive on how that is achieved.  As noted in Bellevue, this would still provide a perverse incentive to companies to overstate the need to retain data for a permitted use, but I'm not sure there's a way around that.
>
>From: Shane Wiley [mailto:wileys@yahoo-inc.com]
>To: John Simpson [mailto:john@consumerwatchdog.org]
>Cc: Dan Auerbach [mailto:dan@eff.org], public-tracking@w3.org [mailto:public-tracking@w3.org]
>Sent: Tue, 02 Apr 2013 16:15:25 -0500
>Subject: RE: de-identification text for Wednesday's call
>
>Reporting and modeling purposes.
>
> 
>
>- Shane
>
> 
>
>From: John Simpson [mailto:john@consumerwatchdog.org] 
>Sent: Tuesday, April 02, 2013 12:10 PM
>To: Shane Wiley
>Cc: Dan Auerbach; public-tracking@w3.org
>Subject: Re: de-identification text for Wednesday's call
>
> 
>
>then what are you retaining and using the data for?
>
> 
>
>On Apr 2, 2013, at 11:03 AM, Shane Wiley <wileys@yahoo-inc.com> wrote:
>
> 
>
>Dan,
>
> 
>
>Once the one-way hash is applied (and other elements of the record appropriately cleansed) the data is moved to a system that is not allowed to be accessed externally.  It’s these operational and administrative controls that are essential to ensure de-identified data is not re-identified at some later time.  I believe you’re looking only at the technical merits, which are only a small portion of the overall solution.
>
> 
>
>- Shane
>
> 
>
>From: Dan Auerbach [mailto:dan@eff.org] 
>Sent: Tuesday, April 02, 2013 10:59 AM
>To: public-tracking@w3.org
>Subject: Re: de-identification text for Wednesday's call
>
> 
>
>On 04/02/2013 08:50 AM, Shane Wiley wrote:
>
>once the one-way hash function has been applied the data is never again able to be accessed in real-time to modify the user’s experience.
>
>I think I'm confused, can you explain this more? How is this possible? If you are just hashing a cookie string, your web server receives a request that includes a cookie string, you hash that cookie string (which is an incredibly fast operation), match the hashed cookie against the stored data, and return personalized results.
>
>Or are you salting the hash differently for every request, or combining the cookie with an ephemeral piece of data (the timestamp) before hashing and then throwing away the timestamp?
>
>Thanks for clarifying, apologies if I'm just being dense.
>
>Dan
>
>
>-- 
>
>Dan Auerbach
>
>Staff Technologist
>
>Electronic Frontier Foundation
>
>dan@eff.org
>
>415 436 9333 x134
>
> 
>
Received on Wednesday, 3 April 2013 15:18:46 UTC