RE: de-identification text for Wednesday's call from Shane Wiley on 2013-04-03 (public-tracking@w3.org from April 2013)

From: Shane Wiley <wileys@yahoo-inc.com>
Date: Wed, 3 Apr 2013 04:55:57 +0000
To: Justin Brookman <jbrookman@cdt.org>, "public-tracking@w3.org" <public-tracking@w3.org>
Message-ID: <DCCF036E573F0142BD90964789F720E3140355B8@GQ1-MB01-02.y.corp.yahoo.com>
Justin,

The needs for reporting and modeling are fairly similar (in fact you could consider it all “analytics”) so differentiated approaches wouldn’t be very helpful here.  Chunking data as you suggest creates arbitrary gaps in data consistency so I would recommend a different approach – a DNT Data Lifecycle, for simplicity's sake, with 3 stages:

Data comes in with DNT -> for non-Permitted Uses data is immediately de-identified (unique IDs, URL, IP Address) with a keyed/secret hash -> data remains in this de-identified state for some period of time (transparency required) -> data is de-identified again with a different keyed/secret hash (key is rotated/destroyed at some regular interval).
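
A minimal sketch of the lifecycle described above, assuming HMAC-SHA256 as the keyed/secret hash. The record layout, the field names (user_id, url, ip, ts) and the deidentify helper are hypothetical and chosen only for illustration; the message does not specify an algorithm or a schema.

```python
import hashlib
import hmac
import secrets


def deidentify(record, key, fields=("user_id", "url", "ip")):
    """Return a copy of record with the listed fields replaced by keyed hashes."""
    out = dict(record)
    for name in fields:
        if name in out:
            digest = hmac.new(key, str(out[name]).encode("utf-8"), hashlib.sha256)
            out[name] = digest.hexdigest()
    return out


# Stage 1: data arrives with DNT set (hypothetical record layout).
raw = {
    "user_id": "cookie-abc123",
    "url": "https://example.com/page",
    "ip": "203.0.113.7",
    "ts": 1364964957,
}

# Stage 2: immediate de-identification with a secret key. While this key is
# held, the same input always maps to the same token, so the data stays
# consistent for analytics over the retention period.
stage_key = secrets.token_bytes(32)
stage2 = deidentify(raw, stage_key)

# Stage 3: hash again with a different key, then discard that key so the
# final tokens cannot be recomputed from the stage-2 values.
rotation_key = secrets.token_bytes(32)
stage3 = deidentify(stage2, rotation_key)
del rotation_key
```

In this sketch, records hashed under the same stage key remain joinable with one another, and that property is lost across batches once the second key is rotated or destroyed.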

This process creates a period of consistently de-identified data and then moves to more of an unlinkable state (I know we’ve agreed to not use that term so I introduce it here only to give a name to the final resting state in the above proposed approach).

While I agree on reusing data where possible in an operational infrastructure, rather than multiplying storage costs by keeping two data stores instead of a single one, I would recommend the following structure (of course there could be more than two for more specific use cases):

Incoming Data w/DNT:
--->  Raw Data Store (Security, Debugging, Frequency Capping, some forms of Financial/Audit)
                --->  De-Identified Data Store (Analytics, some forms of Financial/Audit)

With the Raw Data Store having a different (shorter) data retention period than the De-Identified Data Store.  Once data has been de-identified (per Dan’s proposed definition), I consider it out of scope of DNT.  Do you agree?
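
A rough sketch of the two-store retention idea, assuming each record carries a received timestamp. The retention periods below (90 days and 730 days) are placeholders invented for illustration; the message gives no actual figures, only that the raw store's period is shorter.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods; only their ordering (raw < de-identified)
# comes from the proposal.
RAW_RETENTION = timedelta(days=90)
DEID_RETENTION = timedelta(days=730)


def purge(store, retention, now):
    """Keep only the records received within the retention window."""
    cutoff = now - retention
    return [rec for rec in store if rec["received"] >= cutoff]


now = datetime(2013, 4, 3, tzinfo=timezone.utc)
records = [
    {"id": 1, "received": now - timedelta(days=30)},
    {"id": 2, "received": now - timedelta(days=200)},
]

raw_store = purge(records, RAW_RETENTION, now)    # only the 30-day-old record
deid_store = purge(records, DEID_RETENTION, now)  # both records
```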

Thank you,
- Shane

From: Justin Brookman [mailto:jbrookman@cdt.org]
Sent: Tuesday, April 02, 2013 7:33 PM
To: public-tracking@w3.org
Subject: RE: de-identification text for Wednesday's call

If this is all you're doing with it, maybe there's a path forward.
For reporting, we have already discussed a dedicated permitted use for that (currently financial logging and auditing in the text).  During the call last Wednesday, I expressed concern about the extensive time periods that companies wanted to keep data for this purpose (Shane framed the debate as a couple years vs several years).  Perhaps one way to ease advocates' concern about this permitted use would be to require that data collected for this purpose be lightly de-identified pursuant to Shane's standard (hashing plus internal access controls).

As for modeling, I think you should be able to do that with extensive longitudinal data sets under the more robust approach.  While the idea of aggregate reporting as a dependent permitted use was not adopted in Bellevue, I don't see why you couldn't strongly deidentify (hash-and-throw-away-the-key) an existing data set kept for a separate permitted use and use the resulting data set for modeling or other research purposes.  That is, if you have a data set on users going back 6 months for security purposes, you could deidentify that data pursuant to Dan's standard and still have six months' worth of data tied to unique (but deidentified) users for modeling purposes.  At that point, the data would be wholly outside the scope of DNT.  You still might need to strip down the urls to make sure they don't contain identifying information and otherwise ensure that it's reasonable to believe that the string of urls couldn't be tied to a person, but the standard would not need to be prescriptive on how that is achieved.  As noted in Bellevue, this would still provide a perverse incentive to companies to overstate the need to retain data for a permitted use, but I'm not sure there's a way around that.
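
One way to picture "hash-and-throw-away-the-key" applied to an existing data set, as a sketch only: the strong_deidentify function, the record fields and the choice of HMAC-SHA256 are assumptions, not anything specified in this thread. The key exists only for the duration of one pass, so rows for the same user stay linked to each other while nothing stored afterwards allows the mapping to be recomputed.

```python
import hashlib
import hmac
import secrets


def strong_deidentify(records):
    """Pseudonymize user IDs with a one-time key that is never stored."""
    key = secrets.token_bytes(32)  # lives only inside this call
    out = []
    for rec in records:
        pseudonym = hmac.new(
            key, rec["user_id"].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        out.append({"user": pseudonym, "event": rec["event"]})
    return out  # the key goes out of scope here and is not retained


# Hypothetical six-month security log reduced to three rows.
logs = [
    {"user_id": "u1", "event": "login"},
    {"user_id": "u2", "event": "click"},
    {"user_id": "u1", "event": "logout"},
]
modeling_set = strong_deidentify(logs)
```

Rows 0 and 2 share a pseudonym, so longitudinal modeling still works; a second run produces unrelated pseudonyms because a fresh key is drawn each time. URL stripping, which the message also mentions, is not covered by this sketch.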
________________________________
From: Shane Wiley [mailto:wileys@yahoo-inc.com]
To: John Simpson [mailto:john@consumerwatchdog.org]
Cc: Dan Auerbach [mailto:dan@eff.org], public-tracking@w3.org
Sent: Tue, 02 Apr 2013 16:15:25 -0500
Subject: RE: de-identification text for Wednesday's call
Reporting and modeling purposes.

- Shane

From: John Simpson [mailto:john@consumerwatchdog.org]
Sent: Tuesday, April 02, 2013 12:10 PM
To: Shane Wiley
Cc: Dan Auerbach; public-tracking@w3.org
Subject: Re: de-identification text for Wednesday's call

then what are you retaining and using the data for?

On Apr 2, 2013, at 11:03 AM, Shane Wiley <wileys@yahoo-inc.com> wrote:

Dan,

Once the one-way hash is applied (and other elements of the record appropriately cleansed) the data is moved to a system that is not allowed to be accessed externally.  It’s these operational and administrative controls that are essential to ensure de-identified data is not re-identified at some later time.  I believe you’re looking only at the technical merits, which are only a small portion of the overall solution.

- Shane

From: Dan Auerbach [mailto:dan@eff.org]
Sent: Tuesday, April 02, 2013 10:59 AM
To: public-tracking@w3.org
Subject: Re: de-identification text for Wednesday's call

On 04/02/2013 08:50 AM, Shane Wiley wrote:
once the one-way hash function has been applied the data is never again able to be accessed in real-time to modify the user’s experience.
I think I'm confused, can you explain this more? How is this possible? If you are just hashing a cookie string, your web server receives a request that includes a cookie string, you hash that cookie string (which is an incredibly fast operation), match the hashed cookie against the stored data, and return personalized results.

Or are you salting the hash differently for every request, or combining the cookie with an ephemeral piece of data (the timestamp) before hashing and then throwing away the timestamp?
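
The distinction being asked about can be sketched as follows, assuming SHA-256 and a hypothetical in-memory profile store; none of the names here come from any system described in the thread. A plain hash is deterministic, so a returning cookie hashes to the same stored token, whereas a fresh random salt that is discarded after each request yields a token that matches nothing stored.

```python
import hashlib
import secrets


def plain_hash(cookie):
    """Deterministic: the same cookie always yields the same token."""
    return hashlib.sha256(cookie.encode("utf-8")).hexdigest()


def salted_hash(cookie):
    """Fresh random salt per call, discarded afterwards."""
    salt = secrets.token_bytes(16)
    return hashlib.sha256(salt + cookie.encode("utf-8")).hexdigest()


# Hypothetical stored profile keyed by the hashed cookie.
profiles = {plain_hash("cookie-abc123"): ["sports", "travel"]}

# A later request with the same cookie: the plain hash finds the profile.
hit = profiles.get(plain_hash("cookie-abc123"))

# With a per-request salt, the lookup fails (barring a hash collision).
miss = profiles.get(salted_hash("cookie-abc123"))
```

A keyed hash behaves like plain_hash for anyone who still holds the key, so in that case real-time matching is prevented by access controls or key destruction, not by the hash itself.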

Thanks for clarifying, apologies if I'm just being dense.

Dan



--
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
Received on Wednesday, 3 April 2013 04:57:03 UTC