Re: Deidentification (ISSUE-188) from Justin Brookman on 2014-08-06 (public-tracking@w3.org from August 2014)

From: Justin Brookman <jbrookman@cdt.org>
Date: Wed, 6 Aug 2014 11:29:25 -0400
To: David Singer <singer@apple.com>
Cc: "public-tracking@w3.org List" <public-tracking@w3.org>
Message-Id: <E2B60104-CABD-4C01-959A-13FD69CC151F@cdt.org>

On Jul 31, 2014, at 7:54 PM, David Singer <singer@apple.com> wrote:

> Let’s look at how we use the term and whether we want
> * deidentified
> * persistently deidentified
> * anonymized
> * noa
> 
> or something else.  Here is where we use the term right now.
> 
> * * * *
> 
> 2.10 — definition.  I don’t repeat it as that’s the section we are trying to write
> 
> (I note, by the way, that we define it without a hyphen and then uniformly use it with a hyphen, which, for a defined term, is poor form!)
> 
> 5. Third party compliance
> 
> [except]
> 
> A third party to a given user action may nevertheless collect and use such data when:
> …
> 	• or, the data is de-identified as defined in this recommendation.
> 
> 
> 
> 5.2.2, part of the general principles for permitted uses
> 
> After there are no remaining permitted uses for given data, the data must be deleted or de-identified.
> 
> 
> 8 Unknowing collection
> 
> If a party learns that it possesses data in violation of this recommendation, it must, where reasonably feasible, delete or de-identify that data at the earliest practical opportunity.
> 
> * * * *
> 
> In general, I think in all three cases we are saying that if it meets this criterion, the data has passed out of scope and cannot or will not come back into scope (i.e. by re-identification).  
> 
> In which of these could ‘grey state’ data (data that can be re-identified by someone in the know, e.g. someone holding the secret key) apply?  Such data may matter in the health domain (you’ve just realized that an important subset of the people in the data has some treatable but serious disease, for example), but is that really true here? In particular, we are trying, I think, to improve users’ privacy by ensuring that the people who could and did observe you are not ‘tracking’ you at all, yet those are the very same people who would make and hold such a secret key.  It seems to me that there could be lengthy debates here, and we don’t need them.

I think this is one distinction between the NAI definition on the one hand and Roy’s and Vincent’s on the other.  NAI envisions that the secret key is maintained (but not used); Roy’s and Vincent’s (I think) envision that you couldn’t reidentify even if you wanted to.
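
To make that distinction concrete, here is a minimal sketch, assuming a keyed hash (HMAC-SHA256) as the deidentification step. The function names and the choice of HMAC are hypothetical illustrations only; none of this comes from the NAI text or from Roy's or Vincent's proposals.

```python
import hashlib
import hmac
import secrets
from typing import List, Optional


def pseudonymize(user_id: str, key: bytes) -> str:
    """Keyed approach (hypothetical): replace an identifier with an HMAC token."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def reidentify(token: str, candidates: List[str], key: bytes) -> Optional[str]:
    """Whoever still holds the key can test candidate identifiers against a token."""
    for candidate in candidates:
        if hmac.compare_digest(pseudonymize(candidate, key), token):
            return candidate
    return None


def unlinkable_token() -> str:
    """Keyless approach (hypothetical): a random token with no stored mapping."""
    return secrets.token_hex(32)


key = secrets.token_bytes(32)            # retained secret key ("grey state")
token = pseudonymize("user-123", key)

# The key holder can link the token back to a known identifier.
print(reidentify(token, ["user-1", "user-123"], key))

# Without the key (a different key stands in for "key destroyed"), the same
# candidate list no longer matches the token.
print(reidentify(token, ["user-1", "user-123"], secrets.token_bytes(32)))
```

In the first case the data is only as deidentified as the key holder's restraint; in the second, or with a random token and no retained mapping, this particular route back to the identifier is closed. Neither case says anything about re-identification from the rest of the record.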

> 
> In none of these cases are we talking about public disclosure as such, in fact; we are saying that the data passes out of our scope, which means we no longer have anything to say about disclosure, retention, use, or anything at all.

Right.  Under the standard, public disclosure of deidentified data is out of scope and not prohibited or limited in any way, unless you want to say that a condition of “deidentification” is a promise by all holders not to reidentify the data, in which case you probably couldn’t publicly release the data set (unless you get someone to click on an agreement not to try to reidentify prior to their accessing the data).

That last part is the key question for you: do you still want to require a promise-by-all-not-to-try-to-reidentify as a condition of deidentification, or do you want to support one of the other three options?  You alternatively have suggested that the releaser bear responsibility for the data in the event it’s reidentified, which I think the other options effectively cover. If you represented to the user you weren’t going to share tracking data and you accidentally did, I don’t think there’s a good faith exception to the prohibition on deceptive statements, at least not in the U.S.

> 
> 
> On Jul 29, 2014, at 19:11 , Justin Brookman <jbrookman@cdt.org> wrote:
> 
>> 
>>> Do either of you want to suggest language for the spec to bind parties to 
>>> not try to reidentify? 
>> 
>> The concept appears 3 times in the TCS, and in each place, a requirement to keep it de-identified would seem tricky to write. (Someone is welcome to try). 
>> 
>> Perhaps it would be cleaner to have two definitions: 
>> 
>> * de-identified 
>> 
>> * persistently de-identified 
>> 
>> with the first being a definition of the state (as above), and the second has the data carrying the requirement that the originator not attempt to re-identify, and that any sharing with another party by the originator or anyone receiving the data with this restriction, either pass on the restriction, or accept the responsibility if re-identification in fact occurs. 
>> 
>> then we can use the one or the other in the document, as appropriate. 
>> 
>> So this sounds like a stricter version of the red-yellow-green discussion from before.  What do you envision requiring regular deidentification, and what would require persistently de-identified (really deidentified + promises/liability)?  Would it be just for sharing?  So there wouldn't need to be an internal promise not to reidentify, but if you release, you either get a promise or take responsibility?
>> 
>> What would "responsibility" look like?  We can't really create a cause of action with a technical standard.
>> 
> 
> Perhaps we say that if the data is later re-identified, then the party that thought it had done deidentification was in error, and clause 8 applies (i.e. it has to delete the data or immediately improve the de-identification).
> 
> I think there is value in saying also that the requirement not to re-identify may be passed on.
> 
> 
> David Singer
> Manager, Software Standards, Apple Inc.
>
Received on Wednesday, 6 August 2014 15:29:39 UTC