Re: Deidentification (ISSUE-188) from Justin Brookman on 2014-08-08 (public-tracking@w3.org from August 2014)

From: Justin Brookman <jbrookman@cdt.org>
Date: Fri, 8 Aug 2014 07:56:57 -0400
To: TOUBIANA Vincent <vtoubiana@cnil.fr>, David Singer <singer@apple.com>
Cc: public-tracking@w3.org
Message-ID: <3044468974-407912448@mail.maclaboratory.net>
TOUBIANA Vincent <vtoubiana@cnil.fr>, 8/8/2014 6:02 AM:
 
My understanding is that deidentified data can be kept forever and for no defined purpose (it does not correspond to any permitted use). So I believe this definition does not provide sufficient guarantees:  
 
- Under the first criterion, some entities would consider that deidentified data could still contain the full IP address. In many cases such deidentified data could be enough to infer sensitive information about a group of users sharing the same IP address (like a family).

I don't think this test takes a position one way or the other on whether data containing IP addresses (or any other particular data element) would be deidentified.  It depends on the nature of the data set.

 
 
- The second criterion focuses on re-identification but does not prevent any kind of profiling. Furthermore, it's an "either/or", so it sounds as though an entity would not violate this criterion if it re-identifies the data but does not transfer it to any other entity.

No, I don't think this is right.  I guess you can build profiles based on deidentified data sets, but you wouldn't be able to alter the user's experience or otherwise identify the user, user agent or device; the profiling would just be for research purposes.  And no, you wouldn't be able to reidentify the data and just not transfer it, since the first part of the test is still binding: you must reasonably believe that the data cannot be reidentified.  If you've reidentified it, it's probably not reasonable to think that it's not reidentifiable!


 
Vincent 
 
 
 
-----Original Message-----
From: David Singer [mailto:singer@apple.com]
Sent: Thursday, 7 August 2014 18:15
To: Justin Brookman
Cc: public-tracking@w3.org WG
Subject: Re: Deidentification (ISSUE-188)
 
 
On Aug 7, 2014, at 9:06 , Justin Brookman <jbrookman@cdt.org> wrote: 
 
> David, under your definition, it sounds like you're trying to force companies who release deidentified data to bind recipients not to reidentify the data, or else take responsibility in the event the data is subsequently reidentified.  So essentially, there is a safe harbor for entities that bind recipients.  Here is a slightly clunky effort at saying that:
>  
> A data set is considered deidentified when (1) there exists a reasonable level of justified confidence that none of the data within it can be linked to a particular user, user agent, or device and (2) either any transfer of the data is accompanied by a restriction on recipients from trying to reidentify the data, or the data is not subsequently reidentified. 
 
This might work.  I guess to be formal we should say that the originator is also under the restriction.  We can just say that the data is accompanied by a restriction, or is not subsequently reidentified (deleting the 'transfer').  It then becomes a property of the data. 
 
 
A data set is considered deidentified when (1) there exists a reasonable level of justified confidence that none of the data within it can be linked to a particular user, user agent, or device and (2) either the data is accompanied by a restriction forbidding any attempt to reidentify the data, or the data is not subsequently reidentified. 
 
 
 
>  
> On Aug 6, 2014, at 11:46 AM, David Singer <singer@apple.com> wrote: 
>  
>>  
>> On Aug 6, 2014, at 8:29 , Justin Brookman <jbrookman@cdt.org> wrote: 
>>  
>>>  
>>>  
>>> On Jul 31, 2014, at 7:54 PM, David Singer <singer@apple.com> wrote: 
>>>  
>>>> Let's look at how we use the term and whether we want 
>>>> * deidentified 
>>>> * persistently deidentified 
>>>> * anonymized 
>>>> * noa 
>>>>  
>>>> or something else.  Here is where we use the term right now.
>>>>  
>>>> * * * * 
>>>>  
>>>> 2.10 - definition.  I don't repeat it as that's the section we are  
>>>> trying to write 
>>>>  
>>>> (I note, by the way, that we define it without a hyphen and then  
>>>> uniformly use it with a hyphen, which, for a defined term, is poor  
>>>> form!) 
>>>>  
>>>> 5. Third party compliance 
>>>>  
>>>> [excerpt]
>>>>  
>>>> A third party to a given user action may nevertheless collect and use such data when: 
>>>> ... 
>>>>      * or, the data is de-identified as defined in this recommendation. 
>>>>  
>>>>  
>>>>  
>>>> 5.2.2, part of the general principles for permitted uses 
>>>>  
>>>> After there are no remaining permitted uses for given data, the data must be deleted or de-identified. 
>>>>  
>>>>  
>>>> 8 Unknowing collection 
>>>>  
>>>> If a party learns that it possesses data in violation of this  
>>>> recommendation, it must, where reasonably feasible, delete or  
>>>> de-identify that data at the earliest practical opportunity 
>>>>  
>>>> * * * * 
>>>>  
>>>> In general, I think in all three cases we are saying that if it meets this criterion, the data has passed out of scope and cannot or will not come back into scope (i.e. by re-identification).   
>>>>  
>>>> In which of these could 'grey state' data - data that can be re-identified by someone in the know, e.g. of the secret key - apply?  They may apply importantly in the health domain (you've just realized that an important subset of the data has some treatable but serious disease, for example), but is that really true here? In particular, we are trying, I think, to improve users' privacy by ensuring that the people who could and did observe you are not 'tracking' you at all - yet those are the very same people who would make and hold such a secret key.  It seems to me that there could be lengthy debates here, and we don't need them.
>>>  
>>> I think this is one distinction between the NAI definition on the one hand and Roy's and Vincent's on the other.  NAI envisions that the secret key is maintained (but not used); Roy's and Vincent's (I think) envision that you couldn't reidentify even if you wanted to. 
>>>  
>>>>  
>>>> In none of these cases are we talking about public disclosure as such, in fact; we are saying that the data passes out of our scope, which means we no longer have anything to say about disclosure, retention, use, or anything at all. 
>>>  
>>> Right.  Under the standard, public disclosure of deidentified data is out of scope and not prohibited or limited in any way, unless you want to say that a condition of "deidentification" is a promise by all holders not to reidentify the data, in which case you probably couldn't publicly release the data set (unless you get someone to click on an agreement not to try to reidentify prior to their accessing the data). 
>>>  
>>> That last part is the key question for you - do you still want to require a promise-by-all-not-to-try-to-reidentify as a condition of deidentification, or do you want to support one of the other three options?   
>>  
>> I am now unclear as to what the other three options you're referring to are. Sorry.
>>  
>>> You alternatively have suggested that the releaser bear responsibility for the data in the event it's reidentified, which I think the other options effectively cover - if you represented to the user you weren't going to share tracking data and you accidentally did, I don't think there's a good faith exception to the prohibition on deceptive statements, at least not in the U.S.
>>  
>> OK 
>>  
>> Thinking about it, I rather think that data for which there is a key is not data that cannot be tied to a user (user-agent, device) - it totally can, that's what the key does.  I don't think such data has passed out of scope at all.  Your intent and hope that the key and the data never come together again, or that the key has been lost or destroyed, is just that - an intent or hope. To be out of scope, there should not be a key at all, either explicit, or implicit (e.g. a combination of zip-code + birthday + gender etc. that effectively keys to an individual). 
>>  
>> If we *also* want to write rules about this mid-state data, which Shane eloquently explored, we could do that, but it would be in the context of relaxing restrictions on data that is in our scope but that we intend cannot identify anyone.
>>  
>> So, I think we need to keep the two characteristics of the data for it to be out of scope - it is strongly believed to be impossible to identify, and either the recipients accept and pass on the restriction against trying, or they accept the consequences if someone downstream succeeds.
>>  
>>>  
>>>>  
>>>>  
>>>> On Jul 29, 2014, at 19:11 , Justin Brookman <jbrookman@cdt.org> wrote: 
>>>>  
>>>>>  
>>>>>> Do either of you want to suggest language for the spec to bind  
>>>>>> parties to not try to reidentify? 
>>>>>  
>>>>> The concept appears 3 times in the TCS, and in each place, a requirement to keep it de-identified would seem tricky to write. (Someone is welcome to try).  
>>>>>  
>>>>> Perhaps it would be cleaner to have two definitions:  
>>>>>  
>>>>> * de-identified 
>>>>>  
>>>>> * persistently de-identified 
>>>>>  
>>>>> with the first being a definition of the state (as above), and the second having the data carry the requirement that the originator not attempt to re-identify, and that any sharing with another party, by the originator or anyone receiving the data with this restriction, either pass on the restriction or accept the responsibility if re-identification in fact occurs.
>>>>>  
>>>>> then we can use the one or the other in the document, as appropriate.  
>>>>>  
>>>>> So this sounds like a stricter version of the red-yellow-green discussion from before.  What do you envision requiring regular deidentification, and what would require persistently de-identified (really deidentified + promises/liability)?  Would it be just for sharing?  So there wouldn't need to be an internal promise not to reidentify, but if you release, you either get a promise or take responsibility? 
>>>>>  
>>>>> What would "responsibility" look like?  We can't really create a cause of action with a technical standard. 
>>>>>  
>>>>  
>>>> Perhaps we say that if the data is later re-identified, then the party that thought it had done deidentification was in error, and clause 8 applies (i.e. it has to delete the data or immediately improve the de-identification).
>>>>  
>>>> I think there is value in saying also that the requirement not to re-identify may be passed on. 
>>>>  
>>>>  
>>>> David Singer 
>>>> Manager, Software Standards, Apple Inc. 
>>>>  
>>>  
>>>  
>>  
>> David Singer 
>> Manager, Software Standards, Apple Inc. 
>>  
>>  
>  
>  
 
David Singer 
Manager, Software Standards, Apple Inc.
Received on Friday, 8 August 2014 11:57:32 UTC