Re: Deidentification (ISSUE-188) from David Singer on 2014-09-02 (public-tracking@w3.org from September 2014)

From: David Singer <singer@apple.com>
Date: Tue, 02 Sep 2014 15:40:59 -0700
To: "public-tracking@w3.org WG" <public-tracking@w3.org>
Message-id: <5637B31C-AF51-454B-AC03-50509B3E5F24@apple.com>
hi Roy, all

thank you.

I think this is perfectly fine as a definition, and if we want to say more it would be best to have another section that is marked as non-normative that gives some of the advice we previously included.  I think that Roy’s text captures the essence of Jack’s, as well.  For example, Jack saith:

>  the de-identifying entity must not have actual knowledge that the remaining information could be used alone or in combination with other reasonably available information to identify an individual who is subject of the information

Having an advisory section allows us to capture some of Jack’s text, which I fear is too long for a definition but rides well in an advisory section.  I have tried to do that below.

Here is a stab at following Roy’s definition with such an advisory section.  

* * * * * *

Definition:

>   Data is permanently de-identified when there exists a high level
>   of confidence that no human subject of the data can be identified,
>   directly or indirectly, by that data alone or in combination with
>   other retained or available information.


De-identification background (informative)

In this specification the term ‘permanently de-identified’ is used for data that has passed out of the scope of this specification and cannot, and will never, come back into scope. The organization that performs the de-identification needs to be confident that the data can never again identify the human subjects whose activity contributed to the data. That confidence may result from ensuring or demonstrating that it is no longer possible to:
 - isolate some or all records which correspond to a device or user;
 - link two or more records (either from the same database or different databases), concerning the same device or user;
 - deduce, with significant probability, information about a device or user.
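
As a rough illustration of the first condition only, one way to look for records that can still be isolated is to count how many records share each combination of attributes and flag the combinations held by fewer than some threshold. This is a minimal sketch, not a sufficient test for permanent de-identification; the field names, the threshold k, and the helper name small_groups are all assumptions made for the example.

```python
from collections import Counter

def small_groups(records, attributes, k):
    """Return attribute combinations shared by fewer than k records.

    Any combination returned here points at records that could still be
    singled out by those attributes alone.
    """
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical records; the field names are illustrative only.
records = [
    {"region": "CA", "browser": "Safari", "visits": 12},
    {"region": "CA", "browser": "Safari", "visits": 3},
    {"region": "NY", "browser": "Firefox", "visits": 7},
]

isolated = small_groups(records, ["region", "browser"], k=2)
```

An empty result says nothing about the second and third conditions (linking and deduction), which need separate analysis.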

Regardless of the de-identification approach, unique keys can be used to correlate records within the de-identified dataset, provided the keys do not exist outside the de-identified dataset and/or have no meaning outside the de-identified dataset (i.e., no mapping table can exist that links the original identifiers to the keys in the de-identified dataset).
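
The key replacement described above might be sketched as follows, assuming records are simple dictionaries and the original identifier lives in a single field. The function name rekey and the field names are hypothetical. The point of the sketch is that the keys are random rather than derived from the original identifiers, and that the mapping exists only for the duration of the call.

```python
import secrets

def rekey(records, id_field):
    """Replace each original identifier with a fresh random key.

    The original-to-key mapping is a local variable that is discarded
    when the function returns, so no mapping table is retained.
    """
    mapping = {}
    out = []
    for r in records:
        original = r[id_field]
        if original not in mapping:
            # Random, not derived from the original identifier.
            mapping[original] = secrets.token_hex(16)
        new = {k: v for k, v in r.items() if k != id_field}
        new["key"] = mapping[original]
        out.append(new)
    return out

# Hypothetical input; "cookie_id" and "url" are illustrative field names.
raw = [
    {"cookie_id": "abc", "url": "/a"},
    {"cookie_id": "abc", "url": "/b"},
    {"cookie_id": "xyz", "url": "/a"},
]
deid = rekey(raw, "cookie_id")
```

Records from the same original identifier still share a key inside the output, but the key cannot be recomputed from the identifier. This addresses the keys only; the remaining fields may still identify someone.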

In the case of records in such data that relate to a single user or a small number of users, usage and/or distribution restrictions are advisable; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent re-identification. It is also a good practice to disclose (e.g. in the privacy policy) the process by which de-identification of these records is done, as this can both raise the level of confidence in the process, and allow for feedback on the process.  The restrictions might include, for example:
	• Technical safeguards that prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data;
	• Business processes that specifically prohibit re-identification of de-identified data and/or merging of the original tracking data and de-identified data;
	• Business processes that prevent inadvertent release of either the original tracking data or de-identified data;
	• Administrative controls that limit access to both the original tracking data and de-identified data.


On Aug 26, 2014, at 11:57 , Roy T. Fielding <fielding@gbiv.com> wrote:

> I am still in favor of a short definition that makes it very clear what
> we want to achieve in terms of limiting the data.  If folks want to place
> additional requirements on a party, separate from the definition of the
> state we want the data to be in, then I think that should be discussed
> and agreed on separately.
> 
> To that end, I have replaced my proposal with the following:
> 
>   Data is permanently de-identified when there exists a high level
>   of confidence that no human subject of the data can be identified,
>   directly or indirectly, by that data alone or in combination with
>   other retained or available information.
> 
> If adopted, we would replace all occurrences of "de-identif(y|ied|ying)"
> in TCS and TPE with permanently de-identified.
> 
> Rationale:
> 
> I adopted David's "permanently de-identified" to avoid the association
> with re-identifiable data and added "combination with other retained ...
> information" to exclude holding onto a key for re-identification.
> 
> I replaced "user" with "human subject of the data", since we also want
> to remove data provided by the user that (inadvertently) is about
> others (what most statistic-based data trimming does automatically).
> However, we don't want to remove data which might be about a human
> who is not the subject (e.g., recording the number of distinct visitors
> to my blog is data about the visitors, not about me).
> 
> I use "directly or indirectly" to indicate that this includes anything
> that might end up identifying a human subject, no matter how.
> If someone thinks we should have specific text about identifiers on
> user agents or devices, that can be a non-normative example without
> weakening this definition.
> 
> 
> Cheers,
> 
> Roy T. Fielding                     <http://roy.gbiv.com/>
> Senior Principal Scientist, Adobe   <http://www.adobe.com/>
> 

David Singer
Manager, Software Standards, Apple Inc.
Received on Tuesday, 2 September 2014 22:41:31 UTC