Re: Deidentification (ISSUE-188) from Rob van Eijk on 2014-09-05 (public-tracking@w3.org from September 2014)

From: Rob van Eijk <rob@blaeu.com>
Date: Fri, 05 Sep 2014 13:38:36 +0200
To: David Singer <singer@apple.com>
Cc: "public-tracking@w3.org WG" <public-tracking@w3.org>
Message-ID: <858c04a7ed192c268c272cc05b56430c@xs4all.nl>
Thank you, David and Roy, for the perseverance shown. David's last effort 
seems to accommodate many, if not all, elements needed to get beyond the 
confusion caused by the 'broken promise of anonymisation'. I will think 
this proposal through over the coming days.

Two points I would like to draw your attention to:

[1] To avoid the pitfall of context extraction for (re)targeting 
falling out of scope of DNT 
[http://www.w3.org/2011/tracking-protection/2013-july-decision/], I 
invite all members to reflect on the general requirements for permitted 
uses [TCS, par 3.3.1]. Having a clear understanding of ‘permanently 
de-identified’ may cause us to revisit the general requirements. Perhaps 
some should be more strict, perhaps additional requirements are needed. 
A new issue may be needed to address this concern.

[2] A separate thought - to avoid any ambiguity between the two terms 
(a) 'de-identified' and (b) 'permanently de-identified', even though the 
first term will not be used anymore under Roy's proposal - is to 
CAPITALIZE all definitions in both the TCS and the TPE. To accomplish 
this, opening an issue may be needed, which of course is at the 
discretion of the chairs.

Rob


David Singer wrote on 2014-09-03 00:40:
> hi Roy, all
> 
> thank you.
> 
> I think this is perfectly fine as a definition, and if we want to say
> more it would be best to have another section that is marked as
> non-normative that gives some of the advice we previously included.  I
> think that Roy’s text captures the essence of Jack’s, as well.  For
> example, Jack saith:
> 
>>  the de-identifying entity must not have actual knowledge that the 
>> remaining information could be used alone or in combination with other 
>> reasonably available information to identify an individual who is 
>> subject of the information
> 
> Having an advisory section allows us to capture some of Jack’s text,
> which I fear is too long for a definition but rides well in an advisory
> section.  I have tried to do that below.
> 
> Here is a stab at following Roy’s definition with such an advisory 
> section.
> 
> * * * * * *
> 
> Definition:
> 
>>   Data is permanently de-identified when there exists a high level
>>   of confidence that no human subject of the data can be identified,
>>   directly or indirectly, by that data alone or in combination with
>>   other retained or available information.
> 
> 
> De-identification background (informative)
> 
> In this specification the term ‘permanently de-identified’ is used for
> data that has passed out of the scope of this specification and can
> and will never come back into scope. The organization that performs
> the de-identification needs to be confident that the data can never
> again identify the human subjects whose activity contributed to the
> data. That confidence may result from ensuring or demonstrating that
> it is no longer possible to:
>  - isolate some or all records which correspond to a device or user;
>  - link two or more records (either from the same database or
> different databases), concerning the same device or user;
>  - deduce, with significant probability, information about a device
> or user.
> 
> Regardless of the de-identification approach, unique keys can be used
> to correlate records within the de-identified dataset, provided the
> keys do not exist outside the de-identified dataset and/or have no
> meaning outside the de-identified dataset (i.e. no mapping table can
> exist that links the original identifiers to the keys in the
> de-identified dataset.)
> 
> In the case of records in such data that relate to a single user or a
> small number of users, usage and/or distribution restrictions are
> advisable; experience has shown that such records can, in fact,
> sometimes be used to identify the user(s) despite the technical
> measures that were taken to prevent re-identification. It is also a
> good practice to disclose (e.g. in the privacy policy) the process by
> which de-identification of these records is done, as this can both
> raise the level of confidence in the process, and allow for
> feedback on the process.  The restrictions might include, for example:
>  • Technical safeguards that prohibit re-identification of
> de-identified data and/or merging of the original tracking data and
> de-identified data;
>  • Business processes that specifically prohibit re-identification of
> de-identified data and/or merging of the original tracking data and
> de-identified data;
>  • Business processes that prevent inadvertent release of either the
> original tracking data or de-identified data;
>  • Administrative controls that limit access to both the original
> tracking data and de-identified data.
> 
> 
> On Aug 26, 2014, at 11:57 , Roy T. Fielding <fielding@gbiv.com> wrote:
> 
>> I am still in favor of a short definition that makes it very clear what
>> we want to achieve in terms of limiting the data.  If folks want to place
>> additional requirements on a party, separate from the definition of the
>> state we want the data to be in, then I think that should be discussed
>> and agreed on separately.
>> 
>> To that end, I have replaced my proposal with the following:
>> 
>>   Data is permanently de-identified when there exists a high level
>>   of confidence that no human subject of the data can be identified,
>>   directly or indirectly, by that data alone or in combination with
>>   other retained or available information.
>> 
>> If adopted, we would replace all occurrences of "de-identif(y|ied|ying)"
>> in TCS and TPE with permanently de-identified.
>> 
>> Rationale:
>> 
>> I adopted David's "permanently de-identified" to avoid the association
>> with re-identifiable data and added "combination with other retained ...
>> information" to exclude holding onto a key for re-identification.
>> 
>> I replaced "user" with "human subject of the data", since we also want
>> to remove data provided by the user that (inadvertently) is about
>> others (what most statistic-based data trimming does automatically).
>> However, we don't want to remove data which might be about a human
>> who is not the subject (e.g., recording the number of distinct visitors
>> to my blog is data about the visitors, not about me).
>> 
>> I use "directly or indirectly" to indicate that this includes anything
>> that might end up identifying a human subject, no matter how.
>> If someone thinks we should have specific text about identifiers on
>> user agents or devices, that can be a non-normative example without
>> weakening this definition.
>> 
>> 
>> Cheers,
>> 
>> Roy T. Fielding                     <http://roy.gbiv.com/>
>> Senior Principal Scientist, Adobe   <http://www.adobe.com/>
>> 
> 
> David Singer
> Manager, Software Standards, Apple Inc.
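
The informative background quoted above allows unique keys to correlate
records inside a de-identified dataset, provided no mapping table links
the original identifiers to those keys. A minimal sketch of that one
point, in Python, might look as follows. The function name
deidentify_keys, the field names "user_id" and "key", and the sample
records are all hypothetical and appear nowhere in the proposal. Note
that swapping identifiers for random keys does not by itself make data
permanently de-identified: the remaining fields may still isolate, link,
or reveal information about a user.

```python
import secrets


def deidentify_keys(records, id_field="user_id"):
    """Replace each original identifier with a fresh random key.

    Hypothetical helper for illustration only. The mapping from original
    identifiers to keys lives in a local variable and is discarded when
    the function returns, so no mapping table survives the call.
    """
    mapping = {}  # transient: never persisted, logged, or returned
    cleaned_records = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            # Random key, not derived from the original identifier,
            # so it has no meaning outside the de-identified dataset.
            mapping[original] = secrets.token_hex(16)
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["key"] = mapping[original]
        cleaned_records.append(cleaned)
    return cleaned_records


# Hypothetical sample data.
sample = [
    {"user_id": "alice", "page": "/a"},
    {"user_id": "bob", "page": "/b"},
    {"user_id": "alice", "page": "/c"},
]
result = deidentify_keys(sample)
```

Records that shared an original identifier share a key in the result, so
they can still be correlated within the dataset, while the original
identifier is gone. Running the function again yields different keys, so
two separately de-identified datasets cannot be joined on them.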
Received on Friday, 5 September 2014 11:39:10 UTC