Re: Deidentification (ISSUE-188) from Rob van Eijk on 2014-09-10 (public-tracking@w3.org from September 2014)

From: Rob van Eijk <rob@blaeu.com>
Date: Wed, 10 Sep 2014 16:15:48 +0200
To: David Singer <singer@apple.com>
Cc: "public-tracking@w3.org WG" <public-tracking@w3.org>
Message-ID: <5601c563adc46c9d607ed2ddf4deba18@xs4all.nl>
Roy, David, we would like to address two things.

First, could you please confirm that the informative text is added to 
the TCM in combination with the definition?

Second, we are not comfortable with the non-binding nature of the 
informative paragraph. We strongly suggest changing some elements to 
normative requirements. The requirements we feel strongly about are:

a) The organization that performs the de-identification MUST be 
confident that the data can never again identify the human subjects 
whose activity contributed to the data.

b) The organization SHOULD also disclose (e.g. in the privacy policy) 
the process by which de-identification of these records is done.

c) That confidence MAY result from ensuring or demonstrating that it is 
no longer possible to:
- isolate some or all records which correspond to a device or user;
- link two or more records (either from the same database or different 
databases), concerning the same device or user;
- deduce, with significant probability, information about a device or 
user.
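
As a rough illustration of the first criterion only (isolation), here is a minimal Python sketch. It assumes records are plain dictionaries and that a hypothetical list `quasi_identifiers` names the fields an observer could match on; neither the function name nor the sample fields come from the proposal, and passing this check would not by itself demonstrate de-identification.

```python
from collections import Counter

def isolatable_records(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination is unique.

    A record returned here can be singled out from the rest of the
    dataset using only the listed fields (hypothetical field names).
    """
    def key(record):
        return tuple(record.get(field) for field in quasi_identifiers)

    # Count how many records share each combination of field values.
    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]

# Made-up sample data, for illustration only.
records = [
    {"browser": "A", "region": "NL", "page": "/home"},
    {"browser": "A", "region": "NL", "page": "/news"},
    {"browser": "B", "region": "FR", "page": "/home"},
]
singled_out = isolatable_records(records, ["browser", "region"])
```

The other two criteria (linking and inference) need different checks and are not covered by this sketch.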

Regards,
Rob, Vincent


Rob van Eijk schreef op 2014-09-05 13:38:
> Thank you David, Roy for the perseverance shown. David's last effort
> seems to accommodate many, if not all elements needed to get beyond the
> confusion caused by the 'broken promise of anonymisation'. I will
> think this proposal through the coming days.
> 
> Two points I would like to draw your attention to:
> 
> [1] In order to avoid the pitfall of context extraction for
> (re)targeting falling out of scope of DNT
> [http://www.w3.org/2011/tracking-protection/2013-july-decision/] I
> invite all members to take into account a reflection on the general
> requirements for permitted uses [TCS, par 3.3.1]. Having a clear
> understanding of ‘permanently de-identified’ may cause us to revisit
> the general requirements. Perhaps some should be more strict, perhaps
> additional requirements are needed. A new issue may be needed to
> address this concern.
> 
> [2] A separate thought - to avoid any ambiguity between the two terms
> (a) de-identified and (b) 'permanently deidentified' even though the
> first term will not be used anymore under Roy's proposal - is to
> CAPITALIZE all definitions in both the TCS and the TPE. To accomplish
> this outcome, opening an issue may be needed, which of course is up
> to the discretion of the chairs.
> 
> Rob
> 
> 
> David Singer schreef op 2014-09-03 00:40:
>> hi Roy, all
>> 
>> thank you.
>> 
>> I think this is perfectly fine as a definition, and if we want to say
>> more it would be best to have another section that is marked as
>> non-normative that gives some of the advice we previously included.  I
>> think that Roy’s text captures the essence of Jack’s, as well.  For
>> example, Jack saith:
>> 
>>>  the de-identifying entity must not have actual knowledge that the 
>>> remaining information could be used alone or in combination with 
>>> other reasonably available information to identify an individual who 
>>> is subject of the information
>> 
>> Having an advisory section allows us to capture some of Jack’s text,
>> which I fear is too long for a definition but rides well in an advisory
>> section.  I have tried to do that below.
>> 
>> Here is a stab at following Roy’s definition with such an advisory 
>> section.
>> 
>> * * * * * *
>> 
>> Definition:
>> 
>>>   Data is permanently de-identified when there exists a high level
>>>   of confidence that no human subject of the data can be identified,
>>>   directly or indirectly, by that data alone or in combination with
>>>   other retained or available information.
>> 
>> 
>> De-identification background (informative)
>> 
>> In this specification the term ‘permanently de-identified’ is used for
>> data that has passed out of the scope of this specification and can
>> and will never come back into scope. The organization that performs
>> the de-identification needs to be confident that the data can never
>> again identify the human subjects whose activity contributed to the
>> data. That confidence may result from ensuring or demonstrating that
>> it is no longer possible to:
>>  - isolate some or all records which correspond to a device or user;
>>  - link two or more records (either from the same database or
>> different databases), concerning the same device or user;
>>  - deduce, with significant probability, information about a device or 
>> user.
>> 
>> Regardless of the de-identification approach, unique keys can be used
>> to correlate records within the de-identified dataset, provided the
>> keys do not exist outside the de-identified dataset and/or have no
>> meaning outside the de-identified dataset (i.e. no mapping table can
>> exist that links the original identifiers to the keys in the
>> de-identified dataset.)
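
One way to picture the "no mapping table" condition is the minimal Python sketch below. It assumes records are dictionaries carrying a hypothetical `user_id` field; the function name `rekey` and the output field `key` are illustrative and not taken from the text. Random tokens are used rather than a hash of the original identifier, because an unsalted deterministic hash could be recomputed by anyone holding the original identifiers and would then act as a mapping.

```python
import secrets

def rekey(records, id_field="user_id"):
    """Replace original identifiers with random keys that have no
    meaning outside the returned dataset.

    The mapping from original identifier to key exists only inside
    this call; it is never stored or returned.
    """
    mapping = {}
    out = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            # 16 random bytes, rendered as 32 hex characters.
            mapping[original] = secrets.token_hex(16)
        # Copy every field except the original identifier.
        new_record = {k: v for k, v in record.items() if k != id_field}
        new_record["key"] = mapping[original]
        out.append(new_record)
    return out

# Made-up sample data, for illustration only.
rekeyed = rekey([
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u1", "page": "/news"},
    {"user_id": "u2", "page": "/home"},
])
```

Records from the same original identifier still share a key, so they can be correlated within the dataset, but nothing retained afterwards links a key back to its identifier. The remaining fields can of course still identify someone, which is what the other criteria address.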
>> 
>> In the case of records in such data that relate to a single user or a
>> small number of users, usage and/or distribution restrictions are
>> advisable; experience has shown that such records can, in fact,
>> sometimes be used to identify the user(s) despite the technical
>> measures that were taken to prevent re-identification. It is also a
>> good practice to disclose (e.g. in the privacy policy) the process by
>> which de-identification of these records is done, as this can both
>> raise the level of confidence in the process, and allow for
>> feedback on the process.  The restrictions might include, for example:
>>  • Technical safeguards that prohibit re-identification of
>> de-identified data and/or merging of the original tracking data and
>> de-identified data;
>>  • Business processes that specifically prohibit re-identification of
>> de-identified data and/or merging of the original tracking data and
>> de-identified data;
>>  • Business processes that prevent inadvertent release of either the
>> original tracking data or de-identified data;
>>  • Administrative controls that limit access to both the original
>> tracking data and de-identified data.
>> 
>> 
>> On Aug 26, 2014, at 11:57 , Roy T. Fielding <fielding@gbiv.com> wrote:
>> 
>>> I am still in favor of a short definition that makes it very clear what
>>> we want to achieve in terms of limiting the data.  If folks want to place
>>> additional requirements on a party, separate from the definition of the
>>> state we want the data to be in, then I think that should be discussed
>>> and agreed on separately.
>>> 
>>> To that end, I have replaced my proposal with the following:
>>> 
>>>   Data is permanently de-identified when there exists a high level
>>>   of confidence that no human subject of the data can be identified,
>>>   directly or indirectly, by that data alone or in combination with
>>>   other retained or available information.
>>> 
>>> If adopted, we would replace all occurrences of "de-identif(y|ied|ying)"
>>> in TCS and TPE with permanently de-identified.
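
The pattern quoted above can be used to locate the occurrences an editor would need to revise. The following is a minimal Python sketch, assuming the specification text is available as a plain string; the function name `find_terms` is hypothetical. It only reports matches rather than substituting blindly, since "de-identify" and "de-identifying" would need rewording and not a plain textual swap.

```python
import re

# The pattern from the proposal, skipping occurrences that already
# follow "permanently " on the same line.
TERM = re.compile(
    r"(?<!permanently )\bde-identif(?:y|ied|ying)\b",
    re.IGNORECASE,
)

def find_terms(text):
    """Return (line number, matched word) for each occurrence to review."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TERM.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits

# Made-up sample text, for illustration only.
sample = (
    "Data is de-identified.\n"
    "A party may de-identify data.\n"
    "Permanently de-identified data is out of scope."
)
hits = find_terms(sample)
```

An occurrence whose "permanently" falls at the end of the previous line would still be reported, so the output is a review list and not a final answer.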
>>> 
>>> Rationale:
>>> 
>>> I adopted David's "permanently de-identified" to avoid the association
>>> with re-identifiable data and added "combination with other retained ...
>>> information" to exclude holding onto a key for re-identification.
>>> 
>>> I replaced "user" with "human subject of the data", since we also want
>>> to remove data provided by the user that (inadvertently) is about
>>> others (what most statistic-based data trimming does automatically).
>>> However, we don't want to remove data which might be about a human
>>> who is not the subject (e.g., recording the number of distinct visitors
>>> to my blog is data about the visitors, not about me).
>>> 
>>> I use "directly or indirectly" to indicate that this includes anything
>>> that might end up identifying a human subject, no matter how.
>>> If someone thinks we should have specific text about identifiers on
>>> user agents or devices, that can be a non-normative example without
>>> weakening this definition.
>>> 
>>> 
>>> Cheers,
>>> 
>>> Roy T. Fielding                     <http://roy.gbiv.com/>
>>> Senior Principal Scientist, Adobe   <http://www.adobe.com/>
>>> 
>> 
>> David Singer
>> Manager, Software Standards, Apple Inc.
Received on Wednesday, 10 September 2014 14:16:31 UTC