Re: Deidentification (ISSUE-188)

On Sep 8, 2014, at 2:34 , Mike O'Neill <michael.oneill@baycloud.com> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I would prefer there was no mention of personally unique data kept in a retained data set, because it looks sneaky. I don't mind if my data is in a suitably large (one-way) aggregation, but then why would there continue to be unique keys?
> 
> This being said I can live with David's last text combining Roy's definition with the non-normative explanation, if the reference to unique keys was tightened up to make clear the retained data is no longer linkable, i.e. the keys are not capable of being regenerated or derived from other data or subsequent transactions.
> 
> How about (just adding "and cannot be derived" and taking out the "/or"):
> Regardless of the de-identification approach, unique keys can be used to correlate records within the de-identified dataset, provided the keys do not exist and cannot be derived outside the de-identified dataset and have no meaning outside the de-identified dataset (i.e. no mapping table can exist that links the original identifiers to the keys in the de-identified dataset.)

fine by me
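
For illustration only, here is a minimal sketch of one way to get keys with the property in the tightened wording: they correlate records inside the de-identified dataset, but cannot be derived from the original identifiers and leave no mapping table behind. The function name `deidentify` and the field names `user_id` and `key` are hypothetical and not taken from any draft. The sketch draws keys at random rather than hashing the identifier, since a hash could be regenerated from the original data or from subsequent transactions.

```python
import secrets


def deidentify(records, id_field="user_id"):
    """Return copies of records with the original identifier replaced
    by a random key that only has meaning inside the returned list.

    Assumption: each record is a dict and id_field holds the original
    identifier. All names here are illustrative.
    """
    # Temporary lookup so records from the same user share a key.
    # It exists only inside this call and is never stored or returned.
    mapping = {}
    output = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            # Random key: not computed from the identifier, so it cannot
            # be re-derived later from the original data.
            mapping[original] = secrets.token_hex(16)
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["key"] = mapping[original]
        output.append(cleaned)
    return output
```

This covers only the key property. The remaining fields would still have to meet the definition on their own, which is what the advisory text about isolating, linking and deducing is concerned with.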

> 
> Mike
> 
> 
>> -----Original Message-----
>> From: David Singer [mailto:singer@apple.com]
>> Sent: 02 September 2014 23:41
>> To: public-tracking@w3.org WG
>> Subject: Re: Deidentification (ISSUE-188)
>> 
>> hi Roy, all
>> 
>> thank you.
>> 
>> I think this is perfectly fine as a definition, and if we want to say more it would
>> be best to have another section that is marked as non-normative that gives
>> some of the advice we previously included.  I think that Roy’s text captures the
>> essence of Jack’s, as well.  For example, Jack saith:
>> 
>>> the de-identifying entity must not have actual knowledge that the remaining
>>> information could be used alone or in combination with other reasonably
>>> available information to identify an individual who is subject of the information
>> 
>> Having an advisory section allows us to capture some of Jack’s text, which I fear
>> is too long for a definition but rides well in an advisory section.  I have tried to do
>> that below.
>> 
>> Here is a stab at following Roy’s definition with such an advisory section.
>> 
>> * * * * * *
>> 
>> Definition:
>> 
>>>  Data is permanently de-identified when there exists a high level
>>>  of confidence that no human subject of the data can be identified,
>>>  directly or indirectly, by that data alone or in combination with
>>>  other retained or available information.
>> 
>> 
>> De-identification background (informative)
>> 
>> In this specification the term ‘permanently de-identified’ is used for data that
>> has passed out of the scope of this specification and can and will never come
>> back into scope. The organization that performs the de-identification needs to
>> be confident that the data can never again identify the human subjects whose
>> activity contributed to the data. That confidence may result from ensuring or
>> demonstrating that it is no longer possible to:
>> - isolate some or all records which correspond to a device or user;
>> - link two or more records (either from the same database or different
>> databases), concerning the same device or user;
>> - deduce, with significant probability, information about a device or user.
>> 
>> Regardless of the de-identification approach, unique keys can be used to
>> correlate records within the de-identified dataset, provided the keys do not exist
>> outside the de-identified dataset and/or have no meaning outside the de-
>> identified dataset (i.e. no mapping table can exist that links the original
>> identifiers to the keys in the de-identified dataset.)
>> 
>> In the case of records in such data that relate to a single user or a small number
>> of users, usage and/or distribution restrictions are advisable; experience has
>> shown that such records can, in fact, sometimes be used to identify the user(s)
>> despite the technical measures that were taken to prevent re-identification. It is
>> also a good practice to disclose (e.g. in the privacy policy) the process by which
>> de-identification of these records is done, as this can both raise the level of
>> confidence in the process, and allow for feedback on the process.  The
>> restrictions might include, for example:
>> 	• Technical safeguards that prohibit re-identification of de-identified
>> data and/or merging of the original tracking data and de-identified data;
>> 	• Business processes that specifically prohibit re-identification of de-
>> identified data and/or merging of the original tracking data and de-identified
>> data;
>> 	• Business processes that prevent inadvertent release of either the
>> original tracking data or de-identified data;
>> 	• Administrative controls that limit access to both the original tracking
>> data and de-identified data.
>> 
>> 
>> On Aug 26, 2014, at 11:57 , Roy T. Fielding <fielding@gbiv.com> wrote:
>> 
>>> I am still in favor of a short definition that makes it very clear what
>>> we want to achieve in terms of limiting the data.  If folks want to place
>>> additional requirements on a party, separate from the definition of the
>>> state we want the data to be in, then I think that should be discussed
>>> and agreed on separately.
>>> 
>>> To that end, I have replaced my proposal with the following:
>>> 
>>>  Data is permanently de-identified when there exists a high level
>>>  of confidence that no human subject of the data can be identified,
>>>  directly or indirectly, by that data alone or in combination with
>>>  other retained or available information.
>>> 
>>> If adopted, we would replace all occurrences of "de-identif(y|ied|ying)"
>>> in TCS and TPE with permanently de-identified.
>>> 
>>> Rationale:
>>> 
>>> I adopted David's "permanently de-identified" to avoid the association
>>> with re-identifiable data and added "combination with other retained ...
>>> information" to exclude holding onto a key for re-identification.
>>> 
>>> I replaced "user" with "human subject of the data", since we also want
>>> to remove data provided by the user that (inadvertently) is about
>>> others (what most statistic-based data trimming does automatically).
>>> However, we don't want to remove data which might be about a human
>>> who is not the subject (e.g., recording the number of distinct visitors
>>> to my blog is data about the visitors, not about me).
>>> 
>>> I use "directly or indirectly" to indicate that this includes anything
>>> that might end up identifying a human subject, no matter how.
>>> If someone thinks we should have specific text about identifiers on
>>> user agents or devices, that can be a non-normative example without
>>> weakening this definition.
>>> 
>>> 
>>> Cheers,
>>> 
>>> Roy T. Fielding                     <http://roy.gbiv.com/>
>>> Senior Principal Scientist, Adobe   <http://www.adobe.com/>
>>> 
>> 
>> David Singer
>> Manager, Software Standards, Apple Inc.
>> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.13 (MingW32)
> Comment: Using gpg4o v3.3.26.5094 - http://www.gpg4o.com/
> Charset: utf-8
> 
> iQEcBAEBAgAGBQJUDXgyAAoJEHMxUy4uXm2JLbYH/1KGrg1Ym7fYkD8lqwpH2eDe
> bV9A+ThL9nzGuUAL+gvK0kdj8oRX3scCeWMVGBXrZFrkFmKYoMqWRAmkQgddDFEx
> 7iL9kwMz9cHNr0gezo22ljIjls2Ms/KUebvD8ndMgR00p9NdXCxvwVkF8xXvZDGL
> fuY7ZSZOiJacFaNMINe5Yk3x0z/cky8bZgzA4nLO4Oq8erV2TZTDnfkc0dx9Zy6/
> fSbT135ambhkCwEeEK0D4jQAF8cYUDDaQiy1NJoDkJHAmrtk/7dFDfeTrl3aXhsE
> DsGXntjlGG0gKiW5bMmO8TheVG2zl7AKhsOmby6aK1VWjNBIiqdRc/aGsg0xwR0=
> =onrj
> -----END PGP SIGNATURE-----
> 

David Singer
Manager, Software Standards, Apple Inc.

Received on Monday, 8 September 2014 16:34:27 UTC