RE: Deidentification (ISSUE-188) from TOUBIANA Vincent on 2014-08-21 (public-tracking@w3.org from August 2014)

From: TOUBIANA Vincent <vtoubiana@cnil.fr>
Date: Thu, 21 Aug 2014 12:36:19 +0200
To: "David Singer" <singer@apple.com>, "Roy T. Fielding" <fielding@gbiv.com>
Cc: "Justin Brookman" <jbrookman@cdt.org>, <public-tracking@w3.org>
Message-ID: <01A1856C4999FF4287CCB37912A708EB075D7D32@srv-cnilexc.cnil.fr>
I think adding a requirement to publicly disclose the anonymization process is a good idea because it would help justify that "the data within it cannot be used to infer information about, or otherwise be linked to, a particular user", and it does not have to refer to "a reasonable level of confidence".

Vincent

-----Original Message-----
From: David Singer [mailto:singer@apple.com] 
Sent: Thursday, 21 August 2014 01:40
To: Roy T. Fielding
Cc: Justin Brookman; public-tracking@w3.org WG; TOUBIANA Vincent
Subject: Re: Deidentification (ISSUE-188)


On Aug 20, 2014, at 16:25 , Roy T. Fielding <fielding@gbiv.com> wrote:

> On Aug 20, 2014, at 3:30 PM, Justin Brookman wrote:
>> On Aug 18, 2014, at 12:12 PM, David Singer <singer@apple.com> wrote:
>>> On Aug 17, 2014, at 6:35 , Mike O'Neill <michael.oneill@baycloud.com> wrote:
>>> 
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>> 
>>>>>> b) the deidentification measures should be described in a form that is at least as available as the data (i.e. publicly, if the data itself will be made public).
>>>> 
>>>> Why not publicly in every case? If someone collects DNT data and intends to share it privately amongst their friends, we should know how they shred the PII out of it.
>>> 
>>> OK.  The term {permanently deidentified} below is a candidate for being replaced by a new name of our choosing (e.g. "permanent non-tracking data"), here and where it is used.  How is this?  I made the second clause not a note, as it contains 'should' and 'strongly recommended' i.e. it is not merely informative.
>>> 
>>> * * * * *
>>> 
>>> Data is {permanently de-identified} (and hence out of the scope of this specification) when a sufficient combination of technical measures and restrictions ensures that the data does not, and cannot and will not be used to, identify a particular user, user-agent, or device.
>>> 
>>> In the case of datasets that contain records that relate to a single user or a small number of users:
>>> a) Usage and/or distribution restrictions are strongly recommended; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent that happening.
>>> b) the deidentification measures should be described publicly (e.g. in the privacy policy).
>> 
>> OK, we agreed to present this as an option at the Call for Objection today.  I was not following IRC that closely today, but Nick indicated that Roy or Vincent may be satisfied with this option as well, or might be willing to withdraw their proposals as well.  Roy, Vincent, let me know what you want to do, and we can proceed to a CfO on this issue.
> 
> I don't know what to make of that.  Behavioral requirements do not 
> belong in definitions unless they have the effect of partitioning a 
> set of subjects as being in or out of the definition.  A valid 
> definition cannot have any false negatives, which is what you get when 
> data is de-identified but the behavioral requirements are not met.
> 
> I do not believe that industry will describe their de-identification 
> measures publicly; certainly not in a privacy policy.  There are just 
> too many ways that a legal document like the privacy policy can get 
> out of sync, since it requires a great deal of corporate review.  What 
> the policy does is define the black box requirements, and then the 
> technical folks are instructed to adhere to those requirements at a 
> minimum.  The actual technical procedures implemented in practice are 
> often more privacy-preserving than what is publicly declared in a 
> policy and vary depending on which application is being discussed.
> 
> Furthermore, we are not talking about public data.  The fact that many 
> lawyers would prefer to have more transparency into corporate business 
> practices is hardly a justification for additional requirements.
> What matters is the end result, not how a company might get there.
> 
> Regardless, nothing in the spec prevents companies from describing 
> their de-identification measures in a privacy policy. If there is 
> value for them to do so (as there is for EFF), then that value should 
> be justification enough without further imposition by this WG.
> If legislative or regulatory bodies want to impose that kind of 
> obligation, they have the power to do so (usually subject to more 
> responsible oversight and public feedback than a W3C spec).

Thanks Roy

so, I think we should split out the (b) clause and make it a separate question on the consensus.  I agree, it doesn't make a difference to the quality of the {anonymization} to describe what you did - it just enables people to critique it, which might indirectly improve it, but it is pretty circuitous.  I did not put it in my original definition, but on request...

> 
> In general, I would prefer to switch to "anonymized" (and use a strict 
> definition of that) or return to using "unlinkable" (also with a 
> strict definition), rather than pollute the spec with behavioral 
> requirements inside the definition of terms.


As I understand it, we have at least the following candidates for what the term is:

de-identified - seriously confuses with the HIPAA standard <http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html#standard>
unlinkable - unfortunately, the world-wide-web is built using hyperlinks represented by URLs, only we're not talking of those kind of links
anonymized - used in HIPAA (k-anonymity and so on), an existing "term of art" in the field, so a little dangerous
out of scope - really much too vague; plenty of stuff is out of our scope, we're talking about specific steps to move data that is in scope, out of scope
de-associated - hah!  I don't get any useful hits in Google!  would this work for us?

other ideas?


David Singer
Manager, Software Standards, Apple Inc.
Received on Thursday, 21 August 2014 10:37:10 UTC