Re: Deidentification (ISSUE-188) from Nicholas Doty on 2014-08-21 (public-tracking@w3.org from August 2014)

From: Nicholas Doty <npdoty@w3.org>
Date: Wed, 20 Aug 2014 19:09:10 -0700
To: David Singer <singer@apple.com>
Cc: "Roy T. Fielding" <fielding@gbiv.com>, Justin Brookman <jbrookman@cdt.org>, "public-tracking@w3.org WG" <public-tracking@w3.org>, TOUBIANA Vincent <vtoubiana@cnil.fr>
Message-Id: <39371F84-CC6C-4387-8F22-DBC84C291BBA@w3.org>
On August 20, 2014, at 4:40 PM, David Singer <singer@apple.com> wrote:
> On Aug 20, 2014, at 16:25 , Roy T. Fielding <fielding@gbiv.com> wrote:
>> On Aug 20, 2014, at 3:30 PM, Justin Brookman wrote:
>>> On Aug 18, 2014, at 12:12 PM, David Singer <singer@apple.com> wrote:
>>>> On Aug 17, 2014, at 6:35 , Mike O'Neill <michael.oneill@baycloud.com> wrote:
>>>> 
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>> 
>>>>>>> b) the deidentification measures should be described in a form that is at least as available as the data (i.e. publicly, if the data itself will be made public).
>>>>> 
>>>>> Why not publicly in every case? If someone collects DNT data and intends to share it privately amongst their friends, we should know how they shred the PII out of it.
>>>> 
>>>> OK.  The term {permanently deidentified} below is a candidate for being replaced by a new name of our choosing (e.g. “permanent non-tracking data”), here and where it is used.  How is this?  I made the second clause not a note, as it contains ‘should’ and ‘strongly recommended’ i.e. it is not merely informative.
>>>> 
>>>> * * * * *
>>>> 
>>>> Data is {permanently de-identified} (and hence out of the scope of this specification) when a sufficient combination of technical measures and restrictions ensures that the data does not, and cannot and will not be used to, identify a particular user, user-agent, or device.
>>>> 
>>>> In the case of datasets that contain records that relate to a single user or a small number of users:
>>>> a) Usage and/or distribution restrictions are strongly recommended; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent that happening.
>>>> b) the deidentification measures should be described publicly (e.g. in the privacy policy).
>>> 
>>> OK, we agreed to present this as an option at the Call for Objection today.  I was not following IRC that closely today, but Nick indicated that Roy or Vincent may be satisfied with this option as well, or might be willing to withdraw their proposals as well.  Roy, Vincent, let me know what you want to do, and we can proceed to a CfO on this issue.
>> 
>> I don't know what to make of that.  

I apologize if I misread your openness in IRC, Roy.

>> Behavioral requirements do not belong
>> in definitions unless they have the effect of partitioning a set of subjects
>> as being in or out of the definition.  A valid definition cannot have
>> any false negatives, which is what you get when data is de-identified but
>> the behavioral requirements are not met.
>> 
>> I do not believe that industry will describe their de-identification
>> measures publicly; certainly not in a privacy policy.  There are just too
>> many ways that a legal document like the privacy policy can get out of sync,
>> since it requires a great deal of corporate review.  What the policy does
>> is define the black box requirements, and then the technical folks are
>> instructed to adhere to those requirements at a minimum.  The actual
>> technical procedures implemented in practice are often more
>> privacy-preserving than what is publicly declared in a policy and
>> vary depending on which application is being discussed.
>> 
>> Furthermore, we are not talking about public data.  The fact that many
>> lawyers would prefer to have more transparency into corporate business
>> practices is hardly a justification for additional requirements.
>> What matters is the end result, not how a company might get there.
>> 
>> Regardless, nothing in the spec prevents companies from describing
>> their de-identification measures in a privacy policy. If there is
>> value for them to do so (as there is for EFF), then that value should
>> be justification enough without further imposition by this WG.
>> If legislative or regulatory bodies want to impose that kind of obligation,
>> they have the power to do so (usually subject to more responsible
>> oversight and public feedback than a W3C spec).
> 
> Thanks Roy
> 
> so, I think we should split out the (b) clause and make it a separate question on the consensus.  I agree, it doesn’t make a difference to the quality of the {anonymization} to describe what you did — it just enables people to critique it, which might indirectly improve it, but it is pretty circuitous.  I did not put it in my original definition, but on request...

It might make more sense (editorially and in terms of helping implementers read and apply the document) to move any requirement for transparency about the deidentification process into a server compliance requirement, one that applies when a server complies with a user's preference by deidentifying the relevant data. And yes, it might be an orthogonal question for the group. As I understand Roy's email, he objects to the requirement whether in the definition or elsewhere.

>> In general, I would prefer to switch to "anonymized" (and use a
>> strict definition of that) or return to using "unlinkable" (also
>> with a strict definition), rather than pollute the spec with behavioral
>> requirements inside the definition of terms.
> 
> As I understand it, we have at least the following candidates for what the term is:
> 
> de-identified — seriously confuses with the HIPAA standard <http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html#standard>
> unlinkable — unfortunately, the world-wide-web is built using hyperlinks represented by URLs, only we’re not talking of those kind of links
> anonymized — used in HIPAA (k-anonymity and so on), an existing “term of art” in the field, so a little dangerous
> out of scope — really much too vague; plenty of stuff is out of our scope, we’re talking about specific steps to move data that is in scope, out of scope
> de-associated — hah!  I don’t get any useful hits in Google!  would this work for us?

Regarding naming (which we seem to accept is distinct from the exact definition of processing data to remove connections with users such that the data is out of scope of the requirements of complying with a user's expressed tracking preference):

I don't particularly see the problems with "deidentified". The 2012 FTC report uses the term in a similar way (as part of its "reasonably linkable" scoping). HIPAA's use of the term has particular standards (safe harbor and expert review) for their definition, but it's also used for a very similar purpose: not just public disclosure but also processing data into a less sensitive form (but still with individual records) for secondary uses such as research, which is much how we've been thinking of it here. Jack has proposed a definition for the term in question that would follow the HIPAA structure more closely; I've heard some concerns about the detail or prescriptiveness of that proposal, but not that it's apparently a definition of a completely different term. I see that the DAA also uses "de-identification" in their principles, citing the FTC and COPPA.

I mentioned on the call today that there is a concern about "anonymous" because it's used so inconsistently: for some people "anonymous" means "pseudonymous" or "doesn't have a real name attached", while others think it means "aggregated or otherwise not revealing information about any individual". That was also a topic of discussion at our meeting in February 2013, when several breakout groups chose to focus on the "deidentified" term rather than, for example, "anonymous"/"anonymized", which had been a specific conversation in my breakout group and perhaps others.
	http://www.w3.org/2013/02/12-dnt-minutes#item01
Indeed, I thought the question had been closed in Spring 2013. I believe all drafts since then have used "deidentified", as did all the change proposals in June and October of last year.

All that being said, we could add an adverb to "deidentified" if we thought there was likely to be ambiguity and then use the phrase in all cases. I suggested "sufficiently deidentified" as a defined phrase that we can refer back to later in the document, implying specifically that it's sufficient for the purposes of this recommendation. (David suggested "permanently deidentified"; or similarly we could say "persistently deidentified".)

Thanks,
Nick
Received on Thursday, 21 August 2014 02:09:54 UTC