Re: Deidentification (ISSUE-188) from David Singer on 2014-08-20 (public-tracking@w3.org from August 2014)

From: David Singer <singer@apple.com>
Date: Wed, 20 Aug 2014 16:40:16 -0700
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: Justin Brookman <jbrookman@cdt.org>, "public-tracking@w3.org WG" <public-tracking@w3.org>, TOUBIANA Vincent <vtoubiana@cnil.fr>
Message-id: <9F03D962-9EB9-4205-8545-4B2429AE4D6D@apple.com>

On Aug 20, 2014, at 16:25 , Roy T. Fielding <fielding@gbiv.com> wrote:

> On Aug 20, 2014, at 3:30 PM, Justin Brookman wrote:
>> On Aug 18, 2014, at 12:12 PM, David Singer <singer@apple.com> wrote:
>>> On Aug 17, 2014, at 6:35 , Mike O'Neill <michael.oneill@baycloud.com> wrote:
>>> 
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>> 
>>>>>> b) the deidentification measures should be described in a form that is at least as available as the data (i.e. publicly, if the data itself will be made public).
>>>> 
>>>> Why not publicly in every case? If someone collects DNT data and intends to share it privately amongst their friends, we should know how they shred the PII out of it.
>>> 
>>> OK.  The term {permanently deidentified} below is a candidate for being replaced by a new name of our choosing (e.g. “permanent non-tracking data”), here and where it is used.  How is this?  I made the second clause not a note, as it contains ‘should’ and ‘strongly recommended’ i.e. it is not merely informative.
>>> 
>>> * * * * *
>>> 
>>> Data is {permanently de-identified} (and hence out of the scope of this specification) when a sufficient combination of technical measures and restrictions ensures that the data does not, and cannot and will not be used to, identify a particular user, user-agent, or device.
>>> 
>>> In the case of datasets that contain records that relate to a single user or a small number of users:
>>> a) Usage and/or distribution restrictions are strongly recommended; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent that happening.
>>> b) the deidentification measures should be described publicly (e.g. in the privacy policy).
>> 
>> OK, we agreed to present this as an option at the Call for Objection today.  I was not following IRC that closely, but Nick indicated that Roy or Vincent may be satisfied with this option as well, or might be willing to withdraw their proposals.  Roy, Vincent, let me know what you want to do, and we can proceed to a CfO on this issue.
> 
> I don't know what to make of that.  Behavioral requirements do not belong
> in definitions unless they have the effect of partitioning a set of subjects
> as being in or out of the definition.  A valid definition cannot have
> any false negatives, which is what you get when data is de-identified but
> the behavioral requirements are not met.
> 
> I do not believe that industry will describe their de-identification
> measures publicly; certainly not in a privacy policy.  There are just too
> many ways that a legal document like the privacy policy can get out of sync,
> since it requires a great deal of corporate review.  What the policy does
> is define the black box requirements, and then the technical folks are
> instructed to adhere to those requirements at a minimum.  The actual
> technical procedures implemented in practice are often more
> privacy-preserving than what is publicly declared in a policy and
> vary depending on which application is being discussed.
> 
> Furthermore, we are not talking about public data.  The fact that many
> lawyers would prefer to have more transparency into corporate business
> practices is hardly a justification for additional requirements.
> What matters is the end result, not how a company might get there.
> 
> Regardless, nothing in the spec prevents companies from describing
> their de-identification measures in a privacy policy. If there is
> value for them to do so (as there is for EFF), then that value should
> be justification enough without further imposition by this WG.
> If legislative or regulatory bodies want to impose that kind of obligation,
> they have the power to do so (usually subject to more responsible
> oversight and public feedback than a W3C spec).

Thanks Roy

So, I think we should split out the (b) clause and make it a separate consensus question.  I agree that describing what you did makes no difference to the quality of the {anonymization}; it just enables people to critique it, which might indirectly improve it, but that route is pretty circuitous.  I did not put it in my original definition, but added it on request...

> 
> In general, I would prefer to switch to "anonymized" (and use a
> strict definition of that) or return to using "unlinkable" (also
> with a strict definition), rather than pollute the spec with behavioral
> requirements inside the definition of terms.


As I understand it, we have at least the following candidates for the term:

de-identified: seriously confusable with the HIPAA standard <http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html#standard>
unlinkable: unfortunately, the World Wide Web is built using hyperlinks represented by URLs, and that is not the kind of link we are talking about
anonymized: used in HIPAA (k-anonymity and so on), an existing “term of art” in the field, so a little dangerous
out of scope: really much too vague; plenty of stuff is out of our scope, and we’re talking about specific steps to move in-scope data out of scope
de-associated: hah!  I don’t get any useful hits in Google!  Would this work for us?

other ideas?


David Singer
Manager, Software Standards, Apple Inc.
Received on Wednesday, 20 August 2014 23:40:52 UTC