Re: Deidentification (ISSUE-188)

On Aug 14, 2014, at 16:04 , Rob van Eijk <rob@blaeu.com> wrote:

> 
> If the definition gets adopted, wouldn't it be fair to the user to include text with a normative MUST for a party to provide detailed information about the details of the de-identification process(es) it applies? Transparency should do its work to prevent "de-identification by obscurity".
> 
> Is the group willing to consider such a normative obligation?
> 

On Aug 15, 2014, at 9:25 , Lee Tien <tien@eff.org> wrote:

> EFF agrees: transparency in de-identification methods is very important and is far superior for users to the old-school "expert certification without showing your work" approach.
> 


I can’t answer for the group, but there are a few points to ponder.

It could be a best practice to describe what you do, especially in the case of data sets that have per-user records.  Researchers love to critique those.  (See below).

But, on the other hand, there are myriad ways in which identifiable data gets de-identified.  How far do they have to trace it, and in how many ways?

"We count the number of visitors coming from the major web browsers, as aggregate counts.  Separately, we log the US state, or country, and visit date (but not time) of every visitor.  We keep separate aggregate buckets of the number of visitors we estimate to be aged 0-16 years old, 16-21, 21-30, 31-50, and 50+.  For every visit, we record the date/time that an ad was served, and what ad was served (this is the only database with per-visit records). [[and so on]]"

It sounds as though you are supportive of the text, but want an additional requirement for some kinds (all kinds?) of data.  Can you express what that is?  Perhaps added to the note on per-user datasets?  I give it a try below.


On Aug 15, 2014, at 9:07 , Mike O'Neill <michael.oneill@baycloud.com> wrote:

> As I said, I do not think the old definition of de-identified works for the third-party compliance section (or any statement describing data as out-of-scope of DNT). It assumes that identifying (tracking) data has been collected and some process other than deletion can be applied to it to make it safe.

That is one of the cases, but in general yes, the use of the term is only of interest to us to describe what happened to in-scope data to make it out-of-scope.  We are not interested in data that was never in scope, and we handle data that remains in scope elsewhere.

> I suggested we use a new definition for out-of-scope e.g. anonymous data (mathematically impossible to derive identity from it, or being linked to an individual in a subsequent network interaction), and leaving the definition of the de-identifying process for the permitted use section (data collected unknowingly in error should just be deleted). 

I don’t mind what term we use for it.  We can invent our own new word if we like (‘noa’). It’s the concept we need to nail down.  I suggest a new phrase below.

> I agree your "data does not, and cannot and will not " implies impossibility, and the dreaded "reasonable" has gone which is good. Though the non-normative bit counteracts that somewhat by calling for distribution restrictions (which are not needed if the data "cannot" be re-identified).

You ‘cannot’ both because it is believed impossible and because you are not allowed to try (in some suitable combination).  The note explains that you probably want to be restrictive with datasets that contain per-user records.  The ‘cannot’ reflects both the lack of ability (possibility) and the lack of permission.

> I agree with Rob that a new definition would probably be superfluous given our definition of tracking implying in-scope data as : "..  data regarding a particular user's activity across multiple distinct contexts".
> 
> The problem I have is that with the other-contexts qualification machine discoverability becomes tricky.  This could create a loophole if collected data with a UID is out-of-scope  when the controller promises to wear tunnel-vision glasses.

If it’s possible (by looking up the UID in some dataset) then I don’t think the data is deidentified.  That’s like saying I don’t have a martini because I keep the gin and vermouth separate.
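To make the martini point concrete, here is a minimal sketch (all names and data invented for illustration): a log whose records carry a persistent UID is not de-identified so long as any party keeps a separate table mapping UIDs back to identities, because the join is trivial.

```python
# Hypothetical example: a "de-identified" visit log keyed by UID,
# plus a separately-kept UID directory -- the gin and the vermouth.

visit_log = [
    {"uid": "u42", "ad_served": "ad-117", "date": "2014-08-15"},
    {"uid": "u99", "ad_served": "ad-204", "date": "2014-08-15"},
]

# Kept "separately", but still held by (or available to) the controller.
uid_directory = {"u42": "alice@example.com", "u99": "bob@example.com"}

def reidentify(log, directory):
    """Re-associate each record with a user by looking up its UID."""
    return [
        {**record, "identity": directory[record["uid"]]}
        for record in log
        if record["uid"] in directory
    ]

rejoined = reidentify(visit_log, uid_directory)
print(rejoined[0]["identity"])  # the "de-identified" record names a user
```

The point of the sketch: no technical measure in the log itself matters while the lookup table exists, so the data fails the "cannot be used to identify" test.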


* * * * 

Actually, Mike’s point that it apparently doesn’t correspond to the definition of tracking is well taken. On the face of it, the definition should say that the data can no longer associate the user with another context; but of course you are about to give the data away to another context (or publicly, and hence to all other contexts), and the data is, by virtue of its origins, associated with your context as its origin. The only way to have it not associate the user with any context other than the recipient is to have it not identify the user at all, which is what we have.  Here I re-state, with an attempt to respond to Rob and Lee:

* * * *

Data is permanently de-identified (and hence out of the scope of this specification) when a sufficient combination of technical measures and restrictions ensures that the data does not, and cannot and will not be used to, identify a particular user, user-agent, or device.

Note: In the case of datasets that contain records that relate to a single user or a small number of users:
  a) usage and/or distribution restrictions are strongly recommended; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures taken to prevent that happening;
  b) the de-identification measures should be described in a form that is at least as available as the data (i.e. publicly, if the data itself will be made public).

* * * *

Would people prefer a term like “permanent non-tracking data” for this definition, and not (re-) or (ab-) use the existing term “deidentified”?


David Singer
Manager, Software Standards, Apple Inc.

Received on Friday, 15 August 2014 22:35:49 UTC