RE: ACTION-412, Naming R/Y/G

Mike,

The goal here is to keep granular data but remove the linkage of the record to a device/user in the real world (and remove the possibility of reverse engineering the original production identifier).  If scrubbed appropriately, these records should find the middle ground between supporting needed reporting and NOT tracking a real person or device.  This of course requires a risk-based accountability model that combines technical, operational, and administrative controls to meet this goal.  Much like security infrastructures and privacy policy promises, there is a degree of trust imparted to the implementer to get it right.  But if they ever don't, then they are appropriately held accountable for this failing.

- Shane

From: Mike O'Neill [mailto:michael.oneill@baycloud.com]
Sent: Saturday, June 22, 2013 9:27 AM
To: 'Peter Swire'; public-tracking@w3.org
Subject: RE: ACTION-412, Naming R/Y/G

Peter.

Converting a bit pattern encoding a persistent identifier to another using a one-way hash or any other one-to-one mapping just generates another persistent identifier. The next time the person/device/browser visits the domain the one-way function is applied again, and tracking continues completely unabated.
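
A minimal sketch of this point, assuming SHA-256 as the one-way function and a made-up identifier value (neither is taken from any real deployment):

```python
import hashlib

def hashed_id(cookie_id: str) -> str:
    """Map an identifier to its SHA-256 hex digest (one-way, but deterministic)."""
    return hashlib.sha256(cookie_id.encode("utf-8")).hexdigest()

# Hypothetical identifier, presented on two separate visits.
visit_1 = hashed_id("user-8f3a2c")
visit_2 = hashed_id("user-8f3a2c")
# A different (also hypothetical) browser.
other = hashed_id("user-0000aa")

# The hashed value still links the two visits together...
same_across_visits = visit_1 == visit_2
# ...and still distinguishes this browser from another one.
distinct_from_others = visit_1 != other
```

Both comparisons come out true: the digest hides the original value but behaves exactly like the identifier it replaced.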

Moreover, if persistent identifiers, before or after applying a one-way hash, are visible to passive examination of a data stream by a third party (perhaps by methods such as same-origin script access or fibre-optic data stream cloning), the individual can be singled out by that third party.

This is a "null" scrubbing method.

If DNT is set, persistent identifiers should not be used except for an accepted permitted use, and then they should exist (duration limited) only for as long as that permitted use needs them.

In my opinion, for a permitted use to be acceptable the duration of any persistent identifiers should be justified and must be measured in no more than hours.

Mike



From: Peter Swire [mailto:peter@peterswire.net]
Sent: 22 June 2013 16:30
To: public-tracking@w3.org<mailto:public-tracking@w3.org> Group WG
Subject: ACTION-412, Naming R/Y/G

If the group decides to use a Red/Yellow/Green approach, one question has been how to describe the three stages.  On the one hand, this may seem trivial because the substance means more than the name.  On the other hand, in my view, the names/descriptions are potentially important for two reasons: (1) they provide intellectual clarity about what goes in each group; and (2) they communicate the categories to a broader audience.

I was part of a briefing that Shane did on Friday on the phone to FTC participants including Ed Felten and Paul Ohm.  The briefing was similar to the approach Shane described at Sunnyvale.  In the move from red to yellow, here were examples of what could be scrubbed:

1.  Unique IDs, to one-way secret hash.
2.  IP address, to geo data.
3.  URL cleanse, remove suspect query string elements.
4.  Side facts, remove link out data that could be used to reverse identify the record.
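
The four examples above can be sketched roughly as follows. This is an illustration under stated assumptions, not the process described in the briefing: the key, the list of suspect query parameters, the record field names, and the use of a network prefix as a stand-in for a real geo lookup are all hypothetical.

```python
import hashlib
import hmac
import ipaddress
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumptions for illustration only: key, parameter list, and field names.
SECRET_KEY = b"example-key-kept-separately-from-the-logs"
SUSPECT_PARAMS = {"email", "uid", "session", "token"}

def scrub_id(raw_id: str) -> str:
    """1. Unique ID to a keyed (secret) one-way hash."""
    return hmac.new(SECRET_KEY, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()

def coarsen_ip(ip: str) -> str:
    """2. Stand-in for IP-to-geo: keep only the /16 (IPv4) or /32 (IPv6) network."""
    addr = ipaddress.ip_address(ip)
    prefix = 16 if addr.version == 4 else 32
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def cleanse_url(url: str) -> str:
    """3. Drop suspect query string elements and any fragment."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SUSPECT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def scrub_record(record: dict) -> dict:
    """4. Copy only the scrubbed fields; side facts such as the referrer are dropped."""
    return {
        "id": scrub_id(record["id"]),
        "region": coarsen_ip(record["ip"]),
        "url": cleanse_url(record["url"]),
    }

red = {
    "id": "user-8f3a2c",
    "ip": "203.0.113.77",
    "url": "https://example.com/news?page=2&email=a%40b.org",
    "referrer": "https://other.example/profile/12345",
}
yellow = scrub_record(red)
```

Here `yellow["region"]` is `203.0.0.0/16` and `yellow["url"]` is `https://example.com/news?page=2`. Note that the keyed hash is still deterministic, so the scrubbed ID remains stable for as long as the same key is in use.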

Here are some ways that I've thought to describe what gets scrubbed, based on this sort of list:

1.  Remove identifiers (name) and what have been called pseudo-identifiers in the deID debates (phone, passwords, etc.).  But I don't think there is a generally accepted way to decide what pseudo-identifiers would be removed.

2.  Earlier, I had suggested "direct" and "indirect" identifiers, but I agree with Ed's objection that these definitions are vague.

3.  I am interested in the idea that going from red to yellow means removing information that is "exogenous" to the system operated by the company.  That is, for names/identifiers/data fields that are used outside of the company, scrub those.  Going to green would remove information that is "endogenous" to the system operated by the company, that is, even those within the company, with access to the system, could no longer reverse engineer the scrubbing.

When I suggested those terms on the call, someone basically said the terms were academic gobbledygook.  The terms are defined here: http://en.wikipedia.org/wiki/Exogenous.  I acknowledge the gobbledygook point, and the word "exogenous" is probably one only an economist could love.  But I welcome comments on whether the idea is correct - data fields that are generated or observable outside of the company are different from those generated within the company's system.

4.  If exogenous/endogenous are correct in theory, but gobbledygook in practice, then I wonder if there are plain language words that get at the same idea.  My best current attempt is that red to yellow means scrubbing fields that are "observable from outside of the company" or "outwardly observable."

So, my suggestion is that red to yellow means scrubbing fields that are "observable from outside of the company" or "outwardly observable."

If this is correct, then the concept of k-anonymity likely remains relevant.  Keeping broad demographic information such as male/female or age group can be in the yellow zone.  However, a left-handed person under five feet with red hair would in most settings be a bucket too small.
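
As a rough illustration of the bucket-size idea, assuming made-up records and attribute names, the smallest group sharing the same attribute values is the "k" in k-anonymity:

```python
from collections import Counter

def smallest_bucket(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same attribute values."""
    buckets = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(buckets.values())

# Hypothetical records; the attribute names are chosen only for this example.
records = [
    {"sex": "F", "age_group": "25-34", "hair": "brown"},
    {"sex": "F", "age_group": "25-34", "hair": "brown"},
    {"sex": "M", "age_group": "35-44", "hair": "black"},
    {"sex": "M", "age_group": "35-44", "hair": "red"},
]

# Broad demographics only: every bucket holds at least two records.
k_broad = smallest_bucket(records, ["sex", "age_group"])

# Adding a rarer attribute shrinks some buckets to a single record.
k_narrow = smallest_bucket(records, ["sex", "age_group", "hair"])
```

With this sample data `k_broad` is 2 and `k_narrow` is 1, which is the "bucket too small" case.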

Clearly, the group has a variety of issues to address if we decide to go with a three-part R/Y/G approach to de-identification.  The limited goal of this post is to try to help with terminology.  Is it useful to say that the yellow zone means scrubbing data that is "observable from outside of the company", except for broad demographic data?

Peter

P.S.  After I wrote the above, I realized that "observable from outside of the company" is similar in meaning to what can be "tracked" by those outside of the company.  So scrubbing those items plausibly reduces tracking, at least by the other companies.


Prof. Peter P. Swire
C. William O'Neill Professor of Law
Ohio State University
240.994.4142
www.peterswire.net<http://www.peterswire.net>

Beginning August 2013:
Nancy J. and Lawrence P. Huang Professor
Law and Ethics Program
Scheller College of Business
Georgia Institute of Technology

Received on Monday, 24 June 2013 03:59:19 UTC