Re: Preference for no change on deidentification language from David Singer on 2013-06-25 (public-tracking@w3.org from June 2013)

From: David Singer <singer@apple.com>
Date: Tue, 25 Jun 2013 14:29:11 -0700
To: Justin Brookman <jbrookman@cdt.org>
Cc: Shane Wiley <wileys@yahoo-inc.com>, "public-tracking@w3.org Group WG" <public-tracking@w3.org>
Message-id: <D0BD0C24-1A5D-4182-A87B-B05426F9F4F9@apple.com>
On Jun 25, 2013, at 13:27 , Justin Brookman <jbrookman@cdt.org> wrote:

> My apologies if I misunderstood the proposal.  As I have suggested before, I would be fine with an approach that requires that the current permitted uses --- or some of those permitted uses, like financial logging --- be done only with yellow band (or green band, I guess) data.  Indeed, I think it might be a good idea to require that *all* permitted uses be done with data that cannot be attributed to any particular individual (pseudonymous and/or exogenous data) when DNT:1 is turned on.

But we already have the requirement that for any permitted use, you retain only the data you need, and only as long as you need it.  So if you could have accomplished the use with yellow data, but you kept raw data, you're not in compliance.

I think specific methods and models of de-id are interesting -- and we probably should talk about them -- but I prefer that the spec. focus on the outcome.

> If it would be helpful to put that idea into a CHANGE proposal, I will do so, but that's not how I've heard three-state discussed.  I would object to allowing for research and cross-site analytics within a yellow range that could allow the data to be correlated back to a user or device.
> 
> I think that a clean test that mirrors the FTC and DAA guidance is preferable to anything-goes-within-yellow-band.  It also gives companies the leeway to construct their systems in a variety of ways, but they'll have the obligation of defending those practices as meeting the plain language of the standard.
> 
> On Jun 25, 2013, at 4:08 PM, Shane Wiley <wileys@yahoo-inc.com> wrote:
> 
>> Justin,
>>  
>> If all Permitted Uses are employable on raw data (as the current June draft allows), how is restricting several of the Permitted Uses to a middle state (de-identified but linkable) a “less privacy-protective practice”?  The alternative seems to suggest otherwise.
>>  
>> - Shane
>>  
>> From: Justin Brookman [mailto:jbrookman@cdt.org] 
>> Sent: Tuesday, June 25, 2013 1:02 PM
>> To: Peter Swire
>> Cc: public-tracking@w3.org Group WG
>> Subject: Preference for no change on deidentification language
>>  
>> I would like to register my preference for the current June draft text on deidentification over the three-state model described below.  I've come to the conclusion that the three-state model doesn't solve our fundamental disconnect and has the added problems of both adding inappropriately prescriptive text and potentially providing greater cover for less privacy-protective practices.
>>  
>> On Jun 22, 2013, at 11:29 AM, Peter Swire <peter@peterswire.net> wrote:
>> 
>> 
>> If the group decides to use a Red/Yellow/Green approach, one question has been how to describe the three stages.  On the one hand, this may seem trivial because the substance means more than the name.  On the other hand, in my view, the names/descriptions are potentially important for two reasons: (1) they provide intellectual clarity about what goes in each group; and (2) they communicate the categories to a broader audience.
>>  
>> I was part of a briefing that Shane did on Friday on the phone to FTC participants including Ed Felten and Paul Ohm.  The briefing was similar to the approach Shane described at Sunnyvale.  In the move from red to yellow, here were examples of what could be scrubbed:
>>  
>> 1.  Unique IDs, to one-way secret hash.
>> 2.  IP address, to geo data.
>> 3.  URL cleanse, remove suspect query string elements.
>> 4.  Side facts, remove link out data that could be used to reverse identify the record.
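The four red-to-yellow scrubbing steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any participant's system actually works: the field names (`cookie_id`, `ip`, `url`, `referrer`, `outbound_link`), the secret key, the list of suspect query parameters, and the tiny IP-to-region table are all hypothetical. A keyed HMAC stands in for the "one-way secret hash".

```python
import hashlib
import hmac
import ipaddress
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Everything below is hypothetical and chosen for illustration only.
SECRET_KEY = b"example-secret-kept-outside-the-log-store"
SUSPECT_PARAMS = {"email", "user", "uid", "name", "token"}
SIDE_FACT_FIELDS = {"referrer", "outbound_link"}
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "region-A",
    ipaddress.ip_network("198.51.100.0/24"): "region-B",
}


def hash_id(unique_id: str) -> str:
    """Step 1: replace a unique ID with a keyed one-way hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, unique_id.encode("utf-8"), hashlib.sha256).hexdigest()


def ip_to_geo(ip: str) -> str:
    """Step 2: replace an IP address with coarse geo data from a lookup table."""
    addr = ipaddress.ip_address(ip)
    for network, region in GEO_TABLE.items():
        if addr in network:
            return region
    return "unknown"


def cleanse_url(url: str) -> str:
    """Step 3: drop suspect query-string elements (and the fragment)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SUSPECT_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def red_to_yellow(record: dict) -> dict:
    """Apply all four steps to one raw ("red") log record."""
    # Step 4: remove side facts such as link-out data.
    yellow = {k: v for k, v in record.items() if k not in SIDE_FACT_FIELDS}
    yellow["id"] = hash_id(yellow.pop("cookie_id"))
    yellow["geo"] = ip_to_geo(yellow.pop("ip"))
    yellow["url"] = cleanse_url(yellow["url"])
    return yellow
```

In this sketch, anyone holding `SECRET_KEY` can still recompute the hash for a known ID, which is the sense in which yellow data stays linkable inside the company while being harder to correlate from outside.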
>>  
>> Here are some ways that I’ve thought to describe what gets scrubbed, based on this sort of list:
>>  
>> 1.  Remove identifiers (name) and what have been called pseudo-identifiers in the deID debates (phone, passwords, etc.).  But I don’t think there is a generally accepted way to decide what pseudo-identifiers would be removed.
>>  
>> 2.  Earlier, I had suggested “direct” and “indirect” identifiers, but I agree with Ed’s objection that these definitions are vague.
>>  
>> 3.  I am interested in the idea that going from red to yellow means removing information that is “exogenous” to the system operated by the company.  That is, for names/identifiers/data fields that are used outside of the company, scrub those.  Going to green would remove information that is “endogenous” to the system operated by the company, that is, even those within the company, with access to the system, could no longer reverse engineer the scrubbing.
>>  
>> When I suggested those terms on the call, someone basically said the terms were academic gobbledygook.  The terms are defined here: http://en.wikipedia.org/wiki/Exogenous.  I acknowledge the gobbledygook point, and the word “exogenous” is probably one only an economist could love.  But I welcome comments on whether the idea is correct – data fields that are generated or observable outside of the company are different from those generated within the company’s system.
>>  
>> 4.  If exogenous/endogenous are correct in theory, but gobbledygook in practice, then I wonder if there are plain language words that get at the same idea.  My best current attempt is that red to yellow means scrubbing fields that are “observable from outside of the company” or “outwardly observable.”
>>  
>> So, my suggestion is that red to yellow means scrubbing fields that are “observable from outside of the company” or “outwardly observable.”
>>  
>> If this is correct, then the concept of k-anonymity likely remains relevant.  Keeping broad demographic information such as male/female or age group can be in the yellow zone.  However, a left-handed person under five feet with red hair would in most settings be a bucket too small.
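The "bucket too small" idea above can be checked mechanically. The following is a minimal sketch of a k-anonymity check, assuming records are plain dictionaries; the function name `small_buckets` and the attribute names in the test data are hypothetical. It groups records by their combination of retained attributes and reports every combination shared by fewer than k records.

```python
from collections import Counter


def small_buckets(records, quasi_identifiers, k):
    """Return the attribute combinations shared by fewer than k records.

    An empty result means the data is k-anonymous with respect to the
    chosen quasi-identifiers.
    """
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return {bucket: n for bucket, n in counts.items() if n < k}
```

Under this check, broad attributes such as gender and age group tend to produce large buckets, while adding narrow attributes (handedness, height, hair color) quickly produces buckets of size one.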
>>  
>> Clearly, the group has a variety of issues to address if we decide to go with a three-part R/Y/G approach to de-identification.  The limited goal of this post is to try to help with terminology.  Is it useful to say that the yellow zone means scrubbing data that is “observable from outside of the company”, except for broad demographic data?
>>  
>> Peter
>>  
>> P.S.  After I wrote the above, I realized that "observable from outside of the company" is similar in meaning to what can be "tracked" by those outside of the company.  So scrubbing those items plausibly reduces tracking, at least by the other companies.
>>  
>>  
>> Prof. Peter P. Swire
>> C. William O'Neill Professor of Law
>>                 Ohio State University
>> 240.994.4142
>> www.peterswire.net
>>  
>> Beginning August 2013:
>> Nancy J. and Lawrence P. Huang Professor
>> Law and Ethics Program
>> Scheller College of Business
>> Georgia Institute of Technology
>>  
>>  
>>  
> 

David Singer
Multimedia and Software Standards, Apple Inc.
Received on Tuesday, 25 June 2013 21:29:42 UTC