- From: Dan Auerbach <dan@eff.org>
- Date: Sun, 23 Jun 2013 22:50:54 -0700
- To: public-tracking@w3.org
- Message-ID: <51C7DE3E.1050509@eff.org>
Hi Peter,

I want to highlight that we still have a lot of work to do if we are to come to agreement on the de-identification issue. I think by moving from a 2-state to a 3-state de-identification process, we shifted where the disagreement lies, but we didn't really resolve it. I am open to a 3-state process, but a 2-state process is far simpler, so I would suggest that if we are to make a last effort at reaching agreement, we return to the 2-state model and tackle the de-identification question head on. I will re-propose language for this for the June draft.

Regarding the 3-state proposal, and setting aside nomenclature, we lack a clear definition of the "yellow" state, both in terms of the standard data must meet before it is "yellow" and the extent to which "yellow" data is granted special privileges in how it can be used and passed around. The more that yellow data is considered out of scope for DNT -- OK to keep for long stretches, OK to share with others -- the more rigorous we must be in ensuring that it is truly de-identified and unlinkable.

With respect to attempts to define the yellow data, I have a few comments.

First, de-identification is commonly understood to be a property of data. Adding things like "operational controls" into the mix muddies the waters, since we are then talking not about the data itself but about its administration. If we go down this more complicated road, we have to be crystal clear about what such controls look like, and explicit that the data itself remains in a state with significant re-identification risk.

Next, I want very much to re-emphasize what Mike has said: hashing a pseudonym to create another pseudonym is a null change. In the case of cookies, most attackers will not have access to the raw cookie strings of other users, so going from 1234 to ABCD is no improvement at all for privacy.
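Mike's point can be shown in a few lines. A deterministic one-way hash maps equal inputs to equal outputs, so every record that was linkable under the old pseudonym is still linkable under the new one. A minimal sketch with hypothetical cookie values and URLs (not from any actual data set):

```python
import hashlib

def pseudonymize(cookie_id: str) -> str:
    # One-way hash of a pseudonym: yields a new, but equally stable, pseudonym.
    return hashlib.sha256(cookie_id.encode()).hexdigest()

# Hypothetical log of page views keyed by raw cookie ID.
log = [("1234", "/news"), ("1234", "/health"), ("5678", "/sports")]

# "Scrubbed" log: every cookie ID replaced by its hash.
scrubbed = [(pseudonymize(cookie), url) for cookie, url in log]

# The same records still cluster under the same (new) identifier,
# so the browsing profile of user "1234" is fully intact.
assert scrubbed[0][0] == scrubbed[1][0]   # same user, still linkable
assert scrubbed[0][0] != scrubbed[2][0]   # distinct users, still distinct
```

The transformation only matters to an attacker who already holds the raw cookie string, which, as noted above, most attackers will not.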
Other suggested data manipulation strategies, like discarding IP addresses in favor of geolocations, do help reduce the privacy risk of a data set. But merely suggesting a few ad hoc techniques does not amount to a standard, and without much more clarity about what the exact proposals entail, I don't think we can evaluate the extent to which the privacy risk is reduced.

Finally, to address the idea of exogenous vs. endogenous: perhaps this could be a promising direction, but I'll confess I'm not sure what you have in mind. For example, you write that "data fields that are generated or observable outside of the company" would be exogenous, but wouldn't this include all user data, which is generated outside of the company? If an IP address comes in but is reduced to a coarse geolocation, is that exogenous or endogenous?

Instead of trying to invent the conceptual tools to draw these lines, I think it's wise to borrow from the rich academic literature focused on these exact questions. Since examining the privacy risks of large data sets is a fairly new and rapidly evolving field, I think we shouldn't tie ourselves to any one particular technique, but this is still the right place to look for meaningful technical distinctions. That's why I favor normative language stressing that data must carry no reasonable re-identification (or attribute disclosure) risk, together with non-normative examples that borrow from the literature and use rigorous frameworks like k-anonymity.

Dan

On 06/22/2013 08:29 AM, Peter Swire wrote:
> If the group decides to use a Red/Yellow/Green approach, one question
> has been how to describe the three stages. On the one hand, this may
> seem trivial because the substance matters more than the name. On the
> other hand, in my view, the names/descriptions are potentially
> important for two reasons: (1) they provide intellectual clarity about
> what goes in each group; and (2) they communicate the categories to a
> broader audience.
> I was part of a briefing that Shane did on Friday on the phone to FTC
> participants including Ed Felten and Paul Ohm. The briefing was
> similar to the approach Shane described at Sunnyvale. In the move
> from red to yellow, here were examples of what could be scrubbed:
>
> 1. Unique IDs: replace with a one-way secret hash.
> 2. IP address: reduce to geo data.
> 3. URL cleanse: remove suspect query-string elements.
> 4. Side facts: remove link-out data that could be used to reverse-identify the record.
>
> Here are some ways that I've thought to describe what gets scrubbed,
> based on this sort of list:
>
> 1. Remove identifiers (name) and what have been called
> pseudo-identifiers in the de-ID debates (phone, passwords, etc.). But
> I don't think there is a generally accepted way to decide which
> pseudo-identifiers would be removed.
>
> 2. Earlier, I had suggested "direct" and "indirect" identifiers, but
> I agree with Ed's objection that these definitions are vague.
>
> 3. I am interested in the idea that going from red to yellow means
> removing information that is "exogenous" to the system operated by the
> company. That is, for names/identifiers/data fields that are used
> outside of the company, scrub those. Going to green would remove
> information that is "endogenous" to the system operated by the
> company; that is, even those within the company, with access to the
> system, could no longer reverse-engineer the scrubbing.
>
> When I suggested those terms on the call, someone basically said the
> terms were academic gobbledygook. The terms are defined here:
> http://en.wikipedia.org/wiki/Exogenous. I acknowledge the
> gobbledygook point, and the word "exogenous" is probably one only an
> economist could love. But I welcome comments on whether the idea is
> correct -- data fields that are generated or observable outside of the
> company are different from those generated within the company's system.
>
> 4.
> If exogenous/endogenous are correct in theory, but gobbledygook in
> practice, then I wonder if there are plain-language words that get at
> the same idea. My best current attempt is that red to yellow means
> scrubbing fields that are "observable from outside of the company" or
> "outwardly observable."
>
> _So, my suggestion is that red to yellow means scrubbing fields that
> are "observable from outside of the company" or "outwardly observable."_
>
> If this is correct, then the concept of k-anonymity likely remains
> relevant. Keeping broad demographic information such as male/female
> or age group can be in the yellow zone. However, a left-handed person
> under five feet with red hair would in most settings be a bucket too
> small.
>
> Clearly, the group has a variety of issues to address if we decide to
> go with a three-part R/Y/G approach to de-identification. The limited
> goal of this post is to try to help with terminology. Is it useful to
> say that the yellow zone means scrubbing data that is "observable from
> outside of the company", except for broad demographic data?
>
> Peter
>
> P.S. After I wrote the above, I realized that "observable from
> outside of the company" is similar in meaning to what can be "tracked"
> by those outside of the company. So scrubbing those items plausibly
> reduces tracking, at least by the other companies.
>
> Prof. Peter P. Swire
> C. William O'Neill Professor of Law
> Ohio State University
> 240.994.4142
> www.peterswire.net
>
> Beginning August 2013:
> Nancy J. and Lawrence P. Huang Professor
> Law and Ethics Program
> Scheller College of Business
> Georgia Institute of Technology
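The k-anonymity framework referenced in both messages above can be made concrete: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records, so k is the size of the smallest "bucket." A minimal sketch with hypothetical records (the attributes and values are illustrative, not from any real data set), including Peter's left-handed red-haired example:

```python
from collections import Counter

# Hypothetical records: quasi-identifiers are (sex, age_group, hair, handedness).
records = [
    ("M", "25-34", "brown", "right"),
    ("M", "25-34", "brown", "right"),
    ("F", "25-34", "brown", "right"),
    ("F", "25-34", "brown", "right"),
    ("M", "25-34", "red",   "left"),   # the "bucket too small"
]

def k_anonymity(rows):
    # k is the size of the smallest equivalence class: the rarest
    # combination of quasi-identifier values in the table.
    return min(Counter(rows).values())

print(k_anonymity(records))    # -> 1: the red-haired left-hander is unique

# Generalizing away the risky attributes, keeping only broad demographics,
# raises k.
coarsened = [(sex, age) for sex, age, hair, hand in records]
print(k_anonymity(coarsened))  # -> 2
```

This is one reason a numeric threshold reads better as a non-normative example than as hand-picked scrubbing steps: it gives a measurable criterion for when a demographic bucket is "too small."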
Received on Monday, 24 June 2013 05:51:26 UTC