
Re: ACTION-412, Naming R/Y/G

From: Rob van Eijk <rob@blaeu.com>
Date: Mon, 24 Jun 2013 12:09:41 +0200
To: Dan Auerbach <dan@eff.org>, public-tracking@w3.org
Message-ID: <97f1bcb1-6f84-41b4-8efd-216c43576eff@email.android.com>
Dear Peter, Dan, Shane,


* On the naming of the end-state:
For me, the Y is not the end-state and should therefore not be named de-identified. In a 3-state model, the G is the de-identified end-state; in a 2-state model, de-identified is the second state.

* On the definition of de-identified:
I support the more open definition of de-identified from Dan/Lee: Data can be considered de-identified if it has been deleted, modified, aggregated, anonymized or otherwise manipulated in order to achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular user, user agent, or device.
The definition draws a clear line in the sand with regard to the quality of the data in the end state of a data-scrubbing process. This mail is written with this definition in mind.

* Non-normative remark 3, June draft change proposal de-identification:
If data is de-identified, it can be shared with other parties regardless of the DNT expression, but with one condition: the obligation to regularly assess and manage the risk of re-identification. This is addressed by Dan in non-normative remark 3.

* Text proposal for remark 4:
<text>
Data is fully de-identified when no party, including the party that performed the de-identification and that has knowledge of, e.g., the hashing algorithm and salt, can re-identify the data.
</text>
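To illustrate the point of remark 4 with a hypothetical sketch (the salt, user IDs and field names below are invented for illustration): a salted one-way hash is not de-identified for any party that knows the algorithm and the salt, because that party can recompute the hash for candidate IDs and re-link the record.

```python
import hashlib

SALT = b"secret-salt"  # assumed value, known to the de-identifying party

def pseudonymize(user_id: str) -> str:
    """Salted one-way hash of a unique ID."""
    return hashlib.sha256(SALT + user_id.encode()).hexdigest()

# A "scrubbed" record in which the raw ID was replaced by its hash:
scrubbed_record = {"id": pseudonymize("user-1234"), "geo": "Amsterdam"}

# A party holding the algorithm and salt can simply test raw IDs
# against the scrubbed data and re-identify the record:
candidate_ids = ["user-0001", "user-1234", "user-9999"]
relinked = [uid for uid in candidate_ids
            if pseudonymize(uid) == scrubbed_record["id"]]
print(relinked)  # ['user-1234'] -- the record is re-identified
```

Hence the proposed wording: data is only fully de-identified once even the hash-and-salt holder can no longer reverse the scrubbing.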

* On the issue of a 2 or 3 state approach:
The issue at hand is to add permitted uses to the states. This has not been addressed yet in the change proposal. I share Dan's view that hashing a hash is a null operation. But there are many elements in raw data that are not hashed, even elements that may be derived from the protocol. Having an intermediary step has its merit in my view, since scrubbing the data reduces the risk of data disclosure in case of, e.g., a data breach. Scrubbing data into an intermediary format provides a reasonable level of protection.
Another reason is that having a 3 state approach allows for mapping permitted uses to either the RAW state or the intermediary state.  
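A minimal sketch of the "hashing a hash is a null" point (the cookie values below are invented): replacing one stable, opaque pseudonym with another leaves the records exactly as linkable as before, because the mapping is still one-to-one and stable across records.

```python
import hashlib

def rehash(pseudonym: str) -> str:
    """Hash an existing pseudonym into a new opaque pseudonym."""
    return hashlib.sha256(pseudonym.encode()).hexdigest()

events = [("cookie-1234", "page-a"),
          ("cookie-1234", "page-b"),
          ("cookie-5678", "page-c")]
rehashed = [(rehash(cookie), page) for cookie, page in events]

# The same user's events still cluster under the same (new) pseudonym,
# and distinct users remain distinct -- linkability is unchanged:
assert rehashed[0][0] == rehashed[1][0]
assert rehashed[0][0] != rehashed[2][0]
```

The privacy gain from the intermediary state must therefore come from the other scrubbing steps (dropping IP addresses, cleansing URLs, etc.), not from re-hashing pseudonyms.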

* On the issue of unique IDs for permitted uses and linkability versus de-identified:
Mapping a permitted use to a state of de-identified data is not logical to me. If you have a permitted use, its data purpose must be well defined. I work under the assumption that new data should be able to be linked to data on file for a permitted use. In a truly de-identified end state this functionality would not be possible: data can no longer be linked to data already collected on file. If DNT:1 is set, no data except for the permitted uses may be shared in linkable form.

<text>
Mapping of permitted uses to a 3 state approach:  
R: (RAW state, still linkable): Security/Fraud
Y: (Intermediary state, still linkable): other permitted uses
G: (De-identified state, no longer linkable): no permitted uses; data may be shared under the obligation to manage the risk of re-identification.
</text>
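The mapping above could be sketched as a small lookup structure (state and use names are assumed, purely illustrative; "linkable" means new data can still be joined to data already on file):

```python
# Hypothetical encoding of the proposed 3-state mapping of permitted uses.
STATE_PERMITTED_USES = {
    "R": {"linkable": True, "uses": ["security/fraud"]},        # RAW
    "Y": {"linkable": True, "uses": ["other permitted uses"]},  # intermediary
    # G: de-identified; sharable only under the obligation to manage
    # the risk of re-identification.
    "G": {"linkable": False, "uses": []},
}

print(STATE_PERMITTED_USES["G"]["linkable"])  # False
```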

* On the issue of key concepts:
Text amendment/proposal of key concepts under the definition of de-identified by Dan:
<text>
* De-identification: a process towards un-linkability.
* Linkability: the ability to add new data to previously collected data.
</text>

Thanks,
Rob



Dan Auerbach <dan@eff.org> wrote:

>Hi Peter,
>
>I want to highlight that we still have a lot of work to do if we are to
>come to agreement on the de-identification issue. I think by moving
>from
>a 2-state to a 3-state de-identification process, we shifted where the
>disagreement was, but we didn't really resolve it. I am open to a
>3-state process, but think that 2 state is far simpler, and so would
>suggest if we are to have a last effort at trying to come to agreement,
>we return to the 2 state model and tackle the de-identification
>question
>head on. I will re-propose language for this for the June draft.
>
>Regarding the 3 state proposal, and setting aside nomenclature, we lack
>a clear definition of the "yellow" state both in terms of what standard
>data must meet before it is "yellow", and the extent to which "yellow"
>data is granted special privileges in terms of how it can be used and
>passed around. The more that yellow data is considered out of scope for
>DNT -- OK to keep for long stretches, OK to share with others -- the
>more rigorous we must be in ensuring that it is truly de-identified and
>unlinkable.
>
>With respect to attempts to define the yellow data, I have a few
>comments. First, de-identification is commonly understood to be a
>property of data. Adding things like "operational controls" into the
>mix
>muddies the waters, since now we are talking not about the data itself,
>but about its administration. If we go down this more complicated road,
>we have to be crystal clear about what such controls look like, and
>explicit that the data itself is in a state with significant
>re-identification risk.
>
>Next, I want very much to re-emphasize what Mike has said -- hashing a
>pseudonym to create another pseudonym is a null change. In the case of
>cookies, most attackers will not have access to the raw cookie string
>of
>other users, so going from 1234 to ABCD is not an improvement at all
>for
>privacy. Other suggested data manipulation strategies, like discarding
>IP addresses in favor of geolocations, do have an effect in helping
>reduce the privacy risk of a data set. But just suggesting a few ad hoc
>techniques does not amount to a standard, and without much more clarity
>about what the exact proposals entail, I don't think we can evaluate
>the
>extent to which the privacy risk is reduced.
>
>Finally, to address the idea of exogenous vs endogenous, perhaps this
>could be a promising direction, but I'll confess I'm not so sure of
>what
>you have in mind. For example, you write "data fields that are
>generated
>or observable outside of the company" would be exogenous, but wouldn't
>this include all user data, which is generated outside of the company?
>If an IP address comes in, but is reduced to a coarse geo location, is
>that exogenous or endogenous?
>
>Instead of trying to invent the conceptual tools to draw these lines, I
>think it's wise to borrow from the rich set of academic literature
>focused on these exact questions. As examining privacy risks of large
>data sets is a fairly new and rapidly evolving field, I think we
>shouldn't tie ourselves to any one particular technique, but this is
>still the right place to look for meaningful technical distinctions.
>That's why I favor non-normative language that stresses that data has
>no
>reasonable re-identification (or attribute disclosure) risk, while
>using
>non-normative examples that borrow from literature and use rigorous
>frameworks like k-anonymity.
>
>Dan
>
>On 06/22/2013 08:29 AM, Peter Swire wrote:
>>
>> If the group decides to use a Red/Yellow/Green approach, one question
>> has been how to describe the three stages.  On the one hand, this may
>> seem trivial because the substance means more than the name.  On the
>> other hand, in my view, the names/descriptions are potentially
>> important for two reasons: (1) they provide intellectual clarity
>about
>> what goes in each group; and (2) they communicate the categories to a
>> broader audience.
>>
>>  
>>
>> I was part of a briefing that Shane did on Friday on the phone to FTC
>> participants including Ed Felten and Paul Ohm.  The briefing was
>> similar to the approach Shane described at Sunnyvale.  In the move
>> from red to yellow, here were examples of what could be scrubbed:
>>
>>  
>>
>> 1.  Unique IDs, to one-way secret hash.
>>
>> 2.  IP address, to geo data.
>>
>> 3.  URL cleanse, remove suspect query string elements.
>>
>> 4.  Side facts, remove link out data that could be used to reverse
>> identify the record.
>>
>>  
>>
>> Here are some ways that I’ve thought to describe what gets scrubbed,
>> based on this sort of list:
>>
>>  
>>
>> 1.  Remove identifiers (name) and what have been called
>> pseudo-identifiers in the deID debates (phone, passwords, etc.).  But
>> I don’t think there is a generally accepted way to decide what
>> pseudo-identifiers would be removed.
>>
>>  
>>
>> 2.  Earlier, I had suggested “direct” and “indirect” identifiers, but
>> I agree with Ed’s objection that these definitions are vague.
>>
>>  
>>
>> 3.  I am interested in the idea that going from red to yellow means
>> removing information that is “exogenous” to the system operated by
>the
>> company.  That is, for names/identifiers/data fields that are used
>> outside of the company, scrub those.  Going to green would remove
>> information that is “endogenous” to the system operated by the
>> company, that is, even those within the company, with access to the
>> system, could no longer reverse engineer the scrubbing.
>>
>>  
>>
>> When I suggested those terms on the call, someone basically said the
>> terms were academic gobbledygook.  The terms are defined here:
>> http://en.wikipedia.org/wiki/Exogenous.  I acknowledge the
>> gobbledygook point, and the word “exogenous” is probably one only an
>> economist could love.  But I welcome comments on whether the idea is
>> correct – data fields that are generated or observable outside of the
>> company are different from those generated within the company’s
>system.
>>
>>  
>>
>> 4.  If exogenous/endogenous are correct in theory, but gobbledygook
>in
>> practice, then I wonder if there are plain language words that get at
>> the same idea.  My best current attempt is that red to yellow means
>> scrubbing fields that are “observable from outside of the company” or
>> “outwardly observable.”
>>
>>  
>>
>> _So, my suggestion is that red to yellow means scrubbing fields that
>> are “observable from outside of the company” or “outwardly
>observable.”_
>>
>>  
>>
>> If this is correct, then the concept of k-anonymity likely remains
>> relevant.  Keeping broad demographic information such as male/female
>> or age group can be in the yellow zone.  However, a left-handed
>person
>> under five feet with red hair would in most settings be a bucket too
>> small.
>>
>>  
>>
>> Clearly, the group has a variety of issues to address if we decide to
>> go with a three-part R/Y/G approach to de-identification.  The
>limited
>> goal of this post is to try to help with terminology.  Is it useful
>to
>> say that the yellow zone means scrubbing data that is “observable
>from
>> outside of the company”, except for broad demographic data?
>>
>>  
>>
>> Peter
>>
>>
>> P.S.  After I wrote the above, I realized that "observable from
>> outside of the company" is similar in meaning to what can be
>"tracked"
>> by those outside of the company.  So scrubbing those items plausibly
>> reduces tracking, at least by the other companies.
>>
>>
>>
>> Prof. Peter P. Swire
>> C. William O'Neill Professor of Law
>> Ohio State University
>> 240.994.4142
>> www.peterswire.net
>>
>> Beginning August 2013:
>> Nancy J. and Lawrence P. Huang Professor
>> Law and Ethics Program
>> Scheller College of Business
>> Georgia Institute of Technology
>>
Received on Monday, 24 June 2013 10:10:39 UTC
