Re: ACTION-412, Naming R/Y/G from Rob van Eijk on 2013-06-24 (public-tracking@w3.org from June 2013)

From: Rob van Eijk <rob@blaeu.com>
Date: Mon, 24 Jun 2013 12:17:33 +0200
To: Dan Auerbach <dan@eff.org>, public-tracking@w3.org
Message-ID: <f9645214-f1c9-4a83-aabb-206d406343e1@email.android.com>
Please discard text proposal for remark 4.

copy/paste on table remains a hassle and went wrong.

Sorry,
Rob


Rob van Eijk <rob@blaeu.com> wrote:

>Dear Peter, Dan, Shane,
>
>
>* On the naming of the end-state:
>For me the Y is not the end-state and should therefore not be named
>de-identified. In a 3 state model, the G is the de-identified
>end-state. In a 2 state model, de-identified is the second state.
>
>* On the definition of de-identified:
>I support the more open definition of de-identified of Dan/Lee: Data
>can be considered de-identified if it has been deleted, modified,
>aggregated, anonymized or otherwise manipulated in order to achieve a
>reasonable level of justified confidence that the data cannot
>reasonably be used to infer information about, or otherwise be linked
>to, a particular user, user agent, or device. 
>The definition draws a clear line in the sand with regard to the
>quality of the data in the end state of a data scrubbing process. This
>mail is written with this definition in mind.
>
>* Non-normative remark 3, June draft change proposal de-identification:
>If data is de-identified, it can be shared with other parties
>regardless of the DNT expression, but with one condition: the
>obligation to regularly assess and manage the risk of
>re-identification. This is addressed by Dan in non-normative remark 3.
>
>* Text proposal for remark 4:
><text>
>Data is fully de-identified when any party, including the party
>performing de-identification with knowledge of i.e. the hashing
>algorithm and salt.
></text>
>
>* On the issue of a 2 or 3 state approach:
>The issue at hand is to add permitted uses to the states. This has not
>been identified yet in the change proposal. I share the view of Dan,
>that hashing a hash is a null. But there are many elements in raw data
>that are not hashed, even elements that may be derived from the
>protocol. Having an intermediary step has its merit in my view, since
>scrubbing the data reduces the risk of data disclosure in case of e.g.
>a data breach. Scrubbing data into an intermediary format addresses a
>reasonable level of protection. 
>Another reason is that having a 3 state approach allows for mapping
>permitted uses to either the RAW state or the intermediary state.  
>
>* On the issue of unique IDs for permitted uses and linkability versus
>de-identified:
>Mapping a permitted use to a state of de-identified data is not logical
>to me. If you have a permitted use, its data purpose must be well
>defined. I work under the assumption that new data should be able to be
>linked to data on file for a permitted use. In a truly de-identified
>end state this functionality would not be possible. In a truly
>de-identified state, data can no longer be linked to data already
>collected on file. If DNT:1, no data, except for the permitted uses,
>must be shared in linkable form.
>
><text>
>Mapping of permitted uses to a 3 state approach:  
>R: (RAW state, still linkable): Security/Fraud
>Y: (Intermediary state, still linkable): other permitted uses
>G: (de-identified state, no longer linkable): no permitted uses, data may be
>shared under the obligation to manage the risk of re-identification.
></text>
>
>* On the issue of key concepts:
>Text amendment/proposal of key concepts under the definition of
>de-identified by Dan:
><text>
>* De-identification: a process towards un-linkability.
>* Linkability: the ability to add new data to previously collected
>data.
></text>
>
>Thanks,
>Rob
>
>
>
>
>Dan Auerbach <dan@eff.org> wrote:
>
>>Hi Peter,
>>
>>I want to highlight that we still have a lot of work to do if we are to
>>come to agreement on the de-identification issue. I think by moving from
>>a 2-state to a 3-state de-identification process, we shifted where the
>>disagreement was, but we didn't really resolve it. I am open to a
>>3-state process, but think that 2 state is far simpler, and so would
>>suggest if we are to have a last effort at trying to come to agreement,
>>we return to the 2 state model and tackle the de-identification question
>>head on. I will re-propose language for this for the June draft.
>>
>>Regarding the 3 state proposal, and setting aside nomenclature, we lack
>>a clear definition of the "yellow" state both in terms of what standard
>>data must meet before it is "yellow", and the extent to which "yellow"
>>data is granted special privileges in terms of how it can be used and
>>passed around. The more that yellow data is considered out of scope for
>>DNT -- OK to keep for long stretches, OK to share with others -- the
>>more rigorous we must be in ensuring that it is truly de-identified and
>>unlinkable.
>>
>>With respect to attempts to define the yellow data, I have a few
>>comments. First, de-identification is commonly understood to be a
>>property of data. Adding things like "operational controls" into the mix
>>muddies the waters, since now we are talking not about the data itself,
>>but about its administration. If we go down this more complicated road,
>>we have to be crystal clear about what such controls look like, and
>>explicit that the data itself is in a state with significant
>>re-identification risk.
>>
>>Next, I want very much to re-emphasize what Mike has said -- hashing a
>>pseudonym to create another pseudonym is a null change. In the case of
>>cookies, most attackers will not have access to the raw cookie string of
>>other users, so going from 1234 to ABCD is not an improvement at all for
>>privacy. Other suggested data manipulation strategies, like discarding
>>IP addresses in favor of geolocations, do have an effect in helping
>>reduce the privacy risk of a data set. But just suggesting a few ad hoc
>>techniques does not amount to a standard, and without much more clarity
>>about what the exact proposals entail, I don't think we can evaluate the
>>extent to which the privacy risk is reduced.
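
The "null change" point can be illustrated with a minimal Python sketch. The log records, the salt, and the helper names below are hypothetical and not taken from any proposal on this list. The sketch groups page views by identifier before and after hashing and checks that the resulting per-user profiles are the same.

```python
import hashlib
from collections import defaultdict

# Hypothetical log records: (cookie_id, visited_url). Illustrative data only.
records = [
    ("1234", "/news"),
    ("5678", "/sports"),
    ("1234", "/health"),
    ("5678", "/news"),
]


def pseudonymize(cookie_id: str, salt: bytes = b"example-salt") -> str:
    """Replace a cookie ID with a salted SHA-256 digest (assumed scheme)."""
    return hashlib.sha256(salt + cookie_id.encode()).hexdigest()


def profiles(rows):
    """Group URLs by identifier and return the profiles without their keys."""
    grouped = defaultdict(list)
    for ident, url in rows:
        grouped[ident].append(url)
    return sorted(grouped.values())


raw_profiles = profiles(records)
hashed_profiles = profiles([(pseudonymize(i), u) for i, u in records])

# The identifiers changed, but the same page views are still tied together.
same_linkage = raw_profiles == hashed_profiles
print(same_linkage)
```

Because the hash is deterministic, every record that shared a cookie ID still shares a pseudonym, so new data can still be added to the same profile.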
>>
>>Finally, to address the idea of exogenous vs endogenous, perhaps this
>>could be a promising direction, but I'll confess I'm not so sure of what
>>you have in mind. For example, you write "data fields that are generated
>>or observable outside of the company" would be exogenous, but wouldn't
>>this include all user data, which is generated outside of the company?
>>If an IP address comes in, but is reduced to a coarse geo location, is
>>that exogenous or endogenous?
>>
>>Instead of trying to invent the conceptual tools to draw these lines, I
>>think it's wise to borrow from the rich set of academic literature
>>focused on these exact questions. As examining privacy risks of large
>>data sets is a fairly new and rapidly evolving field, I think we
>>shouldn't tie ourselves to any one particular technique, but this is
>>still the right place to look for meaningful technical distinctions.
>>That's why I favor non-normative language that stresses that data has no
>>reasonable re-identification (or attribute disclosure) risk, while using
>>non-normative examples that borrow from literature and use rigorous
>>frameworks like k-anonymity.
>>
>>Dan
>>
>>On 06/22/2013 08:29 AM, Peter Swire wrote:
>>>
>>> If the group decides to use a Red/Yellow/Green approach, one question
>>> has been how to describe the three stages.  On the one hand, this may
>>> seem trivial because the substance means more than the name.  On the
>>> other hand, in my view, the names/descriptions are potentially
>>> important for two reasons: (1) they provide intellectual clarity about
>>> what goes in each group; and (2) they communicate the categories to a
>>> broader audience.
>>>
>>>  
>>>
>>> I was part of a briefing that Shane did on Friday on the phone to FTC
>>> participants including Ed Felten and Paul Ohm.  The briefing was
>>> similar to the approach Shane described at Sunnyvale.  In the move
>>> from red to yellow, here were examples of what could be scrubbed:
>>>
>>>  
>>>
>>> 1.  Unique IDs, to one-way secret hash.
>>>
>>> 2.  IP address, to geo data.
>>>
>>> 3.  URL cleanse, remove suspect query string elements.
>>>
>>> 4.  Side facts, remove link out data that could be used to reverse
>>> identify the record.
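
The first three scrubbing steps above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the secret key, the list of suspect query parameters, and the function names are hypothetical, and truncating an IPv4 address to its /24 network is used as a stand-in for a real IP-to-geo lookup. Step 4 (side facts) depends on the data set and is not shown.

```python
import hashlib
import hmac
import ipaddress
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumptions for illustration only: a secret key held by the company and a
# hand-picked list of query parameters treated as "suspect".
SECRET_KEY = b"hypothetical-secret-key"
SUSPECT_PARAMS = {"email", "user", "uid", "sessionid"}


def hash_id(unique_id: str) -> str:
    """Step 1: one-way keyed hash (HMAC-SHA256) of a unique ID."""
    return hmac.new(SECRET_KEY, unique_id.encode(), hashlib.sha256).hexdigest()


def coarsen_ip(ip: str) -> str:
    """Step 2 stand-in: keep only the /24 network of an IPv4 address."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def cleanse_url(url: str) -> str:
    """Step 3: drop suspect query string elements, keep the rest."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SUSPECT_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def scrub(record: dict) -> dict:
    """Apply steps 1 to 3 to one hypothetical log record."""
    return {
        "id": hash_id(record["id"]),
        "ip": coarsen_ip(record["ip"]),
        "url": cleanse_url(record["url"]),
    }


example = scrub({
    "id": "1234",
    "ip": "203.0.113.77",
    "url": "https://example.com/page?q=shoes&email=a%40b.com",
})
print(example)
```

Note that the hashed ID is still a stable pseudonym, so records scrubbed this way remain linkable to each other.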
>>>
>>>  
>>>
>>> Here are some ways that I’ve thought to describe what gets scrubbed,
>>> based on this sort of list:
>>>
>>>  
>>>
>>> 1.  Remove identifiers (name) and what have been called
>>> pseudo-identifiers in the deID debates (phone, passwords, etc.).  But
>>> I don’t think there is a generally accepted way to decide what
>>> pseudo-identifiers would be removed.
>>>
>>>  
>>>
>>> 2.  Earlier, I had suggested “direct” and “indirect” identifiers, but
>>> I agree with Ed’s objection that these definitions are vague.
>>>
>>>  
>>>
>>> 3.  I am interested in the idea that going from red to yellow means
>>> removing information that is “exogenous” to the system operated by the
>>> company.  That is, for names/identifiers/data fields that are used
>>> outside of the company, scrub those.  Going to green would remove
>>> information that is “endogenous” to the system operated by the
>>> company, that is, even those within the company, with access to the
>>> system, could no longer reverse engineer the scrubbing.
>>>
>>>  
>>>
>>> When I suggested those terms on the call, someone basically said the
>>> terms were academic gobbledygook.  The terms are defined here:
>>> http://en.wikipedia.org/wiki/Exogenous.  I acknowledge the
>>> gobbledygook point, and the word “exogenous” is probably one only an
>>> economist could love.  But I welcome comments on whether the idea is
>>> correct – data fields that are generated or observable outside of the
>>> company are different from those generated within the company’s system.
>>>
>>>  
>>>
>>> 4.  If exogenous/endogenous are correct in theory, but gobbledygook in
>>> practice, then I wonder if there are plain language words that get at
>>> the same idea.  My best current attempt is that red to yellow means
>>> scrubbing fields that are “observable from outside of the company” or
>>> “outwardly observable.”
>>>
>>>  
>>>
>>> _So, my suggestion is that red to yellow means scrubbing fields that
>>> are “observable from outside of the company” or “outwardly observable.”_
>>>
>>>  
>>>
>>> If this is correct, then the concept of k-anonymity likely remains
>>> relevant.  Keeping broad demographic information such as male/female
>>> or age group can be in the yellow zone.  However, a left-handed person
>>> under five feet with red hair would in most settings be a bucket too
>>> small.
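
The "bucket too small" idea can be sketched as a k-anonymity check in Python. The rows and attribute names below are made up for illustration. The function reports the size of the smallest group of records that share the same values on a chosen set of attributes; a data set is k-anonymous with respect to those attributes when that size is at least k.

```python
from collections import Counter


def smallest_bucket(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the given attributes."""
    counts = Counter(
        tuple(row[attr] for attr in quasi_identifiers) for row in rows
    )
    return min(counts.values())


# Hypothetical records; attribute names and values are illustrative only.
rows = [
    {"sex": "F", "age_group": "25-34", "handed": "right", "hair": "brown"},
    {"sex": "F", "age_group": "25-34", "handed": "right", "hair": "black"},
    {"sex": "M", "age_group": "25-34", "handed": "left", "hair": "red"},
    {"sex": "M", "age_group": "25-34", "handed": "right", "hair": "brown"},
]

# Broad demographics only: every bucket holds at least two records.
broad_k = smallest_bucket(rows, ["sex", "age_group"])

# Adding handedness and hair colour isolates individual records.
narrow_k = smallest_bucket(rows, ["sex", "age_group", "handed", "hair"])

print(broad_k, narrow_k)
```

In this toy data the broad attributes give a smallest bucket of 2, while the detailed combination gives a smallest bucket of 1, which is the situation described above.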
>>>
>>>  
>>>
>>> Clearly, the group has a variety of issues to address if we decide to
>>> go with a three-part R/Y/G approach to de-identification.  The limited
>>> goal of this post is to try to help with terminology.  Is it useful to
>>> say that the yellow zone means scrubbing data that is “observable from
>>> outside of the company”, except for broad demographic data?
>>>
>>>  
>>>
>>> Peter
>>>
>>>
>>> P.S.  After I wrote the above, I realized that "observable from
>>> outside of the company" is similar in meaning to what can be "tracked"
>>> by those outside of the company.  So scrubbing those items plausibly
>>> reduces tracking, at least by the other companies.
>>>
>>>
>>>
>>> Prof. Peter P. Swire
>>> C. William O'Neill Professor of Law
>>> Ohio State University
>>> 240.994.4142
>>> www.peterswire.net
>>>
>>> Beginning August 2013:
>>> Nancy J. and Lawrence P. Huang Professor
>>> Law and Ethics Program
>>> Scheller College of Business
>>> Georgia Institute of Technology
>>>
Received on Monday, 24 June 2013 10:42:16 UTC