- From: Rob van Eijk <rob@blaeu.com>
- Date: Mon, 24 Jun 2013 12:17:33 +0200
- To: Dan Auerbach <dan@eff.org>, public-tracking@w3.org
- Message-ID: <f9645214-f1c9-4a83-aabb-206d406343e1@email.android.com>
Please discard the text proposal for remark 4; the copy/paste of the table remains a hassle and went wrong. Sorry, Rob

Rob van Eijk <rob@blaeu.com> wrote:

>Dear Peter, Dan, Shane,
>
>* On the naming of the end state:
>For me, the Y state is not the end state and should therefore not be named de-identified. In a 3-state model, G is the de-identified end state. In a 2-state model, de-identified is the second state.
>
>* On the definition of de-identified:
>I support the more open definition of de-identified from Dan/Lee: Data can be considered de-identified if it has been deleted, modified, aggregated, anonymized or otherwise manipulated in order to achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular user, user agent, or device.
>The definition draws a clear line in the sand with regard to the quality of the data in the end state of a data-scrubbing process. This mail is written with this definition in mind.
>
>* Non-normative remark 3, June draft change proposal on de-identification:
>If data is de-identified, it can be shared with other parties regardless of the DNT expression, but with one condition: the obligation to regularly assess and manage the risk of re-identification. This is addressed by Dan in non-normative remark 3.
>
>* Text proposal for remark 4:
><text>
>Data is fully de-identified when no party, including the party that performed the de-identification with knowledge of, e.g., the hashing algorithm and salt, can re-identify the data.
></text>
>
>* On the issue of a 2- or 3-state approach:
>The issue at hand is to add permitted uses to the states. This has not been addressed yet in the change proposal. I share Dan's view that hashing a hash is a null operation. But there are many elements in raw data that are not hashed, even elements that may be derived from the protocol. Having an intermediary step has its merit in my view, since scrubbing the data reduces the risk of data disclosure in case of, e.g., a data breach. Scrubbing data into an intermediary format provides a reasonable level of protection.
>Another reason is that a 3-state approach allows permitted uses to be mapped to either the RAW state or the intermediary state.
>
>* On the issue of unique IDs for permitted uses, and linkability versus de-identified:
>Mapping a permitted use to a state of de-identified data is not logical to me. If you have a permitted use, its data purpose must be well defined. I work under the assumption that, for a permitted use, new data should be able to be linked to data on file. In a truly de-identified end state this would not be possible: data can no longer be linked to data already collected on file. Under DNT:1, no data except for the permitted uses may be shared in linkable form.
>
>Mapping of permitted uses to a 3-state approach (a code sketch follows this message):
><text>
>R (RAW state, still linkable): security/fraud
>Y (intermediary state, still linkable): other permitted uses
>G (de-identified, no longer linkable): no permitted uses; data may be shared under the obligation to manage the risk of re-identification.
></text>
>
>* On the issue of key concepts:
>Text amendment/proposal of key concepts under Dan's definition of de-identified:
><text>
>* De-identification: a process towards unlinkability.
>* Linkability: the ability to add new data to previously collected data.
></text>
>
>Thanks,
>Rob
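[Editor's illustration] Rob's mapping of permitted uses to states can be written down as a small Python sketch. Everything here is illustrative only: the State enum, the helper function, and the permitted-use names other than security/fraud are invented placeholders, not draft text.

    # Hypothetical sketch of the proposed permitted-use-to-state mapping.
    from enum import Enum


    class State(Enum):
        RED = "raw"              # unscrubbed, still linkable
        YELLOW = "intermediary"  # scrubbed, still linkable
        GREEN = "de-identified"  # no longer linkable

    PERMITTED_USE_STATES = {
        "security_fraud": {State.RED},        # may operate on raw data
        "frequency_capping": {State.YELLOW},  # "other permitted uses"
        "financial_logging": {State.YELLOW},
        "debugging": {State.YELLOW},
        # GREEN maps to no permitted uses; de-identified data may be
        # shared, subject to managing the re-identification risk.
    }


    def allowed(permitted_use: str, state: State) -> bool:
        """True if this permitted use may operate on data in this state."""
        return state in PERMITTED_USE_STATES.get(permitted_use, set())


    assert allowed("security_fraud", State.RED)
    assert not allowed("debugging", State.GREEN)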
Dan Auerbach <dan@eff.org> wrote:

>>Hi Peter,
>>
>>I want to highlight that we still have a lot of work to do if we are to come to agreement on the de-identification issue. I think that by moving from a 2-state to a 3-state de-identification process, we shifted where the disagreement was, but we didn't really resolve it. I am open to a 3-state process, but think that a 2-state one is far simpler, and so would suggest that if we are to have a last effort at trying to come to agreement, we return to the 2-state model and tackle the de-identification question head on. I will re-propose language for this for the June draft.
>>
>>Regarding the 3-state proposal, and setting aside nomenclature, we lack a clear definition of the "yellow" state, both in terms of what standard data must meet before it is "yellow" and in terms of the extent to which "yellow" data is granted special privileges in how it can be used and passed around. The more that yellow data is considered out of scope for DNT -- OK to keep for long stretches, OK to share with others -- the more rigorous we must be in ensuring that it is truly de-identified and unlinkable.
>>
>>With respect to attempts to define the yellow data, I have a few comments. First, de-identification is commonly understood to be a property of data. Adding things like "operational controls" into the mix muddies the waters, since now we are talking not about the data itself, but about its administration. If we go down this more complicated road, we have to be crystal clear about what such controls look like, and explicit that the data itself is in a state with significant re-identification risk.
>>
>>Next, I want very much to re-emphasize what Mike has said -- hashing a pseudonym to create another pseudonym is a null change (a short demonstration follows this message). In the case of cookies, most attackers will not have access to the raw cookie strings of other users, so going from 1234 to ABCD is no improvement at all for privacy. Other suggested data-manipulation strategies, like discarding IP addresses in favor of geolocations, do have an effect in helping reduce the privacy risk of a data set. But just suggesting a few ad hoc techniques does not amount to a standard, and without much more clarity about what the exact proposals entail, I don't think we can evaluate the extent to which the privacy risk is reduced.
>>
>>Finally, to address the idea of exogenous vs. endogenous: perhaps this could be a promising direction, but I'll confess I'm not sure what you have in mind. For example, you write that "data fields that are generated or observable outside of the company" would be exogenous, but wouldn't this include all user data, which is generated outside of the company? If an IP address comes in but is reduced to a coarse geolocation, is that exogenous or endogenous?
>>
>>Instead of trying to invent the conceptual tools to draw these lines, I think it's wise to borrow from the rich set of academic literature focused on these exact questions. As examining the privacy risks of large data sets is a fairly new and rapidly evolving field, I think we shouldn't tie ourselves to any one particular technique, but the literature is still the right place to look for meaningful technical distinctions. That's why I favor non-normative language that stresses that data must have no reasonable re-identification (or attribute-disclosure) risk, along with non-normative examples that borrow from the literature and use rigorous frameworks like k-anonymity.
>>
>>Dan
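[Editor's illustration] A minimal Python sketch of the "null change" point above; the cookie values and log records are invented for illustration. Hashing each cookie ID, even with a salt, merely mints a new stable pseudonym, so records remain exactly as linkable as before.

    import hashlib


    def rehash(cookie_id: str, salt: str = "secret-salt") -> str:
        """One-way hash of a cookie ID -- a deterministic new pseudonym."""
        return hashlib.sha256((salt + cookie_id).encode()).hexdigest()


    log = [
        {"cookie": "1234", "url": "/politics"},
        {"cookie": "1234", "url": "/health"},
        {"cookie": "5678", "url": "/sports"},
    ]

    scrubbed = [{"cookie": rehash(r["cookie"]), "url": r["url"]} for r in log]

    # The first two records still share one identifier, so the browsing
    # profile can still be assembled: linkability is unchanged.
    assert scrubbed[0]["cookie"] == scrubbed[1]["cookie"]
    assert scrubbed[0]["cookie"] != scrubbed[2]["cookie"]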
>>On 06/22/2013 08:29 AM, Peter Swire wrote:
>>>
>>>If the group decides to use a Red/Yellow/Green approach, one question has been how to describe the three stages. On the one hand, this may seem trivial, because the substance means more than the name. On the other hand, in my view, the names/descriptions are potentially important for two reasons: (1) they provide intellectual clarity about what goes in each group; and (2) they communicate the categories to a broader audience.
>>>
>>>I was part of a briefing that Shane did on Friday on the phone to FTC participants including Ed Felten and Paul Ohm. The briefing was similar to the approach Shane described at Sunnyvale. In the move from red to yellow, here were examples of what could be scrubbed:
>>>
>>>1. Unique IDs, to a one-way secret hash.
>>>2. IP address, to geo data.
>>>3. URL cleanse: remove suspect query-string elements.
>>>4. Side facts: remove link-out data that could be used to reverse-identify the record.
>>>
>>>Here are some ways that I've thought to describe what gets scrubbed, based on this sort of list:
>>>
>>>1. Remove identifiers (name) and what have been called pseudo-identifiers in the de-ID debates (phone, passwords, etc.). But I don't think there is a generally accepted way to decide which pseudo-identifiers would be removed.
>>>
>>>2. Earlier, I had suggested "direct" and "indirect" identifiers, but I agree with Ed's objection that these definitions are vague.
>>>
>>>3. I am interested in the idea that going from red to yellow means removing information that is "exogenous" to the system operated by the company. That is, for names/identifiers/data fields that are used outside of the company, scrub those. Going to green would remove information that is "endogenous" to the system operated by the company; that is, even those within the company, with access to the system, could no longer reverse-engineer the scrubbing.
>>>
>>>When I suggested those terms on the call, someone basically said the terms were academic gobbledygook. The terms are defined here: http://en.wikipedia.org/wiki/Exogenous. I acknowledge the gobbledygook point, and the word "exogenous" is probably one only an economist could love. But I welcome comments on whether the idea is correct -- data fields that are generated or observable outside of the company are different from those generated within the company's system.
>>>
>>>4. If exogenous/endogenous are correct in theory but gobbledygook in practice, then I wonder if there are plain-language words that get at the same idea. My best current attempt is that red to yellow means scrubbing fields that are "observable from outside of the company" or "outwardly observable."
>>>
>>>_So, my suggestion is that red to yellow means scrubbing fields that are "observable from outside of the company" or "outwardly observable."_
>>>
>>>If this is correct, then the concept of k-anonymity likely remains relevant. Keeping broad demographic information such as male/female or age group can be in the yellow zone. However, a left-handed person under five feet with red hair would in most settings be a bucket too small.
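[Editor's illustration] Peter's bucket intuition is what k-anonymity formalizes. Below is a toy Python check with an invented population and an arbitrary threshold of k = 5 (the group has not endorsed any particular value of k): the broad demographic fields pass, while adding a rare attribute such as handedness creates a bucket of one.

    from collections import Counter

    K = 5  # illustrative threshold only


    def is_k_anonymous(records, quasi_identifiers, k=K):
        """True if every combination of quasi-identifier values
        occurs in at least k records."""
        buckets = Counter(
            tuple(r[q] for q in quasi_identifiers) for r in records
        )
        return all(count >= k for count in buckets.values())


    # 100 invented records: broad demographics form large buckets,
    # while exactly one record is left-handed.
    population = [
        {"sex": "male" if i % 2 else "female",
         "age_group": "25-34" if i % 3 else "35-44",
         "handedness": "left" if i == 7 else "right"}
        for i in range(100)
    ]

    assert is_k_anonymous(population, ["sex", "age_group"])
    assert not is_k_anonymous(population, ["sex", "age_group", "handedness"])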
>>>Clearly, the group has a variety of issues to address if we decide to go with a three-part R/Y/G approach to de-identification. The limited goal of this post is to try to help with terminology. Is it useful to say that the yellow zone means scrubbing data that is "observable from outside of the company", except for broad demographic data?
>>>
>>>Peter
>>>
>>>P.S. After I wrote the above, I realized that "observable from outside of the company" is similar in meaning to what can be "tracked" by those outside of the company. So scrubbing those items plausibly reduces tracking, at least by the other companies.
>>>
>>>Prof. Peter P. Swire
>>>C. William O'Neill Professor of Law
>>>Ohio State University
>>>240.994.4142
>>>www.peterswire.net
>>>
>>>Beginning August 2013:
>>>Nancy J. and Lawrence P. Huang Professor
>>>Law and Ethics Program
>>>Scheller College of Business
>>>Georgia Institute of Technology
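[Editor's illustration] One possible Python rendering of the four red-to-yellow scrub steps from Peter's list above. The field names, the deny-list of suspect parameters, and the coarse_geo() lookup are all invented stand-ins, not draft text; and note, per Dan's objection, that step 1 yields another stable pseudonym, so the output is still linkable (yellow, not green).

    import hashlib
    import hmac
    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    SECRET_SALT = b"rotate-me-periodically"     # known only inside the company
    SUSPECT_PARAMS = {"email", "uid", "user"}   # illustrative deny-list


    def coarse_geo(ip: str) -> str:
        """Stand-in for a real IP-to-region lookup table."""
        return "EU" if ip.startswith("185.") else "unknown"


    def scrub_red_to_yellow(record: dict) -> dict:
        # 1. Unique ID -> one-way secret (keyed) hash. Still a stable
        #    pseudonym, so the record remains linkable.
        pseudonym = hmac.new(SECRET_SALT, record["uid"].encode(),
                             hashlib.sha256).hexdigest()
        # 2. IP address -> coarse geo data.
        geo = coarse_geo(record["ip"])
        # 3. URL cleanse: remove suspect query-string elements.
        parts = urlsplit(record["url"])
        kept = [(k, v) for k, v in parse_qsl(parts.query)
                if k not in SUSPECT_PARAMS]
        url = urlunsplit(parts._replace(query=urlencode(kept)))
        # 4. Side facts: keep only the explicitly whitelisted fields,
        #    dropping anything that could reverse-identify the record.
        return {"uid": pseudonym, "geo": geo, "url": url}


    print(scrub_red_to_yellow({
        "uid": "1234",
        "ip": "185.3.100.31",
        "url": "http://example.com/page?x=1&email=a@b.com",
        "note": "left-handed, red hair",   # a side fact that gets dropped
    }))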
Received on Monday, 24 June 2013 10:42:16 UTC