- From: Dan Auerbach <dan@eff.org>
- Date: Sun, 23 Jun 2013 22:50:54 -0700
- To: public-tracking@w3.org
- Message-ID: <51C7DE3E.1050509@eff.org>
Hi Peter,

I want to highlight that we still have a lot of work to do if we are to come to agreement on the de-identification issue. I think by moving from a 2-state to a 3-state de-identification process, we shifted where the disagreement lies, but we didn't really resolve it. I am open to a 3-state process, but a 2-state process is far simpler, so I would suggest that if we are to make a last effort at reaching agreement, we return to the 2-state model and tackle the de-identification question head on. I will re-propose language for this for the June draft.

Regarding the 3-state proposal, and setting aside nomenclature, we lack a clear definition of the "yellow" state, both in terms of the standard data must meet before it is "yellow" and the extent to which "yellow" data is granted special privileges in how it can be used and passed around. The more that yellow data is considered out of scope for DNT -- OK to keep for long stretches, OK to share with others -- the more rigorous we must be in ensuring that it is truly de-identified and unlinkable.

With respect to attempts to define the yellow data, I have a few comments.

First, de-identification is commonly understood to be a property of data. Adding things like "operational controls" into the mix muddies the waters, since we are then talking not about the data itself but about its administration. If we go down this more complicated road, we have to be crystal clear about what such controls look like, and explicit that the data itself remains in a state with significant re-identification risk.

Next, I want very much to re-emphasize what Mike has said: hashing a pseudonym to create another pseudonym is a null change. In the case of cookies, most attackers will not have access to the raw cookie strings of other users, so going from 1234 to ABCD is no improvement at all for privacy.
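Mike's point can be shown in a few lines. A deterministic one-way hash maps equal inputs to equal outputs, so every record that was linkable under the old pseudonym is still linkable under the new one. A minimal sketch with hypothetical cookie values and URLs (not from any actual data set):

```python
import hashlib

def pseudonymize(cookie_id: str) -> str:
    # One-way hash of a pseudonym: yields a new, but equally stable, pseudonym.
    return hashlib.sha256(cookie_id.encode()).hexdigest()

# Hypothetical log of page views keyed by raw cookie ID.
log = [("1234", "/news"), ("1234", "/health"), ("5678", "/sports")]

# "Scrubbed" log: every cookie ID replaced by its hash.
scrubbed = [(pseudonymize(cookie), url) for cookie, url in log]

# The same records still cluster under the same (new) identifier,
# so the browsing profile of user "1234" is fully intact.
assert scrubbed[0][0] == scrubbed[1][0]   # same user, still linkable
assert scrubbed[0][0] != scrubbed[2][0]   # distinct users, still distinct
```

The transformation only matters to an attacker who already holds the raw cookie string, which, as noted above, most attackers will not.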
Other suggested data manipulation strategies, like discarding IP addresses in favor of geolocations, do help reduce the privacy risk of a data set. But merely suggesting a few ad hoc techniques does not amount to a standard, and without much more clarity about what the exact proposals entail, I don't think we can evaluate the extent to which the privacy risk is reduced.

Finally, to address the idea of exogenous vs. endogenous: perhaps this could be a promising direction, but I'll confess I'm not sure what you have in mind. For example, you write that "data fields that are generated or observable outside of the company" would be exogenous, but wouldn't this include all user data, which is generated outside of the company? If an IP address comes in but is reduced to a coarse geolocation, is that exogenous or endogenous?

Instead of trying to invent the conceptual tools to draw these lines, I think it's wise to borrow from the rich academic literature focused on these exact questions. Since examining the privacy risks of large data sets is a fairly new and rapidly evolving field, I think we shouldn't tie ourselves to any one particular technique, but this is still the right place to look for meaningful technical distinctions. That's why I favor normative language stressing that data must carry no reasonable re-identification (or attribute disclosure) risk, together with non-normative examples that borrow from the literature and use rigorous frameworks like k-anonymity.

Dan

On 06/22/2013 08:29 AM, Peter Swire wrote:
> If the group decides to use a Red/Yellow/Green approach, one question
> has been how to describe the three stages. On the one hand, this may
> seem trivial because the substance matters more than the name. On the
> other hand, in my view, the names/descriptions are potentially
> important for two reasons: (1) they provide intellectual clarity about
> what goes in each group; and (2) they communicate the categories to a
> broader audience.
> I was part of a briefing that Shane did on Friday on the phone to FTC
> participants including Ed Felten and Paul Ohm. The briefing was
> similar to the approach Shane described at Sunnyvale. In the move
> from red to yellow, here were examples of what could be scrubbed:
>
> 1. Unique IDs: replace with a one-way secret hash.
> 2. IP address: reduce to geo data.
> 3. URL cleanse: remove suspect query-string elements.
> 4. Side facts: remove link-out data that could be used to reverse-identify the record.
>
> Here are some ways that I've thought to describe what gets scrubbed,
> based on this sort of list:
>
> 1. Remove identifiers (name) and what have been called
> pseudo-identifiers in the de-ID debates (phone, passwords, etc.). But
> I don't think there is a generally accepted way to decide which
> pseudo-identifiers would be removed.
>
> 2. Earlier, I had suggested "direct" and "indirect" identifiers, but
> I agree with Ed's objection that these definitions are vague.
>
> 3. I am interested in the idea that going from red to yellow means
> removing information that is "exogenous" to the system operated by the
> company. That is, for names/identifiers/data fields that are used
> outside of the company, scrub those. Going to green would remove
> information that is "endogenous" to the system operated by the
> company; that is, even those within the company, with access to the
> system, could no longer reverse-engineer the scrubbing.
>
> When I suggested those terms on the call, someone basically said the
> terms were academic gobbledygook. The terms are defined here:
> http://en.wikipedia.org/wiki/Exogenous. I acknowledge the
> gobbledygook point, and the word "exogenous" is probably one only an
> economist could love. But I welcome comments on whether the idea is
> correct -- data fields that are generated or observable outside of the
> company are different from those generated within the company's system.
>
> 4.
> If exogenous/endogenous are correct in theory, but gobbledygook in
> practice, then I wonder if there are plain-language words that get at
> the same idea. My best current attempt is that red to yellow means
> scrubbing fields that are "observable from outside of the company" or
> "outwardly observable."
>
> _So, my suggestion is that red to yellow means scrubbing fields that
> are "observable from outside of the company" or "outwardly observable."_
>
> If this is correct, then the concept of k-anonymity likely remains
> relevant. Keeping broad demographic information such as male/female
> or age group can be in the yellow zone. However, a left-handed person
> under five feet with red hair would in most settings be a bucket too
> small.
>
> Clearly, the group has a variety of issues to address if we decide to
> go with a three-part R/Y/G approach to de-identification. The limited
> goal of this post is to try to help with terminology. Is it useful to
> say that the yellow zone means scrubbing data that is "observable from
> outside of the company", except for broad demographic data?
>
> Peter
>
> P.S. After I wrote the above, I realized that "observable from
> outside of the company" is similar in meaning to what can be "tracked"
> by those outside of the company. So scrubbing those items plausibly
> reduces tracking, at least by the other companies.
>
> Prof. Peter P. Swire
> C. William O'Neill Professor of Law
> Ohio State University
> 240.994.4142
> www.peterswire.net
>
> Beginning August 2013:
> Nancy J. and Lawrence P. Huang Professor
> Law and Ethics Program
> Scheller College of Business
> Georgia Institute of Technology
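The k-anonymity framework referenced in both messages above can be made concrete: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records, so k is the size of the smallest "bucket." A minimal sketch with hypothetical records (the attributes and values are illustrative, not from any real data set), including Peter's left-handed red-haired example:

```python
from collections import Counter

# Hypothetical records: quasi-identifiers are (sex, age_group, hair, handedness).
records = [
    ("M", "25-34", "brown", "right"),
    ("M", "25-34", "brown", "right"),
    ("F", "25-34", "brown", "right"),
    ("F", "25-34", "brown", "right"),
    ("M", "25-34", "red",   "left"),   # the "bucket too small"
]

def k_anonymity(rows):
    # k is the size of the smallest equivalence class: the rarest
    # combination of quasi-identifier values in the table.
    return min(Counter(rows).values())

print(k_anonymity(records))    # -> 1: the red-haired left-hander is unique

# Generalizing away the risky attributes, keeping only broad demographics,
# raises k.
coarsened = [(sex, age) for sex, age, hair, hand in records]
print(k_anonymity(coarsened))  # -> 2
```

This is one reason a numeric threshold reads better as a non-normative example than as hand-picked scrubbing steps: it gives a measurable criterion for when a demographic bucket is "too small."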
Received on Monday, 24 June 2013 05:51:26 UTC