Re: definition of "unlinkable data" in the Compliance spec from Joseph Lorenzo Hall on 2012-09-21 (public-tracking@w3.org from September 2012)

From: Joseph Lorenzo Hall <joe@cdt.org>
Date: Fri, 21 Sep 2012 10:11:28 -0400
To: Lauren Gelman <gelman@blurryedge.com>
CC: Ed Felten <ed@felten.com>, "<public-tracking@w3.org>" <public-tracking@w3.org>
Message-ID: <505C7590.6080305@cdt.org>

That's certainly shorter and clearer, and it addresses a bunch of Ed's 
concerns!

As for identifiable user agents, security experts think that it's 
essentially impossible (or very, very hard) to avoid having identifiable 
user agents (see the EFF's Panopticlick work and all the crazy stuff 
Tor Browser has to do to reduce risks of browser fingerprinting).

I wonder about "reasonable" here (likely used elsewhere in the specs)... 
are developers going to know what a lack of a "reasonable association" 
between data and person/UA means?

Apologies in advance for asking questions and not proposing a solution! 
best, Joe


On 9/20/12 7:22 PM, Lauren Gelman wrote:
>
> Unlinkable data is data that cannot reasonably be associated with an
> identifiable person or user agent.
>
> Lauren Gelman
> BlurryEdge Strategies
> 415-627-8512
>
> On Sep 18, 2012, at 8:05 AM, Ed Felten wrote:
>
>> Sorry to repost this, but nobody has answered any of my questions
>> about Option 1 for the unlinkability definition.
>>
>> Note to proponents of Option 1 (if any): If nobody can explain or
>> clarify Option 1, that will presumably be used as an argument against
>> Option 1 when decision time comes.
>>
>> ---------- Forwarded message ----------
>> From: Ed Felten <ed@felten.com>
>> Date: Thu, Sep 13, 2012 at 5:03 PM
>> Subject: definition of "unlinkable data" in the Compliance spec
>> To: "<public-tracking@w3.org>" <public-tracking@w3.org>
>>
>>
>> I have some questions about the Option 1 definition of "Unlinkable
>> Data", section 3.6.1 in the Compliance spec editor's draft. The
>> definition is as follows [fixing typos]:
>>
>> A party renders a dataset unlinkable when it:
>> 1. takes commercially reasonable steps to de-identify data such that
>> there is confidence that it contains information which could not be
>> linked to a specific user, user agent, or device in a production
>> environment
>> [2. and 3. aren't relevant to my questions]
>>
>> I have several questions about what this means.
>> (A) Why does the definition talk about a process of making data
>> unlinkable, instead of directly defining what it means for data to be
>> unlinkable?  Some data needs to be processed to make it unlinkable,
>> but some data is unlinkable from the start.  The definition should
>> speak to both, even though unlinkable-from-the-start data hasn't gone
>> through any kind of process.  Suppose FirstCorp collects data X;
>> SecondCorp collects X+Y but then runs a process that discards Y to
>> leave it with only X; and ThirdCorp collects X+Y+Z but then minimizes
>> away Y+Z to end up with X.  Shouldn't these three datasets be treated
>> the same--because they are the same X--despite having been through
>> different processes, or no process at all?
>> (B) Why "commercially reasonable" rather than just "reasonable"?  The
>> term "reasonable" already takes into account all relevant factors.
>> Can somebody give an example of something that would qualify as
>> "commercially reasonable" but not "reasonable", or vice versa?  If
>> not, "commercially" only makes the definition harder to understand.
>> (C) "there is confidence" seems to raise two questions.  First, who is
>> it that needs to be confident?  Second, can the confidence be just an
>> unsupported gut feeling of optimism, or does there need to be some
>> valid reason for confidence?  Presumably the intent is that the party
>> holding the data has justified confidence that the data cannot be
>> linked, but if so it might be better to spell that out.
>> (D) Why "it contains information which could not be linked" rather
>> than the simpler "it could not be linked"?  Do the extra words add any
>> meaning?
>> (E) What does "in a production environment" add?  If the goal is to
>> rule out results demonstrated in a research environment, I doubt this
>> language would accomplish that goal, because all of the
>> re-identification research I know of required less than a production
>> environment.  If the goal is to rule out linking approaches that
>> aren't at all practical, some other language would probably be better.
>>
>> (I don't have questions about the meaning of Option 2, which shouldn't
>> be interpreted as a preference for or against Option 2.)
>>
>>
>

-- 
Joseph Lorenzo Hall
Senior Staff Technologist
Center for Democracy & Technology
1634 I ST NW STE 1100
Washington DC 20006-4011
(p) 202-407-8825
(f) 202-637-0968
joe@cdt.org
Received on Friday, 21 September 2012 14:12:03 UTC