Fwd: definition of "unlinkable data" in the Compliance spec from Ed Felten on 2012-09-18 (public-tracking@w3.org from September 2012)

From: Ed Felten <ed@felten.com>
Date: Tue, 18 Sep 2012 11:05:35 -0400
To: "<public-tracking@w3.org>" <public-tracking@w3.org>
Message-ID: <CANZBoGh8N0eTRxG37-YEt9TNOuvY+xFHVAmOOqRZRVxXzbsb2Q@mail.gmail.com>

Sorry to repost this, but nobody has answered any of my questions about
Option 1 for the unlinkability definition.

Note to proponents of Option 1 (if any): If nobody can explain or clarify
Option 1, that will presumably be used as an argument against Option 1 when
decision time comes.

---------- Forwarded message ----------
From: Ed Felten <ed@felten.com>
Date: Thu, Sep 13, 2012 at 5:03 PM
Subject: definition of "unlinkable data" in the Compliance spec
To: "<public-tracking@w3.org>" <public-tracking@w3.org>

I have some questions about the Option 1 definition of "Unlinkable Data",
section 3.6.1 in the Compliance spec editor's draft.  The definition is as
follows [fixing typos]:

A party renders a dataset unlinkable when it:
1. takes commercially reasonable steps to de-identify data such that there
is confidence that it contains information which could not be linked to a
specific user, user agent, or device in a production environment
[2. and 3. aren't relevant to my questions]

I have several questions about what this means.
(A) Why does the definition talk about a process of making data unlinkable,
instead of directly defining what it means for data to be unlinkable?  Some
data needs to be processed to make it unlinkable, but some data is
unlinkable from the start.  The definition should speak to both, even
though unlinkable-from-the-start data hasn't gone through any kind of
process.  Suppose FirstCorp collects data X; SecondCorp collects X+Y but
then runs a process that discards Y to leave it with only X; and ThirdCorp
collects X+Y+Z but then minimizes away Y+Z to end up with X.  Shouldn't
these three datasets be treated the same--because they are the same
X--despite having been through different processes, or no process at all?
(B) Why "commercially reasonable" rather than just "reasonable"?  The term
"reasonable" already takes into account all relevant factors.  Can somebody
give an example of something that would qualify as "commercially
reasonable" but not "reasonable", or vice versa?  If not, "commercially"
only makes the definition harder to understand.
(C) "there is confidence" seems to raise two questions.  First, who is it
that needs to be confident?  Second, can the confidence be just an
unsupported gut feeling of optimism, or does there need to be some valid
reason for confidence?  Presumably the intent is that the party holding the
data has justified confidence that the data cannot be linked, but if so it
might be better to spell that out.
(D) Why "it contains information which could not be linked" rather than the
simpler "it could not be linked"?  Do the extra words add any meaning?
(E) What does "in a production environment" add?  If the goal is to rule
out results demonstrated in a research environment, I doubt this language
would accomplish that goal, because all of the re-identification research I
know of required less than a production environment.  If the goal is to
rule out linking approaches that aren't at all practical, some other
language would probably be better.

(I don't have questions about the meaning of Option 2, which shouldn't be
interpreted as a preference for or against Option 2.)

Received on Tuesday, 18 September 2012 15:06:25 UTC