RE: definition of "unlinkable data" in the Compliance spec

Let me rephrase the question slightly: what is your threat model?  Who is trying to obtain what, and what are they willing to spend?

Allan Schiffman expressed it very nicely some years ago (http://marginalguesswork.blogspot.com/2004/07/instant-immortality.html): "Amateurs study cryptography; professionals study economics."  How much effort do you think various people will put into linking -- deanonymizing -- data?  Unsalted hashes are, as noted, pretty trivial to invert in many cases of interest here.  Salted hashes or encrypted PII?  Who will hold the salt or key?  I assume we're not worried about special operations forces making midnight raids on data centers -- but how many {dollars, euros, yen, zorkmids} is a reidentified record worth?  That translates very directly into how many microseconds of compute time the effort is worth.  Or suppose that you have a file with 10,000,000 unlinkable records.  What is it worth to reidentify all 10,000,000?  1,000,000 random records?  100,000 random records?  One particular record that you think may be of interest for some particular reason?  What other resources can the adversary bring to bear?

Are folks' mental models of the threat that different?

-----Original Message-----
From: Rigo Wenning [mailto:rigo@w3.org] 
Sent: Sunday, September 23, 2012 3:32 PM
To: public-tracking@w3.org
Cc: Ed Felten
Subject: Re: definition of "unlinkable data" in the Compliance spec

Ed, 

On Thursday 13 September 2012 17:03:09 Ed Felten wrote:
> I have several questions about what this means.
> (A) Why does the definition talk about a process of making data 
> unlinkable, instead of directly defining what it means for data to be 
> unlinkable?  Some data needs to be processed to make it unlinkable, 
> but some data is unlinkable from the start.  The definition should 
> speak to both, even though unlinkable-from-the-start data hasn't gone 
> through any kind of process.  Suppose FirstCorp collects data X; 
> SecondCorp collects
> X+Y but then runs a process that discards Y to leave it with only
> X; and ThirdCorp collects X+Y+Z but then minimizes away Y+Z to end up 
> with X.  Shouldn't these three datasets be treated the same--because 
> they are the same X--despite having been through different processes, 
> or no process at all?

for the data protection people like me, unlinkable data is not within the scope of data protection measures, or of "privacy" if you want. 
It is therefore rather natural for our specification to talk only about linkable data, meaning data linked to a person, and to address only what to do with that linkable data and its link to a person. This may encompass a definition of what makes data "linkable". But it would go too far to define what is "unlinkable". Having done research on "unlinkable" data (Slim Trabelsi/SAP has created a nice script to determine the entropy allowing for de-anonymization), I know that a definition of "unlinkable data" would import that scientific dispute into the specification. I would not really like that to happen, as it would mean another point of endless debate. You can see this already happening in the thread following your message. 

Received on Monday, 24 September 2012 22:39:35 UTC