Re: definition of "unlinkable data" in the Compliance spec

From: Rigo Wenning <rigo@w3.org>
Date: Sun, 23 Sep 2012 21:32:06 +0200
To: public-tracking@w3.org
Cc: Ed Felten <ed@felten.com>
Message-ID: <2418381.vJaz8XgzHn@hegel.sophia.w3.org>

On Thursday 13 September 2012 17:03:09 Ed Felten wrote:
> I have several questions about what this means.
> (A) Why does the definition talk about a process of making data
> unlinkable, instead of directly defining what it means for data
> to be unlinkable?  Some data needs to be processed to make it
> unlinkable, but some data is unlinkable from the start.  The
> definition should speak to both, even though
> unlinkable-from-the-start data hasn't gone through any kind of
> process.  Suppose FirstCorp collects data X; SecondCorp collects
> X+Y but then runs a process that discards Y to leave it with only
> X; and ThirdCorp collects X+Y+Z but then minimizes away Y+Z to
> end up with X.  Shouldn't these three datasets be treated the
> same--because they are the same X--despite having been through
> different processes, or no process at all? 

For data protection people like me, unlinkable data is outside the
scope of data protection measures (or "privacy", if you want).
It is therefore rather natural for our Specifications to talk only
about linkable data, meaning data linked to a person, and to address
only what to do with that linkable data and its link to a person.
This may encompass a definition of what makes data "linkable". But it
would go too far to define what is "unlinkable". Having done research
on data being "unlinkable" (Slim Trabelsi/SAP has created a nice
script to determine the entropy allowing for de-anonymization), I
think a definition of "unlinkable data" would import that scientific
dispute into the Specification. I would not like that to happen, as
it would mean another point of endless debate. You can see this
already happening in the thread following your message.
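
To illustrate the kind of entropy measurement mentioned above, here is
a minimal sketch. It is not the SAP script: the function names, the
field names and the sample records are all hypothetical, and Shannon
entropy over attribute combinations is only one of several possible
measures.

```python
import math
from collections import Counter


def identifying_bits(records, fields):
    """Shannon entropy, in bits, of the value combinations of `fields`.

    More bits means the chosen fields split the dataset into more,
    smaller groups, which makes linking a record to a person easier.
    """
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values.

    A result of 1 means at least one record is unique on these fields.
    """
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return min(counts.values())


# Hypothetical sample data, for illustration only.
records = [
    {"zip": "06560", "browser": "Firefox"},
    {"zip": "06560", "browser": "Chrome"},
    {"zip": "06410", "browser": "Firefox"},
    {"zip": "06410", "browser": "Chrome"},
]

bits_zip = identifying_bits(records, ["zip"])
bits_both = identifying_bits(records, ["zip", "browser"])
```

In this toy dataset, adding the second field raises the entropy and
makes every record unique. Where to draw the line between "linkable"
and "unlinkable" on such a scale is exactly the dispute referred to
above.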

Just asking for data to be "unlinkable" leaves the art of making
that happen to every little webmaster in the world, instead of
using the expertise gathered here to find the right compromise
between the effort of anonymization and the privacy threat involved.

> (B) Why "commercially
> reasonable" rather than just "reasonable"?  The term "reasonable"
> already takes into account all relevant factors.  Can somebody
> give an example of something that would qualify as "commercially
> reasonable" but not "reasonable", or vice versa?  If not,
> "commercially" only makes the definition harder to understand.

Yes, I think "commercially" is definitely an accident in that
definition, especially as, in a democratic society, commercial
companies are allowed to be commercially unreasonable.

> (C) "there is confidence" seems to raise two questions.  First,
> who is it that needs to be confident?  Second, can the confidence
> be just an unsupported gut feeling of optimism, or does there
> need to be some valid reason for confidence?  Presumably the
> intent is that the party holding the data has justified
> confidence that the data cannot be linked, but if so it might be
> better to spell that out.

I think the "confidence" is a null/zero requirement. If someone
easily de-anonymizes data that you were confident about, the legal
system will choose the horizon of a "reasonable person". And by
using only light anonymization tools, you were not reasonable to assume

> (D) Why "it contains information which could not be linked" rather
> than the simpler "it could not be linked"?  Do the extra words
> add any meaning? (E) What does "in a production environment" add?
>  If the goal is to rule out results demonstrated in a research
> environment, I doubt this language would accomplish that goal,
> because all of the re-identification research I know of required
> less than a production environment.  If the goal is to rule out
> linking approaches that aren't at all practical, some other
> language would probably be better.

Ed, you can link together data that is not personal data. The
definition needs better wording here, because only linking personal
data with other personal data and other data creates a problem.
Linking data without any personal connotation is simply out of scope
of the entire privacy concept. I agree that the "production
environment" is meaningless.

Received on Sunday, 23 September 2012 19:32:30 UTC
