Re: Deidentification (ISSUE-188)

On Jul 18, 2014, at 2:38 AM, TOUBIANA Vincent wrote:

> Hi Roy,
>  
> I should have replaced “record” by “transaction”, which would have made things clearer. In the example you give, all the records would be considered as one transaction, thus solving the problem.

That doesn't make it more clear.  You are just replacing one commonly used
database term by another term that is used in both databases and commerce.
Depending on who you talk to, a transaction could be a single operation, a
set of related operations, or any large number of operations that eventually
result in an exchange of goods.

None of which has anything to do with linkability.

> With the property on linkability, in an anonymized dataset you should not be able to link two transactions from the same device and you should not be able to link a transaction to another dataset.

No.  That is simply wrong.  All session-based interactions with users
depend on the linking of multiple interactions over time, each of which
must remain linked in the dataset if the site is going to make any
meaningful use of them.  Linking data records doesn't have anything to
do with privacy or EU data protection.

That does not, in any way, imply that the data set remains linked to
the user, which is what linkability means to data protection. (Linking
to the user's device is just an indirect linking to the user.)
The de-identified data can remain linked together as related interactions
after the identifying data has been removed from all records, which
includes removal of information in the dataset that might be unique
to a small set of users (queries, real times, etc.).
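
As a rough illustration of that distinction (a sketch only: the field
names, the choice of which fields count as identifying, and the
deidentify() helper are all hypothetical, not part of any proposed
text), records can lose their identifying data while staying grouped
as related interactions:

```python
import secrets
from collections import defaultdict

# Hypothetical field names, chosen for illustration only.
IDENTIFYING_FIELDS = {"user_id", "ip", "query"}

def deidentify(records):
    """Strip identifying fields but keep one session's records grouped."""
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)

    output = []
    for session_records in by_session.values():
        # Random group label; its mapping to session_id is never stored.
        group = secrets.token_hex(8)
        ordered = sorted(session_records, key=lambda r: r["timestamp"])
        for step, rec in enumerate(ordered):
            kept = {k: v for k, v in rec.items()
                    if k not in IDENTIFYING_FIELDS
                    and k not in ("session_id", "timestamp")}
            kept["group"] = group
            kept["step"] = step  # relative order replaces the real time
            output.append(kept)
    return output
```

The records in each group stay linked to one another, but nothing in
the output points back to the session, device, or user.  Whether the
fields that remain are still unique to a small set of users is a
separate question that this sketch does not answer.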

> In the Article 29 Opinion, linkability is defined as “the ability to link, at least, two records concerning the same data subject or a group of data subjects (either in the same database or in two different databases).”

The Article 29 definition of linkability is simply wrong: it seems
to be entirely misinformed about what that term means with regard to
data protection. Maybe that's why we are calling it de-identification
instead.

> I did not adapt this definition to the DNT context correctly; here is a more suitable definition:
>  
> A data-set is de-identified when it is no longer possible to:
> - isolate some or all transactions which correspond to a device or user,
> - link two transactions concerning the same device or user (either in the same database or in two different databases),
> - deduce, with significant probability, information about a user or device.
>  
> Thank you for your feedback.

You are trying to prevent identifying a user via data correlation and
have construed the definition as if that is all that matters.
As a side-effect, you are preventing normal operation of a site
in terms of evaluating which user agent software doesn't work well,
or where to place UI elements in a window, or what sets of content
lead to a conversion (as opposed to boredom or leaving the site).

My proposed text says that the dataset is de-identified when it cannot
be used to identify a particular user.  How it might be used to do so
is irrelevant -- any mechanism applies, including data correlation.
How the data is constructed doesn't matter.  How it is combined with
other datasets doesn't matter.  The only thing that matters is whether
the dataset is capable of revealing anything about a particular user.

....Roy

Received on Friday, 18 July 2014 18:14:20 UTC