Re: Deidentification (ISSUE-188) from Rob van Eijk on 2014-08-14 (public-tracking@w3.org from August 2014)

From: Rob van Eijk <rob@blaeu.com>
Date: Fri, 15 Aug 2014 01:04:51 +0200
To: David Singer <singer@apple.com>
Cc: "Mike O'Neill" <michael.oneill@baycloud.com>, Justin Brookman <jbrookman@cdt.org>, public-tracking@w3.org
Message-ID: <5c4436a7b2c89ee5440d0a549bd782b9@xs4all.nl>
If the definition gets adopted, wouldn't it be fair to the user to
include text with a normative MUST for a party to provide detailed
information about the de-identification process(es) it applies?
Transparency should do its work to prevent "de-identification by
obscurity".

Is the group willing to consider such a normative obligation?

Rob

David Singer wrote on 2014-08-14 22:04:
> On Aug 14, 2014, at 12:20 , Mike O'Neill <michael.oneill@baycloud.com> 
> wrote:
> 
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> 
>> I agree, using a verb assumes that you already have data about people 
>> and you apply a de-identifying process to it. It is the process that 
>> is hard to define, without leaving loopholes.
> 
> Precisely.  I am not trying to define a process; I am defining by the
> result.  The result is a set of data that will never be linked to any
> specific user, user-agent, or device, by a suitable combination of
> technical (deidentification) measures, and restrictions (e.g. you are
> not allowed to try, you are not allowed to distribute this to anyone
> unless they agree not to try, and so on).  I don’t want to define the
> technical processes or the restrictions, just the result;  the data
> never gets linked to a specific user, user-agent, or device, ever
> again.
> 
>> 
>> What is in scope is tracking data, and DNT should just mean do not 
>> collect it (unless you claim a permitted use). If you have collected 
>> in error just delete it.
>> 
>> Maybe that is all we need to say.
> 
> Maybe you need to review where we use the term again.
> 
> 1.  Third parties can collect data if
>   a) they have an exception
>   b) they have a permitted use
>   c) it’s deidentified
> 2.  Unknowing collection.  You have to render the data out of scope or
> delete it, and out of scope means you have permanently deidentified
> it.
> 3. Discussed but not yet in the spec.: the ‘raw data’ problem
> (companies cannot process raw logs in real time). Keep the raw data
> until you can process it, but the raw data has only 3 possible exits
> (like third party data):
>   a) it’s identifiable, but the user allowed you to collect it
>   b) it’s identifiable, but there is a permitted use you claim and you
> adhere to the restrictions of that permitted use
>   c) it’s not identifiable, it’s been deidentified
> 
> For all these, we need a definition of what data in a deidentified
> state means.  To me, it means it’s got detached from any given user
> (user-agent, or device) and can and/or will never be reattached.
> 
>> 
>> Mike
>> 
>> 
>>> -----Original Message-----
>>> From: Rob van Eijk [mailto:rob@blaeu.com]
>>> Sent: 14 August 2014 19:55
>>> To: David Singer
>>> Cc: Justin Brookman; public-tracking@w3.org; Mike O'Neill
>>> Subject: Re: Deidentification (ISSUE-188)
>>> 
>>> The core of my issue, which may be a semantic issue, is that the
>>> current text is fixed on the word identification. To me it is not
>>> clear enough from the current definition that anything other than
>>> the 'one way street' is considered re-identification. The definition
>>> must be more specific on this point.
>>> 
>>> Does cookie-syncing (which is commonly used in real-time bidding)
>>> fall under the meaning of re-identification?
>>> 
>>> Rob
>>> 
>>> David Singer wrote on 2014-08-14 18:37:
>>>> Rob, I am sorry, I don’t follow you at all.
>>>> 
>>>> We say in a number of places that data passes out of our scope, and
>>>> hence we say nothing at all about it, once it has been deidentified.
>>>> We need to define what we mean by that, and we need to define that
>>>> ‘exit’ from our scope.
>>>> 
>>>> On Aug 14, 2014, at 2:08 , Rob van Eijk <rob@blaeu.com> wrote:
>>>> 
>>>>> 
>>>>> The text you propose connects the state of a permanently
>>>>> de-identified dataset to the possibility of identifying a
>>>>> user/user-agent or device.
>>>>> I think limiting the approach to identification is far too narrow.
>>>>> What is not covered is, for example:
>>>>> - the sharing (e.g. for data enrichment and data correlation).
>>>> 
>>>> if it doesn’t identify anyone, and won’t/can’t, we have nothing to
>>>> say about sharing it
>>>> 
>>>>> - the application of de-identified data to the individual user/user
>>>>> agent/device (e.g. for re-targeting).
>>>> 
>>>> That’s re-identification, and my text says (a) it ought not be
>>>> possible and (b) it ought not be permitted
>>>> 
>>>>> - the retention of data, meaning the length of time that would be
>>>>> allowed for bringing data into a de-identified state.
>>>> 
>>>> That’s a separate question: the ‘raw data’ question (and one of the
>>>> exits for raw data is that the data is deidentified)
>>>> 
>>>>> - any (unintended/unforeseen) data uses that may have an impact on
>>>>> (the personal space of) a user/user agent/device. For example,
>>>>> re-targeting based on de-identified data, or re-targeting based on
>>>>> correlation with de-identified data.
>>>> 
>>>> I don’t understand how one can target anyone if the data is
>>>> deidentified, and if it’s reidentified, then it wasn’t deidentified
>>>> to this definition (the definition insists it is a one-way street).
>>>> 
>>>>> 
>>>>> My proposal is to exclude text for de-identified data in order to
>>>>> aim for a cleaner specification.
>>>> 
>>>> Again, I don’t understand.  The point of defining it is to say “how
>>>> to get out of the scope of this spec.”.  For example, the raw data
>>>> clause I proposed says there are only 3 exits:
>>>> * you have permission from the user to retain the data
>>>> * you retain the data under a permitted use, in accordance with the
>>>> terms of that permitted use
>>>> * you deidentify the data so it passes out of our scope
>>>> 
>>>> 
>>>>> 
>>>>> Rob
>>>>> 
>>>>> David Singer wrote on 2014-08-14 01:58:
>>>>>> On Aug 8, 2014, at 6:54 , Mike O'Neill <michael.oneill@baycloud.com>
>>>>>> wrote:
>>>>> (...)
>>>>>> Trying another way of phrasing it:
>>>>>> Data is permanently de-identified (and hence out of the scope of
>>>>>> this specification) when a sufficient combination of technical
>>>>>> measures and restrictions ensures that the data does not, and
>>>>>> cannot and will not be used to, identify a particular user,
>>>>>> user-agent, or device.
>>>>>> Note: Usage and/or distribution restrictions are strongly
>>>>>> recommended for any dataset that has records that relate to a
>>>>>> single user or a small number of users; experience has shown that
>>>>>> such records can, in fact, sometimes be used to identify the
>>>>>> user(s) despite the technical measures that were taken to prevent
>>>>>> that happening.
>>>>>> David Singer
>>>>>> Manager, Software Standards, Apple Inc.
>>>> 
>>>> David Singer
>>>> Manager, Software Standards, Apple Inc.
>> 
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.13 (MingW32)
>> Comment: Using gpg4o v3.3.26.5094 - http://www.gpg4o.com/
>> Charset: utf-8
>> 
>> iQEcBAEBAgAGBQJT7QwIAAoJEHMxUy4uXm2J7vkIAOUDdIGXlCpvJw9U/KYAbjCN
>> I/T2dcIsN3Bd095aNyj+eTiC32sQ96Tc5+q//f9zLx+/CERbIy5/lOhfEQpC6z4z
>> gQuJC/Ol691owAGEQFAQEN7sZ4u5nhFFuJzhPnZILBi9tzBj4wLByxskGgf3yMyT
>> rlYi50rZpTghA4QOKvszDxAgP/hyRnk2cjWcCCjaiMWVKQh3j7aKUtit4JgU/JKb
>> ME50WRt43StzEtcaFfsPGHzwVjG/3z5wqEMWSTnwuyq68OfN8U3g0hmaDhJUzwoU
>> P5+tPJOImfOSr0H5eCIXQkKLP6sz8HSrt+HPcNrAO/uKCmIGKlD4AAqSe5Ji0gI=
>> =oQfL
>> -----END PGP SIGNATURE-----
>> 
> 
> David Singer
> Manager, Software Standards, Apple Inc.
Received on Thursday, 14 August 2014 23:05:45 UTC