Re: Deidentification (ISSUE-188)

On Aug 18, 2014, at 12:12 PM, David Singer <singer@apple.com> wrote:

> 
> On Aug 17, 2014, at 6:35 , Mike O'Neill <michael.oneill@baycloud.com> wrote:
> 
>>>> b) the deidentification measures should be described in a form that is at least as available as the data (i.e. publicly, if the data itself will be made public).
>> 
>> Why not publicly in every case? If someone collects DNT data and intends to share it privately amongst their friends, we should know how they shred the PII out of it.
> 
> OK.  The term {permanently deidentified} below is a candidate for being replaced by a new name of our choosing (e.g. “permanent non-tracking data”), here and where it is used.  How is this?  I made the second clause not a note, as it contains ‘should’ and ‘strongly recommended’ i.e. it is not merely informative.
> 
> * * * * *
> 
> Data is {permanently de-identified} (and hence out of the scope of this specification) when a sufficient combination of technical measures and restrictions ensures that the data does not, and cannot and will not be used to, identify a particular user, user-agent, or device.
> 
> In the case of datasets that contain records that relate to a single user or a small number of users:
> a) Usage and/or distribution restrictions are strongly recommended; experience has shown that such records can, in fact, sometimes be used to identify the user(s) despite the technical measures that were taken to prevent that happening.
> b) The de-identification measures should be described publicly (e.g. in the privacy policy).

OK, we agreed to present this as an option at the Call for Objections today.  I was not following IRC that closely, but Nick indicated that Roy or Vincent may be satisfied with this option, or might be willing to withdraw their proposals.  Roy, Vincent, let me know what you want to do, and we can proceed to a CfO on this issue.

> 
> 
>> 
>> Mike
>> 
>> 
>>> -----Original Message-----
>>> From: David Singer [mailto:singer@apple.com]
>>> Sent: 15 August 2014 23:35
>>> To: <public-tracking@w3.org>
>>> Cc: Mike O'Neill; Justin Brookman; rob@blaeu.com
>>> Subject: Re: Deidentification (ISSUE-188)
>>> 
>>> 
>>> On Aug 14, 2014, at 16:04 , Rob van Eijk <rob@blaeu.com> wrote:
>>> 
>>>> 
>>>> If the definition gets adopted, wouldn't it be fair to the user to include text
>>> with a normative MUST for a party to provide detailed information about the
>>> de-identification process(es) it applies? Transparency should do its
>>> work to prevent "de-identification by obscurity".
>>>> 
>>>> Is the group willing to consider such a normative obligation?
>>>> 
>>> 
>>> On Aug 15, 2014, at 9:25 , Lee Tien <tien@eff.org> wrote:
>>> 
>>>> EFF agrees: transparency in de-identification methods is very important and is
>>> far superior for users than the old-school "expert certification without showing
>>> your work" approach.
>>>> 
>>> 
>>> 
>>> I can’t answer for the group, but there are a few points to ponder.
>>> 
>>> It could be a best practice to describe what you do, especially in the case of data
>>> sets that have per-user records.  Researchers love to critique those.  (See
>>> below).
>>> 
>>> But, on the other hand, there are myriad ways in which data that was
>>> identifiable gets deidentified.  How far do they have to trace it, and how many
>>> ways?
>>> 
>>> "We count the number of visitors coming from the major web browsers, as
>>> aggregate counts.  Separately, we log the US state, or country, and visit date
>>> (but not time) of every visitor.  We keep separate aggregate buckets of the
>>> number of visitors we estimate to be aged 0-16 years old, 16-21, 21-30, 31-50,
>>> and 50+.  For every visit, we record the date/time that an ad was served, and
>>> what ad was served (this is the only database with per-visit records). [[and so
>>> on]]"
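
(An inline aside: the aggregate-only logging sketched above might look, very roughly, like the following. The field names, bucket boundaries, and data structures here are my own illustration for concreteness, not anything from the spec or from anyone's actual system.)

```python
from collections import Counter
from datetime import datetime, timezone

# Purely illustrative sketch of the aggregate logging described above.
# Names and buckets are made up; only ad_log holds per-visit records.

AGE_BUCKETS = [(0, 16), (16, 21), (21, 30), (31, 50), (50, 150)]

browser_counts = Counter()     # aggregate count of visitors per browser
region_log = []                # (US state or country, visit date) -- date only, no time
age_bucket_counts = Counter()  # tallies of estimated-age buckets
ad_log = []                    # (date/time ad served, ad id): the only per-visit database

def record_visit(browser, region, estimated_age, ad_id):
    now = datetime.now(timezone.utc)
    browser_counts[browser] += 1
    region_log.append((region, now.date()))      # deliberately drop the time
    for lo, hi in AGE_BUCKETS:
        if lo <= estimated_age <= hi:            # first matching bucket wins
            age_bucket_counts[(lo, hi)] += 1
            break
    ad_log.append((now, ad_id))                  # per-visit record

record_visit("BrowserA", "California", 25, "ad-17")
record_visit("BrowserB", "France", 40, "ad-17")
```

The point being: even a short description like this covers four differently-shaped stores, each de-identified a different way, which is what makes a blanket disclosure MUST hard to scope.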
>>> 
>>> It sounds as though you support the text, but want an additional
>>> requirement for some kinds (all kinds?) of data.  Can you express what that is?
>>> Perhaps it could be added to the note on per-user datasets?  I give it a try below.
>>> 
>>> 
>>> On Aug 15, 2014, at 9:07 , Mike O'Neill <michael.oneill@baycloud.com> wrote:
>>> 
>>>> As I said, I do not think the old definition of de-identified works for the third-
>>> party compliance section (or any statement describing data as out-of-scope of
>>> DNT). It assumes that identifying (tracking) data has been collected and some
>>> process other than deletion can be applied to it to make it safe.
>>> 
>>> That is one of the cases, but in general yes, the use of the term is only of interest
>>> to us to describe what happened to in-scope data to make it out-of-scope.  We
>>> are not interested in data that was never in scope, and we handle data that
>>> remains in scope elsewhere.
>>> 
>>>> I suggested we use a new definition for out-of-scope data, e.g. anonymous data
>>> (from which it is mathematically impossible to derive identity, or to link it to an
>>> individual in a subsequent network interaction), leaving the definition of the
>>> de-identifying process for the permitted use section (data collected unknowingly
>>> in error should just be deleted).
>>> 
>>> I don’t mind what term we use for it.  We can invent our own new word if we
>>> like (‘noa’). It’s the concept we need to nail down.  I suggest a new phrase
>>> below.
>>> 
>>>> I agree your "data does not, and cannot and will not" implies impossibility,
>>> and the dreaded "reasonable" has gone, which is good.  Though the
>>> non-normative bit counteracts that somewhat by calling for distribution
>>> restrictions (which are not needed if the data "cannot" be re-identified).
>>> 
>>> You ‘cannot’ because it’s both believed impossible and you are not allowed to
>>> try (some suitable combination).  The note explains that you probably want to be
>>> restrictive on datasets that contain per-user records.  The ‘cannot’ is reflecting
>>> both the lack of an ability (possibility) and the lack of permission.
>>> 
>>>> I agree with Rob that a new definition would probably be superfluous given
>>> our definition of tracking implying in-scope data as : "..  data regarding a
>>> particular user's activity across multiple distinct contexts".
>>>> 
>>>> The problem I have is that, with the other-contexts qualification, machine
>>> discoverability becomes tricky.  This could create a loophole if collected data
>>> with a UID is out-of-scope when the controller promises to wear tunnel-vision
>>> glasses.
>>> 
>>> If it’s possible (by looking up the UID in some dataset) then I don’t think the data
>>> is deidentified.  That’s like saying I don’t have a martini because I keep the gin
>>> and vermouth separate.
>>> 
>>> 
>>> * * * *
>>> 
>>> Actually, Mike’s point that it apparently doesn’t correspond to the definition of
>>> tracking is well-taken. On the face of it, it should say that the data can no longer
>>> associate the user with another context; but of course, you are about to give
>>> the data away to another context, or publicly and hence to all other contexts,
>>> and the data is (by virtue of its origins) associated with your context as its origin.
>>> The only way to have it not associate the user with a context that is not the
>>> recipient is to have it not identify the user at all, which is what we have.  Here I
>>> re-state with an attempt to respond to Rob and Lee:
>>> 
>>> * * * *
>>> 
>>> Data is permanently de-identified (and hence out of the scope of this
>>> specification) when a sufficient combination of technical measures and
>>> restrictions ensures that the data does not, and cannot and will not be used to,
>>> identify a particular user, user-agent, or device.
>>> 
>>> Note: In the case of datasets that contain records that relate to a single user or a
>>> small number of users:
>>> a) Usage and/or distribution restrictions are strongly recommended;
>>> experience has shown that such records can, in fact, sometimes be used to
>>> identify the user(s) despite the technical measures that were taken to prevent
>>> that happening.
>>> b) the deidentification measures should be described in a form that is at least as
>>> available as the data (i.e. publicly, if the data itself will be made public).
>>> 
>>> * * * *
>>> 
>>> Would people prefer a term like “permanent non-tracking data” for this
>>> definition, and not (re-) or (ab-) use the existing term “deidentified”?
>>> 
>>> 
>>> David Singer
>>> Manager, Software Standards, Apple Inc.
>>> 
>> 
>> 
>> 
> 
> David Singer
> Manager, Software Standards, Apple Inc.
> 
> 

Received on Wednesday, 20 August 2014 22:30:34 UTC