RE: Proposal from Big Basin break out from Shane Wiley on 2013-05-11 (public-tracking@w3.org from May 2013)

From: Shane Wiley <wileys@yahoo-inc.com>
Date: Sat, 11 May 2013 20:49:38 +0000
To: Kevin Kiley <kevin.kiley@3pmobile.com>, "public-tracking@w3.org" <public-tracking@w3.org>
CC: "walter.van.holst@xs4all.nl" <walter.van.holst@xs4all.nl>, Brad Kulick <kulick@yahoo-inc.com>
Message-ID: <DCCF036E573F0142BD90964789F720E314076F14@GQ1-MB01-02.y.corp.yahoo.com>

Kevin,

While the tri-state de-identification scheme does not dictate specific IP Address replacement guardrails, I believe the "reasonable" tenet is the one to focus on here.  For example, if IP Address is replaced with Postal Code (5 digit, not 9 digit) then I believe most record sets would continue to be deemed de-identified.  But let's say another team is looking only at a hyper-location data subset, and the record set contains only the de-identified ID (separate key from other systems) and the lat/long for that ID.  With only these data points, a team can look at the frequency of events and geospatial clusters over time, but would not have the means to reverse identify the data set, as no side facts/data exist.  It's this type of balance that is difficult to prescriptively outline upfront, and it is why standards focus on principles and allow innovation to occur within those boundaries.
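
To make the postal code example concrete, here is a minimal sketch of reducing a 9-digit US ZIP+4 code to its 5-digit form before it stands in for the IP address. The helper name coarsen_postal_code and the US-only format are assumptions for illustration; nothing in the proposal prescribes this code.

```python
import re


def coarsen_postal_code(code: str) -> str:
    """Reduce a US ZIP or ZIP+4 code to the 5-digit ZIP.

    Hypothetical helper: accepts '94089', '94089-1234' or '940891234'.
    """
    digits = re.sub(r"\D", "", code)  # drop hyphens, spaces, etc.
    if len(digits) not in (5, 9):
        raise ValueError(f"not a 5- or 9-digit US postal code: {code!r}")
    return digits[:5]  # keep only the coarse 5-digit part
```

Whether 5 digits is coarse enough still depends on what else is in the record set, which is the "reasonable" judgment discussed above.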

- Shane

From: Kevin Kiley [mailto:kevin.kiley@3pmobile.com]
Sent: Saturday, May 11, 2013 1:32 PM
To: public-tracking@w3.org
Cc: walter.van.holst@xs4all.nl; Brad Kulick; Shane Wiley; Kevin Kiley
Subject: Re: Proposal from Big Basin break out

I think Walter is right to raise the issue of 'granularity' for geo data replacing IP addresses
in (supposedly) de-identified data and this needs more discussion. See comments (inline) below.

> On May 8, 2013, at 10:58 AM, Walter van Holst wrote:
>
>> Dear Brad,
>>
>> If I understand the document correctly, IP-addresses are 'de-identified' based on geolocation.
>> What would the lower floor of the granularity of such geolocation be?
>> Regards,
>> Walter
>
> Brad Kulick ( Yahoo ) responded...
>
>> Walter,
>> We did not explicitly discuss this point. Nor was there consideration to be prescriptive in this area.

Yet, after the Big Basin breakout, Shane Wiley (Yahoo) did report back to the group at the Sunnyvale F2F
that the 'level of granularity' HAD (apparently) been discussed.

Minutes from Sunnyvale F2F Day 3, following Big Basin breakout...
http://www.w3.org/2013/05/08-dnt-minutes

Shane said ( from the microphone to the general assembly )...

[snip]

Shane Wiley (Yahoo): Next step - remove IP and replace with *BROAD* geo data .

[/snip]

So this goes back to Walter's original question.

What did Shane mean by *BROAD* geo data (only)?

Country codes only? Postal codes only?... NEVER any Latitude/Longitude?

Needs clarification, obviously.

>> Brad Kulick (Yahoo) also wrote...
>>
>> The intention is that IP address is completely removed/replaced with geo data.

'Completely removed' is good... still not sure about 'replacing' it with ANYTHING. See additional concerns below.

>> The granularity of the geo data would be determined with relation to the risk of re-identification that should be managed by the data controllers.
>> Thanks,
>> Brad ( Kulick ) ( Yahoo )

I believe the conversion of IP address(es) to 'geo data' of almost ANY granularity creates a significant 'risk of re-identification', or
at least creates a direct violation of BOTH of the pending 'Deidentified Data' definitions in the current TCS.

From the latest (published) Working Draft of the 'Tracking Compliance and Scope' ( TCS ) deliverable...

Published April 30, 2013
http://www.w3.org/2011/tracking-protection/

[snip]

3.7 Deidentified Data

OPTION 1
Data is deidentified when a party:
(1) has taken measures to ensure with a reasonable level of justified confidence that the data cannot be used to infer information about,
or otherwise be linked to, a particular consumer, computer, or other device;
(2) does not try to reidentify the data; and
(3) contractually prohibits downstream recipients from trying to re-identify the data.

OPTION 2
Data can be considered sufficiently deidentified to the extent that it has been deleted, modified, aggregated, anonymized or otherwise
manipulated in order to achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information
about, or otherwise be linked to, a particular user, user agent, or device.

Note(s):

The first option above is based on the definition of unlinkable data in the 2012 FTC privacy report;
the second option was proposed by Daniel Kaufman. The group has a fundamental disagreement about whether internal
access controls within an organization could be sufficient to de-identify data for the purposes of this standard.

Issue 188: Definition of unlinkable data

Issue 191: Non-normative Discussion of De-Identification

[/snip]

The latest 'proposal diagram' for de-identification posted by Brad Kulick ( Yahoo ) on May 8, 2013...
http://lists.w3.org/Archives/Public/public-tracking/2013May/att-0045/Proposal_rev_2.pdf

[snip]

Paramount rules...

1. Once a record is de-identified it can never be re-ID'd
2. You can never create a mapping between raw and de-identified records

Steps...

1. Unique Ids
    a. One-way secret Hash
2. IP Address
    a. Replace w/geo data
3. URL cleanse
    a. Filter user specific clues
4. Side facts
    a. Remove elements that assist reverse ID
5. Unlink via 2nd application of one-way hash with salt/key #2, destroy salt/key #2 on some interval

Noteworthy: Accountability is required.

[/snip]
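
For readers who want to see the five steps above end to end, here is a rough sketch under stated assumptions. The field names (user_id, ip, url), the SIDE_FACT_FIELDS set, and the caller-supplied ip_to_geo function are all hypothetical, and HMAC-SHA256 is only one possible choice of "one-way secret hash". The diagram itself does not specify any of these.

```python
import hashlib
import hmac
import secrets
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of "side facts" to drop (step 4).
SIDE_FACT_FIELDS = {"user_agent", "email"}


def new_key() -> bytes:
    """Generate a fresh random key; the old key #2 is discarded on rotation."""
    return secrets.token_bytes(32)


def keyed_hash(key: bytes, value: str) -> str:
    """One-way keyed hash (HMAC-SHA256 assumed here)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def cleanse_url(url: str) -> str:
    """Step 3: drop query string and fragment, keep scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def deidentify(record: dict, key1: bytes, key2: bytes, ip_to_geo) -> dict:
    # Step 4: copy the record without side facts and without the raw IP.
    out = {k: v for k, v in record.items()
           if k not in SIDE_FACT_FIELDS and k != "ip"}
    # Steps 1 and 5: hash the ID with key #1, then again with key #2.
    out["user_id"] = keyed_hash(key2, keyed_hash(key1, record["user_id"]))
    # Step 2: replace the IP address with geo data of caller-chosen granularity.
    out["geo"] = ip_to_geo(record["ip"])
    out["url"] = cleanse_url(record["url"])
    return out
```

Once key #2 is destroyed, the second hash cannot be recomputed from the first, which is what is meant by "unlink" in step 5. Note that the sketch leaves the URL path untouched, and the path may itself carry user-specific clues.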

If step number 2a is allowed ( replace IP address with geo data rather than just REMOVE IP address ) then
this (potentially) breaks 'Paramount rule 1' in the new Wiley/Kulick proposal, according to either one of
the (current) optional definitions of 'de-identified data' currently codified in the TCS.

If the granularity of the geo data is not sufficiently restricted... then, at any time, the (supposedly)
de-identified data can still (easily) be linked to 'a specific computer or device', depending on the
realities of the underlying connection details.
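
To put rough numbers on 'granularity', here is a small sketch that rounds a latitude/longitude pair and estimates the size of the resulting cell. The function names are hypothetical, and the figure of about 111,320 metres per degree of latitude is an approximation, so treat the results as order-of-magnitude only.

```python
import math

# Approximate length of one degree of latitude, in metres.
METERS_PER_DEGREE_LAT = 111_320


def coarsen_lat_long(lat: float, lon: float, decimals: int) -> tuple:
    """Round coordinates to the given number of decimal places."""
    return (round(lat, decimals), round(lon, decimals))


def approx_cell_size_m(lat: float, decimals: int) -> tuple:
    """Approximate (north-south, east-west) cell size in metres.

    East-west distance shrinks with the cosine of the latitude.
    """
    step = 10 ** -decimals  # size of one rounding step, in degrees
    north_south = step * METERS_PER_DEGREE_LAT
    east_west = north_south * math.cos(math.radians(lat))
    return (north_south, east_west)
```

Under this approximation, one decimal place gives cells roughly 11 km tall, while four decimal places gives cells roughly 11 m tall, which is plausibly small enough to point at a single building.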

If the accepted definition of de-identified data becomes OPTION 2... then it most certainly would
ALSO violate the 'used to infer information' clause of that definition under ANY circumstances.

Yours,
Kevin Kiley
Received on Saturday, 11 May 2013 20:50:20 UTC