Re: de-identification text for Wednesday's call from Dan Auerbach on 2013-04-03 (public-tracking@w3.org from April 2013)

From: Dan Auerbach <dan@eff.org>
Date: Wed, 03 Apr 2013 00:42:58 -0700
To: public-tracking@w3.org
Message-ID: <515BDD82.3060205@eff.org>
On 04/02/2013 10:15 PM, Shane Wiley wrote:
>
> Dan,
>
>  
>
> Thank you for being open to further discussion (hopefully not
> considered 'simply arguing').
>
>  
>
> I can't speak to your experience with companies you've worked for in
> the past but I can say that a company that is truly committed to
> Accountability (as a Privacy concept) takes technical, operational,
> and administrative controls very seriously (documentation, design
> reviews, testing, auditing, etc.).  It appears you're seeking a way to
> capture bad actors for not living up to their promises (say they
> de-identify and then due to a poorly designed de-identification
> process or operational controls they allow re-identification to occur
> -- or perhaps more likely that data reserved for a Permitted Use is
> used for a non-permitted use) -- is that correct?
>
Permitted uses are limited in the scope of data collection and retention
based on their function. De-identified data, on the other hand, is out
of scope for DNT and so can be retained for dozens or hundreds of years. As
such I think it is a much more dangerous avenue for abuse than a
permitted use. If by a "bad actor" you mean someone who doesn't
thoroughly de-identify data, then yes, this is my concern. But I don't
think it takes a malicious actor to become convinced that a certain
de-identification process is good enough, when it is not adequate. One
litmus test would be: if a court order forced you to turn over lots of
data to law enforcement (one of a myriad of concerns a user could have
when setting DNT:1), could law enforcement plausibly infer information
about individuals? If yes, I think that the company did not do its duty
to de-identify based on the normative text.

Perhaps privacy penetration testing might offer a way forward, if in
fact we disagree about the re-identification or attribute disclosure
risk. But we may not disagree about the risk, just the acceptable level
and type of risk. I'm not sure. Here's a concrete scenario to help sort
this out: suppose at the end of your proposed de-identification process,
you let Arvind Narayanan loose on your data. Suppose he is able to poke
holes in your methodology and infer information about individuals. Would
this require you to change your data practices? Could you still be
considered a good actor, even if Arvind could plausibly poke these
holes, and yet you aren't making any changes to the de-identification
process in response?


>  
>
> If that is correct, I don't have an easy answer for you as this same
> issue exists for almost all company privacy promises that involve
> internal management of data.  I would recommend we take the broader
> topic of 'external scrutiny of internal data practices' to a different
> thread/working group.  It's my assumption that if a W3C DNT standard
> emerges that does invite voluntary implementation by industry, that
> trade associations will step in to include this in their Codes of
> Conduct and regular audits of member compliance.  I don't see an
> immediate solution that would give EFF access to every online
> company's backend systems to self-review their practices for compliance.
>
>  
>
> I share your desire to provide meaningful guidance to implementers of
> de-identification and do this in a way as to not limit innovation in
> this area and not lock companies into solutions that harm their
> ability to operate as a business.  I don't believe your non-normative
> text strikes this balance and offered to come back to the group next
> week with a few alternate examples that better find this balance.  If
> you're open to that approach, we can open an action item for me to
> deliver next week.
>
I'm certainly happy to read your example text, and I hope we can come to
consensus about it. But I suspect we have a deep disagreement, and think
we need to have evidence-based discussions going forward. Perhaps, as
you suggest, we could have competing text in the draft document, to be
resolved later. We could also try to tackle this head on, if you and I
(and anyone interested) separately try to come to some agreement about
the text and report back to the group. Happy to take suggestions about
the best way forward, as I'm really not sure.

>  
>
> Thank you,
> Shane
>
>  
>
> *From:* Dan Auerbach [mailto:dan@eff.org]
> *Sent:* Tuesday, April 02, 2013 3:59 PM
> *To:* public-tracking@w3.org
> *Subject:* Re: de-identification text for Wednesday's call
>
>  
>
> Shane,
>
> Thanks for your response, and I suppose we can talk more tomorrow,
> though we should probably try to identify a path forward as opposed to
> simply arguing. For example, I would welcome any empirical
> investigation into the efficacy of various approaches, including
> operational ones.
>
> I think the spectrum is wider than you suggest and ranges from no
> change at all to deletion of data. In practice I think deleting data
> is easy and for many actors the most straightforward way to comply
> with DNT. My position is far less aggressive than insisting on
> deleting data. On what I would consider to be the "lighter" end of the
> spectrum, there are solutions like yours that espouse an undefined mix
> of technical data manipulation measures and operational controls.
> Having worked in industry, I'm very skeptical of the value of
> operational controls, and think that in practice there will be very
> little accountability. If you'd like to provide more detailed
> background about the operational controls you have in mind or
> empirical evidence of the efficacy of such controls, I am certainly
> willing to listen.
>
> I agree that we shouldn't be too prescriptive about particular methods
> of de-identification, but I think it's important for the standard to
> come to a clear conclusion about the place in the spectrum that counts as
> good enough. I don't think that saying, "there's a spectrum, and
> anything even on the lighter side is OK because companies will be held
> accountable" is the right answer. My view is that we need to be
> specific enough to define the contours of what is acceptable, without
> being too prescriptive, and that's the balance that I hoped to strike
> with my normative and non-normative text.
>
> On 04/02/2013 02:36 PM, Shane Wiley wrote:
>
>     Dan -- my apologies -- I used those terms to represent one end of
>     a spectrum.  Let's step away from those particular terms to
>     represent the most conservative side of the spectrum and instead
>     speak to the range of possible points of application of consumer
>     protection with respect to de-identification.
>
>      
>
>     On one side of the spectrum we have a possible solution that is
>     completely based on administrative and operational controls (as
>     you pointed out earlier) and on the other side of the spectrum is
>     a possible solution that completely relies on technology alone
>     ("and provides for leeway as anonymization technology improves"). 
>     In the middle is a solution that is a mix and allows organizations
>     the latitude to devise a solution that matches their ability to
>     mitigate risk and maintain some value in data -- while not being
>     overly prescriptive on the specific technology employed.  All
>     points on the spectrum focus on the desired outcome -- that
>     de-identified data not later be re-identified. 
>
>      
>
>     While we both agree on the outcome (I believe), we disagree on the
>     best route to get there.  I would suggest we move away from
>     prescriptive elements in the text and provide organizations the
>     flexibility to innovate and build solutions they believe meet the
>     goal.  As long as a company is stating they will meet the goal,
>     you have all of the necessary commitment to hold them accountable
>     to that outcome.
>
>      
>
>     *Evidence:*  I believe the perceived harm is that de-identified
>     data can later be re-identified, fair?  If yes, then any solution
>     in the spectrum serves that purpose but forces companies most
>     likely to the middle or more conservative side of the spectrum to
>     mitigate their risks without being overly prescriptive.
>
>      
>
>     *Industry implementation of a standard that is too weak will be
>     worse for users than having no standard at all, since it will
>     provide only the guise of protection:  *Compared to today's world,
>     even this "in the middle" solution to de-identification GREATLY
>     moves the needle in the correct direction -- I hope you agree.
>
> As a user, I'd prefer having no DNT at all to a de-identification
> approach that I would consider too soft. At least with no DNT, I have
> no false expectation that my data isn't being collected and retained.
>
>
>   Will there be bad actors or companies that overstate their ability
> to maintain a de-identified state and fail?  Absolutely.  This
> approach ensures they will be appropriately held accountable and at
> the same time allow good actors an attractive (voluntary) option to
> enhance consumer privacy.
>
>  
>
> *Your concerns:*  I do not mean to dismiss your concerns and hopefully
> have done a better job this time of fairly representing a shared goal
> and spectrum of options to achieve that goal.  I believe our best path
> forward is to agree on the goal and avoid prescriptive requirements on
> exactly how to achieve it.  While we each have our own views into what
> the rest of the Working Group holds out as priorities (which can only
> be proven through a vote which the W3C likes to avoid), I hope this
> compromise position both represents a meaningful advance for consumer
> protection for consumer advocates and provides enough flexibility for
> voluntary adoption by industry.
>
>  
>
> - Shane
>
>  
>
> *From:* Dan Auerbach [mailto:dan@eff.org]
> *Sent:* Tuesday, April 02, 2013 1:39 PM
> *To:* public-tracking@w3.org <mailto:public-tracking@w3.org>
> *Subject:* Re: de-identification text for Wednesday's call
>
>  
>
> Shane,
>
> Labeling my view as "totalitarian" or "absolutist" is inaccurate and
> not appreciated. My approach allows lots of leeway as anonymization
> technology improves, and takes into account that perfect anonymization
> is impossible. It seems to me that you plan to label anything
> "totalitarian" that suggests that keeping raw logs is not a viable
> approach in general. To be clear, I don't even want to suggest that
> keeping raw logs is *always* prohibited, just that it should be in the
> normal case where records have relatively high entropy. I'm also happy
> to discuss examples in great detail, but in my experience you have
> been unwilling to engage in discussing the nitty gritty details of
> what anonymization should entail.
>
> On 04/02/2013 01:09 PM, Shane Wiley wrote:
>
>     Dan,
>
>
>     As with HIPAA, I believe differentiated treatment of internal and
>     external datasets is appropriate as this changes the risk profile
>     of re-identification -- again, the root of the conversation being
>     a "risk-based" approach versus a totalitarian approach as you
>     suggest.  My solution meets the perceived consumer harm in this case
>
> Do you have evidence for this claim?
>
>
>
>
> -- yours of course does as well but goes far too far over the top of
> what is actually needed.  If your goal is to create a compromise
> end-point that will likely be implemented by industry then my
> recommended approach gets us there.
>
> Any compromise must provide meaningful protection to users. Industry
> implementation of a standard that is too weak will be worse for users
> than having no standard at all, since it will provide only the guise
> of protection.
>
>
>
>   If you'd like to instead stand by absolutist approaches, that is of
> course your prerogative and we'll have those removed through the
> standard W3C process.  I'm simply trying to save everyone some time
> and get to a meaningful outcome quickly.
>
> Glibly dismissing my concerns is not a way to gain allies, or move the
> W3C process forward. I don't think your view is as universally shared
> as you seem to think that it is.
>
>
>
>
>  
>
> - Shane
>
>  
>
> *From:* Dan Auerbach [mailto:dan@eff.org]
> *Sent:* Tuesday, April 02, 2013 12:21 PM
> *To:* public-tracking@w3.org <mailto:public-tracking@w3.org>
> *Subject:* Re: de-identification text for Wednesday's call
>
>  
>
> Shane,
>
> Why hash at all in this case? If you are relying on operational and
> administrative controls, you might as well just pledge not to look up
> the cookie when you receive it. If you are rotating (and discarding)
> salts frequently, then it will have a positive effect, but otherwise I
> don't think hashing provides any benefit here.
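A minimal sketch of the salt rotation point above, assuming a keyed hash (HMAC-SHA256) stands in for "salted hashing". The class name `RotatingPseudonymizer` and its methods are hypothetical and do not describe any company's actual pipeline. Within one salt period the same cookie always maps to the same pseudonym, so records remain linkable; once the salt is replaced and the old one discarded, new pseudonyms no longer match the old ones.

```python
import hashlib
import hmac
import secrets


class RotatingPseudonymizer:
    """Hypothetical helper: pseudonymize cookies under a salt that can be rotated."""

    def __init__(self) -> None:
        # Random secret salt, held only in memory (an assumption of this sketch).
        self._salt = secrets.token_bytes(32)

    def pseudonym(self, cookie: str) -> str:
        # Keyed hash: without the salt, the pseudonym cannot be recomputed
        # from the cookie string alone.
        return hmac.new(self._salt, cookie.encode("utf-8"), hashlib.sha256).hexdigest()

    def rotate(self) -> None:
        # Replace the salt; the old value is dropped, so pseudonyms from the
        # previous period can no longer be regenerated from a cookie.
        self._salt = secrets.token_bytes(32)


p = RotatingPseudonymizer()
before = p.pseudonym("uid=abc123")
same_period = p.pseudonym("uid=abc123")  # equal to `before`: still linkable
p.rotate()
after = p.pseudonym("uid=abc123")        # differs from `before`: link is broken
```

If the salt is never rotated, or old salts are kept, the mapping stays stable indefinitely and the hash adds little beyond a renamed identifier.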
>
> But this is an aside to our main disagreement about the larger issue
> about the role that operational and administrative controls should
> play. I agree that they should play a role, but only after
> de-identification of data has been achieved. If the result of a DNT:1
> request is business as usual, with minor scrubbing and the caveat that
> only 4000 engineers at a large corporation get default access to a
> specially marked database instead of 10000, then that will not be a
> successful standard. (Of course I welcome more detailed information
> about operational and administrative controls.)
>
> One last point I wanted to make is that of course the data sets I
> mentioned refer to public data. We don't have access to internal
> corporate data sets. There are laws in place against the pilfering
> of that data, so of course no one is going to steal data then publish
> an academic paper about it, effectively painting a big target on
> themselves for federal prosecutors and corporate legal teams. In light
> of this, the right empirical question to ask is: of large publicly
> available data sets that contain user data and are somewhat akin to
> log data, how often are there successful re-identification or
> attribute disclosure attacks? Can you point to any public data sets
> where such an attack has not been found?
>
> If your argument is instead that public data should be treated
> differently from non-public data, then I'd suggest that this is out of
> scope for the DNT conversation. DNT is about giving users the choice
> to opt out of tracking by companies, which must entail meaningfully
> curbing data collection and retention by that company, not merely a
> request that a company not make public its collected data. (Indeed, in
> addition to being an excessively weak demand by the user, this would
> in some cases be a vacuous request, since making that information
> public is already prohibited by law.) The de-identification question
> exists within the scope of what the companies themselves can do with
> the data -- is the data de-identified with respect to the entity that
> collected the data?
>
> Best,
> Dan
>
> On 04/02/2013 11:03 AM, Shane Wiley wrote:
>
>     Dan,
>
>      
>
>     Once the one-way hash is applied (and other elements of record
>     appropriately cleansed) the data is moved to a system that is not
>     allowed to be accessed externally.  It's these operational and
>     administrative controls that are essential to ensure de-identified
>     data is not re-identified at some later time.  I believe you're
>     looking only at the technical merits which is only seeing a small
>     portion of the overall solution.
>
>      
>
>     - Shane
>
>      
>
>     *From:* Dan Auerbach [mailto:dan@eff.org]
>     *Sent:* Tuesday, April 02, 2013 10:59 AM
>     *To:* public-tracking@w3.org <mailto:public-tracking@w3.org>
>     *Subject:* Re: de-identification text for Wednesday's call
>
>      
>
>     On 04/02/2013 08:50 AM, Shane Wiley wrote:
>
>         once the one-way hash function has been applied the data is
>         never again able to be accessed in real-time to modify the
>         user's experience.
>
>     I think I'm confused, can you explain this more? How is this
>     possible? If you are just hashing a cookie string, your web server
>     receives a request that includes a cookie string, you hash that
>     cookie string (which is an incredibly fast operation), match the
>     hashed cookie against the stored data, and return personalized
>     results.
>
>     Or are you salting the hash differently for every request, or
>     combining the cookie with an ephemeral piece of data (the
>     timestamp) before hashing and then throwing away the timestamp?
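The lookup described in the question above can be sketched as follows, assuming an unsalted SHA-256 hash of the raw cookie string. The `profiles` store, the cookie values, and the function names are hypothetical, chosen only for illustration. The point is that a deterministic hash can be recomputed on every request, so the hashed key works as a real-time identifier just as the raw cookie would.

```python
import hashlib
from typing import Optional


def hash_cookie(cookie: str) -> str:
    """Unsalted one-way hash of a cookie string (SHA-256 is an assumption)."""
    return hashlib.sha256(cookie.encode("utf-8")).hexdigest()


# Hypothetical store of "de-identified" records keyed by the hashed cookie.
profiles = {
    hash_cookie("uid=abc123"): {"interests": ["cycling", "travel"]},
}


def lookup(cookie: str) -> Optional[dict]:
    """Re-hash the cookie from an incoming request and fetch the stored record."""
    return profiles.get(hash_cookie(cookie))


# The server sees the same cookie again, hashes it, and finds the profile.
returning_user = lookup("uid=abc123")
unknown_user = lookup("uid=zzz999")
```

Mixing a discarded per-request value (such as the timestamp) into the hash input would make the stored keys impossible to recompute, which is the alternative the question raises.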
>
>     Thanks for clarifying, apologies if I'm just being dense.
>
>     Dan
>
>
>
>
>
>     -- 
>
>     Dan Auerbach
>
>     Staff Technologist
>
>     Electronic Frontier Foundation
>
>     dan@eff.org <mailto:dan@eff.org>
>
>     415 436 9333 x134


-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
Received on Wednesday, 3 April 2013 07:43:34 UTC