pushing for progress on 3.6: unlinkable data and logging

Hi Shane and all,

I wanted to forge ahead on how information should be logged for DNT:1 users after the retention period during which raw logs are allowed. This was an area of debate at the f2f, and there seemed to be an impasse about how data should be anonymized and the extent to which it should be rendered unlinkable. I think a meaningfully privacy-protective change to logging is crucial to the DNT spec, so drilling down on this issue is quite important.

To be clear, I understand that changing one's logging pipeline does not happen overnight, and I am happy for there to be a transition period for companies to get their houses in order. Moreover, in addition to the transition phase, keeping un-anonymized logs for a short period seems sensible to me, both for permitted uses and to allow companies to have failure-resistant logging pipelines. But let's talk about data after this raw-log period and transition phase.

I'd like to take 1024-unlinkability as a starting point, though I am not insisting on this particular standard. However, in order to make progress, I'd like to see some concrete examples of the difficulties of keeping only aggregated data (after the N-week period where raw data is allowed). There were general statements at the f2f about how hard it is for small businesses to change their logging schemes, and some suggestion that storing only aggregated data would be more limiting. But the discussion was vague, and I'd like specifics. Without specifics it will be hard to make progress, and I remain skeptical of the claim that it is infeasible to keep only aggregate data. A rough sketch of what I mean by aggregate-only retention follows.
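To make this concrete, here is a small Python sketch of the kind of retention I have in mind. The field names and the counts-per-URL grouping are just illustrative assumptions on my part, not a proposed requirement:

    # Collapse request-level rows into counts per URL, dropping IPs,
    # cookies, and any other per-user fields entirely.
    from collections import Counter

    raw_requests = [
        {"ip": "203.0.113.57", "cookie": "abc123", "url": "/some_article"},
        {"ip": "198.51.100.1", "cookie": "def456", "url": "/some_article"},
        {"ip": "203.0.113.57", "cookie": "abc123", "url": "/front_page"},
    ]

    counts_by_url = Counter(r["url"] for r in raw_requests)
    print(dict(counts_by_url))
    # {'/some_article': 2, '/front_page': 1} -- no per-user rows remain,
    # so nothing retained here can be joined back to an individual's
    # browsing history.

Obviously a real aggregation pipeline is more involved than this, but the point is that only group-level counts survive past the raw-log window.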

I also want to be clear that I think hashing certain identifying fields and occasionally rotating the salt is NOT a reasonable solution. Depending on which fields are hashed and how often the salt is rotated relative to how often Internet users organically change IPs, cookies, etc., it might be close to 0% better than regular logging. This is not even taking into account the ineffectiveness of hashing data that lives in a small space, as Ed has pointed out (see the sketch below). Shane, would you care to flesh out this scheme more? What fields would be hashed? Would URLs and referers be hashed? If not, it seems deeply problematic to me that companies would continue to be able to link together, e.g., a request to example.com/shanes_private_url with a request to example.com/url_of_sensitive_medical_condition. If URLs/referers are hashed, though, this seems far less valuable to companies than aggregated data grouped by the actual URL -- the solution I'm advocating. How often would the salt be rotated? Would the salt be deleted entirely?
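To illustrate Ed's small-space point, here is a rough sketch. The salt value, the choice of SHA-256, and brute-forcing over IP addresses are all my own illustrative assumptions about how such a hashing scheme might work, since the details haven't been spelled out:

    # If the salt is retained, a salted hash of a small-space identifier
    # (e.g. an IPv4 address) can simply be inverted by enumeration.
    import hashlib

    salt = b"example-rotating-salt"  # hypothetical per-period salt

    def pseudonymize(ip):
        return hashlib.sha256(salt + ip.encode()).hexdigest()

    logged_value = pseudonymize("203.0.113.57")  # what lands in the "anonymized" log

    def brute_force(target, candidate_ips):
        # The full IPv4 space is only ~4.3 billion values, and in practice
        # the candidate list is just the addresses seen during the raw-log window.
        for ip in candidate_ips:
            if pseudonymize(ip) == target:
                return ip
        return None

    print(brute_force(logged_value, ["198.51.100.1", "203.0.113.57"]))
    # -> 203.0.113.57

And even where the hash can't be inverted, a stable hashed identifier still lets records from the same user be linked together, which is exactly the linkability problem we're trying to address.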

I really hope we can make progress on this issue, as I'm genuinely sympathetic to real-world issues companies face and think allowing aggregated data is completely reasonable. But if the thrust of your point is that most companies wouldn't adopt a DNT standard that requires a non-superficial change to their logging practices, then I don't think those companies are taking DNT seriously enough to provide users with real privacy protection.

Dan

-- 
Dan Auerbach
Staff Technologist
Electronic Frontier Foundation
dan@eff.org
415 436 9333 x134
