
Re: what base text to use (was re: data hygiene approach / tracking of URL data) [for jmayer]

From: Nicholas Doty <npdoty@w3.org>
Date: Fri, 12 Jul 2013 15:58:56 -0400
Message-Id: <AD2BCEF4-8160-41C4-A1A0-D8EB804051A8@w3.org>
Cc: Jonathan Mayer <jmayer@stanford.edu>
To: "public-tracking@w3.org (public-tracking@w3.org)" <public-tracking@w3.org>
[Text is from Jonathan Mayer; sending to the list and submitting web form in order to make public and avoid questionnaire submission problems. —npdoty]

Objections to Option A:

1) Exclusions from "tracking" are textually limitless and allow for user profiling.

In the amended DAA proposal, "tracking" is scoped to "the domains or URLs visited across non-affiliated websites." Data that is not considered "tracking" would be exempt from use limitations, collection minimization, retention transparency, and even reasonable security.

Records of the following sort would be covered as tracking.

Cookie ID | URL                                                | Time
123       | http://www.webmd.com/hiv-aids/default.htm          | 7/11/13 (4:10pm PST)
123       | http://taxes.about.com/od/backtaxes/Back_Taxes.htm | 7/11/13 (4:13pm PST)
123       | http://sanfrancisco.gaycities.com/bars/            | 7/11/13 (4:15pm PST)
123       | http://www.wikihow.com/Quit-a-Job                  | 7/11/13 (4:19pm PST)

Cookie ID | Name           | Email               | Address        | ZIP
123       | Jonathan Mayer | jmayer@stanford.edu | 353 Serra Mall | 94305

But what about records like these, where the URLs have been modified by ROT13 and can be trivially recovered?

Cookie ID | URL                                                | Time
123       | uggc://jjj.jrozq.pbz/uvi-nvqf/qrsnhyg.ugz          | 7/11/13 (4:10pm PST)
123       | uggc://gnkrf.nobhg.pbz/bq/onpxgnkrf/Onpx_Gnkrf.ugz | 7/11/13 (4:13pm PST)
123       | uggc://fnasenapvfpb.tnlpvgvrf.pbz/onef/            | 7/11/13 (4:15pm PST)
123       | uggc://jjj.jvxvubj.pbz/Dhvg-n-Wbo                  | 7/11/13 (4:19pm PST)

Cookie ID | Name           | Email               | Address        | ZIP
123       | Jonathan Mayer | jmayer@stanford.edu | 353 Serra Mall | 94305
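The triviality of recovering these "modified" URLs can be shown in a couple of lines. A minimal sketch (the record value is taken from the table above):

```python
import codecs

# A record value from the example above: the URL is "obscured" with
# ROT13, yet anyone holding the record can recover it instantly.
obscured = "uggc://jjj.jrozq.pbz/uvi-nvqf/qrsnhyg.ugz"
recovered = codecs.decode(obscured, "rot13")
print(recovered)  # http://www.webmd.com/hiv-aids/default.htm
```

The transformation is its own inverse, so the data holder needs no key at all to restore the browsing record.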

Or records like these, where the URLs have been grouped, such that the user went to one of the first pair of URLs and one of the second pair of URLs?*

Cookie ID | URL                                                                 | Group
123       | http://www.webmd.com/hiv-aids/default.htm                           | 1
123       | http://www.nytimes.com/                                             | 1
123       | http://www.mayoclinic.com/health/hiv-aids/DS00005/DSECTION=symptoms | 2
123       | http://www.washingtonpost.com/                                      | 2

Cookie ID | Name           | Email               | Address        | ZIP
123       | Jonathan Mayer | jmayer@stanford.edu | 353 Serra Mall | 94305

Or records like these, where the URL has been reduced to a set of features?

Cookie ID | Webpage Features                             | Time
123       | Health, Self-Help, HIV/AIDS                  | 7/11/13 (4:10pm PST)
123       | Finance, Self-Help, Taxes, Back Taxes        | 7/11/13 (4:13pm PST)
123       | San Francisco, Gay, Drinking, Gay Bars       | 7/11/13 (4:15pm PST)
123       | Employment, Self-Help, Quitting, Job Hunting | 7/11/13 (4:19pm PST)

Cookie ID | Name           | Email               | Address        | ZIP
123       | Jonathan Mayer | jmayer@stanford.edu | 353 Serra Mall | 94305
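To make the point concrete, here is a hypothetical sketch of such a feature-reduction pipeline. The category lookup below is invented for illustration, not any real taxonomy; the point is that the per-cookie record still encodes sensitive interests even after the URLs themselves are discarded.

```python
# Illustrative URL-prefix -> feature mapping (hypothetical taxonomy).
CATEGORIES = {
    "www.webmd.com/hiv-aids": ["Health", "Self-Help", "HIV/AIDS"],
    "taxes.about.com/od/backtaxes": ["Finance", "Self-Help", "Back Taxes"],
}

def to_features(url):
    # Replace the raw URL with a feature list; the URL is not retained.
    for prefix, features in CATEGORIES.items():
        if prefix in url:
            return features
    return ["Uncategorized"]

record = ("123", to_features("http://www.webmd.com/hiv-aids/default.htm"))
print(record)  # ('123', ['Health', 'Self-Help', 'HIV/AIDS'])
```

No URL survives in the stored record, so under the proposal's text this would arguably not be "tracking" at all, yet the profile is just as revealing.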

The plain text of the DAA proposal would allow for all three of these practices.**  It does not define when URL data has been sufficiently altered to no longer constitute tracking.

Moreover, even supposing the DAA proposal were amended to require rigorous aggregation of website features, it would remain problematic for privacy.  The DAA design misses the forest for the trees: There is nothing *inherently* problematic about URL data. Rather, privacy risks flow from *what can be learned from* URL data.

Consider the following records, which include only highly aggregated interest segments.  Assume there is no reasonable way of mapping the data to URLs.

Cookie ID | Interest Segment
123       | HIV/AIDS
123       | Back Taxes
123       | Gay Bars
123       | Quitting Employment

Cookie ID | Name           | Email               | Address        | ZIP
123       | Jonathan Mayer | jmayer@stanford.edu | 353 Serra Mall | 94305

Under the DAA proposal, Do Not Track would allow a website to compile this sort of detailed dossier on a consumer—and keep it indefinitely, use it for any purpose, without transparency, and without security. We would be greatly deviating from both consumer expectations*** and policymaker preferences.

* For yet another related example, consider an implementation where each URL is assigned an independent probability of less than 0.5 that the user visited it.

** Oddly, one provision of the proposal would seem to prohibit any use of unique identifiers save for the "deidentified" and "permitted uses" exceptions.

> Outside the permitted uses or de-identification, the third party MUST NOT collect, 
> retain, or share network interaction identifiers that identify the specific user, 
> computer, or device.

My understanding is that this passage is to be interpreted as a drafting error.

*** See, for example:

2) The deidentification scheme is textually undefined, and Yahoo!'s proposal fails to rigorously protect consumer privacy.

Like non-tracking data, deidentified data is *entirely* exempt from use limitations, collection minimization, retention transparency, and reasonable security. In exchange for this extraordinary reduction in information practice constraints, one would expect deidentified data to be rigorously privacy-protective. By that yardstick, the DAA proposal falls far short.

The textual "deidentified" and "delinked" definitions are unworkably vague and self-contradictory. If data "cannot reasonably be re-associated or connected to a specific user," then how can it still be "internally linked to a specific user"?  How is this data capable of being "reverse engineered back to identifiable data"?  Why are "satisfactory written assurance[s]" required when this data is shared? Why can't this data be "purposely shar[ed] . . . publicly"? The DAA proposal provides no non-normative guidance to cut through this definitional fog.

What's more, the one purportedly compliant implementation that we have heard—Yahoo!'s red-yellow-green proposal—provides little privacy protection. In a mid-2011 blog post,* Arvind Narayanan provided a taxonomy of various ways in which pseudonymous tracking data might be identified, including information leakage and deanonymization. Replacing one unique identifier with another does *nothing* to mitigate these privacy risks: a website would still retain an identifiable browsing history.**

In addition, it will often be trivial to reconnect a pair of "red" and "yellow" unique identifiers. For example:
i) Guess the mapping algorithm (e.g. a hashing algorithm with no salt or predictable salt).
ii) Know the mapping algorithm (e.g. a known hashing algorithm and salt).
iii) Have access to a black-box implementation of the mapping algorithm (e.g. be able to input one unique identifier, get the other).
iv) Use deanonymization techniques to link the identifiers based on associated data.
Any privacy gain would necessarily depend on controlled access to both the deidentification system and various datasets. Put differently, the Yahoo! proposal reduces to mere "operational or administrative controls." If the NSA can't get those right, how are consumers supposed to trust, say, an analytics startup?
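Case (i) can be sketched concretely. Assuming, hypothetically, that the "red"-to-"yellow" mapping is an unsalted SHA-256 hash of the raw identifier, an observer who merely guesses the algorithm can rebuild the link with a simple dictionary attack:

```python
import hashlib

def pseudonymize(cookie_id):
    # A naive "deidentification": unsalted SHA-256 of the raw identifier.
    return hashlib.sha256(cookie_id.encode()).hexdigest()

# The observer enumerates candidate identifiers, hashes each one, and
# inverts the mapping for any "yellow" identifier it encounters.
yellow_id = pseudonymize("123")
candidates = {pseudonymize(str(n)): str(n) for n in range(1000)}
print(candidates[yellow_id])  # 123
```

With a predictable salt the attack is the same, just one enumeration step longer; only a secret, well-guarded salt (i.e., an operational control) stands between the two identifiers.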

* https://cyberlaw.stanford.edu/blog/2011/07/there-no-such-thing-anonymous-online-tracking

** If it would assist the co-chairs in their decision making, I would be glad to produce an example reidentification on data that has been deidentified under the Yahoo! proposal.

3) Websites have no obligation to adopt privacy-preserving technologies for permitted uses.

The DAA proposal omits any reference to privacy-preserving technologies. Where an alternative to present practices is available and accommodates consumer privacy concerns, why would we not encourage this win-win?

4) Websites have unfettered discretion to disregard a syntactically valid Do Not Track signal.

The text does not constrain when a website can ignore a "DNT: 1" header. Would a website that disregards all signals be compliant? What about most signals? What about a random subset of signals? There is neither normative line drawing nor non-normative guidance. Consumers cannot trust a Do Not Track system if a website can claim compliance but then pick and choose among headers.

Objections to Option B:

In its current form, I would not favor the June draft as a Do Not Track standard. Among other substantive concerns, many of which also apply to the DAA proposal:

1) Third-party websites may continue to collect a user's browsing history for enumerated "permitted uses." Instead of specially exempting particular present business models, we should delineate information practices by their privacy properties. See:

2) The definition of deidentified data is vague and potentially unenforceable. And yet, deidentified data is exempt from use limitations, collection minimization, retention transparency, and reasonable security. We must be much more precise given these implications of the definition. Non-normative text would be a good starting point. For example, Yahoo! has proposed a deidentification scheme—is it compliant? See:

3) Language on shifting away from unique identifiers is also ambiguous and potentially unenforceable. What does it mean for an "alternative solution" to be "reasonably available"? If privacy-preserving technologies are not presently required, how much would they have to improve to become required? Since the design space has already been well explored by computer scientists, would privacy-preserving implementations never be required?

4) The provisions on browser compliance are vague. I understand that we cannot reflect all possible future implementations in our text. But couldn't we at least be precise about present, popular implementations?  For example, is Internet Explorer 10+ compliant? See:

5) Service providers are under no obligation to use technical measures to silo their data, despite this being a present best practice and often having minimal impact on services.  See:

6) A website is textually unconstrained in disregarding facially valid "DNT: 1" signals. (Further discussion under Proposal A.) See:

7) The text provides an undefined loophole for "transient" information practices. See:

8) Websites are not sufficiently responsible for promptly detecting, mitigating, and reporting violations of the standard. See:

I also object to continuing from the June draft (Option B) on process grounds.

Our choice set is artificially constrained to two non-consensus documents: the June draft (the product of behind-the-scenes negotiating, with ambiguous authorship) and the DAA proposal. What happened to the longstanding, consensus Editor's Draft? What happened to the privacy advocates' EFF/Mozilla/Stanford proposal? What happened to the browser vendors' proposal coming out of Sunnyvale? Setting aside legitimacy concerns, there are at least two substantial effects of this choice architecture.

1) We are choosing between a (purportedly) middle-of-the-road text and an advertising industry-backed text. And not just any advertising text, by the way: a text with novel "non-tracking" and "deidentified"/"unlinked" exemptions that are far beyond what we've discussed previously. Going into this decision, then, the thumb is already on the scales against browser vendors and privacy advocates. They don't even have proposals on the table. But even supposing the co-chairs select the June draft, the advertising industry still comes out ahead. Proponents of the DAA proposal will (understandably) require that the June draft be amended to incorporate at least some degree of the provisions that they drafted. How could it be a consensus document otherwise? Even if the June draft is selected as the base text, then, we'll move towards some hybrid of the June draft and the DAA proposal. This smacks of a "heads we win, tails you lose" property for the browser vendors and privacy advocates.

2) Written submissions will indicate whether participants favor the DAA proposal or the June draft. They will not, however, indicate whether either proposal can achieve consensus in the working group. Put differently, the group is expressing which of the two texts is *more* acceptable. But the group is not determining whether that text *is* acceptable, or even *close* to acceptable. Given the two dozen open amendment topics on the June draft, for example, that document plainly does not reflect a working group consensus. Proceeding with the June draft may be less effective than other options for working towards agreement, such as resuming the consensus-based Editor's Draft.
Received on Friday, 12 July 2013 19:59:05 UTC
