Re: ISSUE-5: What is the definition of tracking? from Jonathan Mayer on 2011-10-25 (public-tracking@w3.org from October 2011)

From: Jonathan Mayer <jmayer@stanford.edu>
Date: Mon, 24 Oct 2011 17:18:31 -0700
To: Sean Harvey <sharvey@google.com>
Cc: "public-tracking@w3.org Group WG" <public-tracking@w3.org>
Message-Id: <2D2AF074-CC5A-45AE-826E-6B49FE0AA976@stanford.edu>

A few thoughts in response below.

Jonathan

On Oct 24, 2011, at 11:12 AM, Sean Harvey wrote:

> This is a really constructive conversation and I’m doing my best to keep abreast of it so that we can properly reflect it in the strawman doc with Justin & the chairs.
> 
> I have a couple of thoughts about things I think we should consider that I want to add briefly.
> 
> Defining tracking is trickier than one might think, and we should be attuned to the long-term ramifications of whichever approach we take. Currently we’re focused on exception use cases and the temptation is to essentially define “tracking” as “everything but x”. Should we continue with this approach there are two issues we need to be aware of:
> 
> 
> 
> This will sound slightly pedantic, but there is a danger of forgetting something obvious or basic in the list of exceptions. For example, referrer URLs have been mentioned, and there are other obvious examples of cross-site data sharing: HTTP headers, TCP/IP handshakes, etc. These are examples of cross-site data sharing with your browser that do not uniquely identify you to the server you’re interacting with (though the issue of uncommon HTTP headers was briefly raised by EFF), but are data sharing nonetheless.
> 
> I think it’s therefore important to add definitionally that we are talking about pseudonymous (or personal) identification of an individual, an individual browser instance, or an individual device for some business or other purpose.

I believe the "aggregate data" exception gets at this issue. When we briefly discussed it in Cambridge, there appeared to be some support for an exception that encompasses (to a very rough approximation) data that could not reasonably (with some strict bounds on "reasonably") be used to compile a user's cross-site browsing history. Here's the language we used in our IETF Internet-Draft for an exception of this sort:

>    3.  Data that is, with high confidence, not linkable to a specific
>        user or user agent.  This exception includes statistical
>        aggregates of protocol logs, such as pageview statistics, so long
>        as the aggregator takes reasonable steps to ensure the data does
>        not reveal information about individual users, user agents,
>        devices, or log records.  It also includes highly non-unique data
>        stored in the user agent, such as cookies used for advertising
>        frequency capping or sequencing.  This exception does not include
>        anonymized data, which recent work has shown to be often re-
>        identifiable (see [Narayanan09] and [Narayanan08]).
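
As a rough illustration of the "statistical aggregates of protocol logs" part of that language, here is a minimal Python sketch. It assumes hypothetical log records with a "url" field and a hypothetical suppression threshold (min_count); neither comes from the draft. Dropping small counts is only one example of a "reasonable step" and would not by itself meet the "high confidence" standard.

```python
from collections import Counter


def aggregate_pageviews(log_records, min_count=10):
    """Count pageviews per URL and suppress rarely seen URLs.

    min_count is an illustrative threshold, not a value from the draft.
    Only per-URL totals are returned; no per-user or per-record data.
    """
    counts = Counter(record["url"] for record in log_records)
    return {url: n for url, n in counts.items() if n >= min_count}


# Hypothetical log records: twelve views of one page, one view of another.
logs = [{"user_agent_id": f"ua{i}", "url": "/home"} for i in range(12)]
logs.append({"user_agent_id": "ua3", "url": "/rare-page"})

stats = aggregate_pageviews(logs)
print(stats)
```

The single view of "/rare-page" is dropped because a count of one could reveal information about an individual log record.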


Looking at the issue tracker, it appears we may have folded the aggregate data exception (ISSUE-34) into the outsourcing exception (ISSUE-23).  I see the two exceptions as conceptually distinct and would very much support distinguishing them.

I would strongly oppose limiting our definition of tracking to cover only pseudonymously identified or personally identified data. There are a number of ways to track a user across websites without a single pseudonymous or personal identifier.

> The other danger of an exceptions-based definition of “tracking” is that it is highly restrictive of future business models in potentially unpredictable ways. Two years ago we would not be considering definitions of “first party” that may or may not include embedded video content from YouTube or like buttons from Facebook; and it is possible that we would have collectively written an exceptions-based standard that didn’t work very well in this new landscape. It’s therefore worth at least discussing if we want the definition to identify what we are trying to address outside the context of the exceptions – NOT that we make the same mistake on the other end by creating a harms-based definition, but that we quantify the harms we are trying to address and tailor our definition of tracking to them to a degree.

I'm very sympathetic to wanting to discuss the policy motivations underlying the definitions we establish. But I'm concerned that, in practice, the discussion would be a rat hole for the working group. There's just too much material to cover, and there are some significant differences of opinion that would take far longer to iron out in the general case than in the context of specific definitions. We trended towards an unproductive general policy conversation in Cambridge, in some measure at my prompting; in retrospect I think the co-chairs were wise to move on.

As for future business models, the extent to which they're problematic flows directly from our definitions. Suppose we had defined Do Not Track a decade ago, back when third-party web tracking was first raising eyebrows. If we had defined Do Not Track to cover collection of data across websites, as I believe we should, then application to the Like button and embedded YouTube clips would have been fairly straightforward. If, on the other hand, we had defined Do Not Track as covering only the current practice of using tracking data to target ads, then the Like button and YouTube embeds would have opened whole new debates.

> The dialogue on the Issue 5 email chain has only sometimes reflected one of the important conversations we had in Cambridge, and that was that cross-entity data sharing is a more foundational concern than the first/third party distinction, which is really just an imperfect short cut to the former.

I don't follow this point. The first party vs. third party distinction has, in my understanding, been an attempt to carefully define the sort of organizational boundaries that give rise to privacy concerns. I haven't viewed the definition as a shortcut in any sense - it does a lot of work.

> My opinion at this stage (though I’m certainly open to persuasion) would be that we need to note the following issues here:
> 
> ·         Should first parties be exempted only to the extent that they do not combine their data with individual-level data from third parties? If I’m a first party and I see a DNT header, should I still be restricted from adding data collected from a DNT-passing customer to individual-level data from an offline third party company’s database? Should I be allowed to append it or combine it with data collected with individual-level offsite data I have purchased?
> 
I see at least four related scenarios with offline databases.

1) A first party sends what it knows about a user to an offline data aggregator.  (The longstanding data aggregator model.)
2) A first party queries a data aggregator for information about a user and uses the response internally.
3) A first party queries a data aggregator for information about a user and shares the response (or some subset of the response) with a third party (e.g. acting as an advertising data provider).
4) A first party sends identifying information about a user to a third party, which then uses it to pull in data from or send data to an offline data aggregator.  (The Datalogix and BlueCava privacy policies provide some details: http://datalogix.com/privacy/ and http://www.bluecava.com/privacy-policy/.)

I would support earmarking this as a longer conversation that, while certainly important, can wait until we've settled more pressing definitional matters. Offline data issues are somewhat separate from the online data issues we'll spend most of our time discussing.

> ·         By the same token there are instances where a "third party domain" is used by a publisher as a software tool for analytics or ad serving. If that third-party tool is combining data from multiple sites, then obviously that falls under the definition of “tracking”. But what if it is merely a software tool for the first party’s use only & the first party is the sole data owner?  In this instance it is probable that the third party software tool is "first party" under the current DNT definition options, which again emphasizes we are focused on cross-site data sharing rather than the first/third party distinction.
> 
I believe the "outsourcing" exception (ISSUE-23) addresses just this scenario.  Shane and I drafted possible language a few weeks ago for ACTION-5.

http://lists.w3.org/Archives/Public/public-tracking/2011Oct/0009.html
http://lists.w3.org/Archives/Public/public-tracking/2011Oct/0000.html
Received on Tuesday, 25 October 2011 00:19:11 UTC