'do not cross-site track' response to Aleecia's outline

From: David Singer <singer@apple.com>
Date: Fri, 06 Apr 2012 21:48:03 -0700
Message-id: <EDA27F6C-7B5F-409D-A2FA-6C288C785136@apple.com>
To: "public-tracking@w3.org Group WG" <public-tracking@w3.org>
Friends

this is my 'homework' response.  I am still not sure if I *advocate* this, but I do see the advantages, and think it is worthy of discussion.  I am concerned that it represents a major change of basis, and might cause delay as we review it.  Indeed, pressure of time means it hasn't even had broad review inside my own company.

- - - - - - - -

Contributors to this proposal:  Dave Singer, inspired by Roy (who is nonetheless blameless, and was on vacation so wasn't even asked to help edit)

Basic Concept:

Instead of trying to define 1st/3rd parties, we abandon the 1st/3rd distinction, and instead define tracking in such a way that the 1st/3rd party distinction is irrelevant.  (We still need a definition of 'party' in general, which this does not address.)  Basically, we restrict "cross-site" tracking.

Draft definition of tracking (the "tunnel vision"): 
  "Tracking is the retention by a party (site), 
      -- after a user's transaction is complete (served), 
      -- of data records that can or do associate that user with either 
          a) any other party (site), or 
          b) with data not collected from the user's direct transaction with the party (site) performing the transaction."

This says "party (site)" because, as today, we need to permit data to flow as long as obligations and liability flow with it, within an organization.  This document does NOT have specific proposals on how to manage that data flow, or how to define "party" or "site". That problem remains.

So this definition allows:
* knowing about another site *during* a transaction (e.g. 'please supply an advert to the BogVille Chronicle')
* retaining records of what happens between your site and someone who requests data from it ('Dave was served an ad for dishwashers') ("tunnel vision")
* retaining the results of user interaction with any site, for those sites to remember that interaction and its results
* using real-time data for targeting (e.g. geo-location from IP address, determining time-of-day at that location, and so on), *during* the transaction
* retaining *separate* records related to the user ('Dave was served an ad') and the site ('an ad was served for the BogVille Chronicle site') as long as these records are unlinked and unlinkable

It does NOT allow:
* exampleAd.com remembering Dave was on the BogVille Chronicle site and that's why Dave was served a dishwasher ad
* retaining the full source URL the BogVille Chronicle used to get the ad, when that URL conveys information from, or identifying, another site, or info passed about the user
* retaining referer information
* combining the data from a user transaction with other data to work out who the user is, or facts about the user, and then retaining that
* social widgets recording your browsing history (without permission)
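
As an illustration only, the draft definition and the two lists above could be read as a test applied to each record a party retains. The sketch below assumes a hypothetical record shape (RetainedRecord and its field names are not part of the proposal) and checks single records only; it cannot establish that two separate records are 'unlinked and unlinkable'.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetainedRecord:
    """One record kept after the transaction is complete (hypothetical shape)."""
    holder: str                            # the party (site) retaining the record
    user_id: Optional[str] = None          # anything that can identify the user
    other_sites: frozenset = frozenset()   # parties (sites) named in the record
    external_data: bool = False            # data not from the direct transaction


def is_tracking(record: RetainedRecord) -> bool:
    """Apply the draft definition to a single retained record."""
    if record.user_id is None:
        return False  # nothing associates this record with a user
    names_other_party = any(s != record.holder for s in record.other_sites)
    # (a) user linked to another party, or (b) user linked to outside data
    return names_other_party or record.external_data


chronicle = frozenset({"bogville-chronicle.example"})

# 'Dave was served an ad for dishwashers' (tunnel vision): allowed
served = RetainedRecord(holder="exampleAd.com", user_id="dave")
# 'an ad was served for the BogVille Chronicle site', no user: allowed
site_only = RetainedRecord(holder="exampleAd.com", other_sites=chronicle)
# Dave remembered as having been on the BogVille Chronicle: not allowed
linked = RetainedRecord(holder="exampleAd.com", user_id="dave",
                        other_sites=chronicle)
# Dave's record combined with data from elsewhere: not allowed
enriched = RetainedRecord(holder="exampleAd.com", user_id="dave",
                          external_data=True)
```

Under these assumptions, is_tracking returns False for the first two records and True for the last two, which matches the 'allows' and 'does NOT allow' examples.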

The big wins:
* no more squirrelly language on trying to guess the site the user thought they were visiting (1st party)
* no more squirrelly language on what constitutes 'interaction' to get promoted from 3rd to 1st party
* the specification is formally 'testable' without the 1st/3rd judgment call (given access to a site's retained information)
* no more worrying about re-directs being (from the browser's point of view) 1st, but 3rd from the user's point of view
* no more worrying about embedded/framed sites, or mash-ups; who is the 1st party and who are 3rd?
* we don't have to worry about raw log files that could easily be converted into user-tracking (as all the data is there); the restriction is a minor one on what gets logged in the first place
* no restricting 'ordinary' logging practices even for sites that offer embeddable content (e.g. a web badge, an embeddable widget) as long as they take care to omit either or both of
     a) information that could identify another site
     b) information that could identify a user

I think that the last is huge: for 'ordinary sites' the amount they have to do to comply is proportionate to the amount of logging they do that is 'cross-site'. If their logs remember only about their own site, they're statically fine. In general, the amount of work for anyone to comply is proportionate to the amount of logging of user AND other-site info that they do.

For re-direction services, there may be work to do, since almost any logging (e.g. of the URLs) would involve identifying another site (so this means they had better not remember data that can identify users).
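
A minimal sketch of what that compliance work might look like for a log writer, assuming the site can list which of its log fields identify a user and which identify another site. The field names and the scrub function are hypothetical, chosen only for illustration.

```python
# Assumed classification of log fields; a real site would define its own.
USER_FIELDS = {"ip", "user_id", "cookie"}
OTHER_SITE_FIELDS = {"referer", "source_url", "target_url"}


def scrub(entry: dict, drop: str) -> dict:
    """Return a copy of a log entry with one class of fields removed.

    drop="user"       removes fields that could identify a user
    drop="other_site" removes fields that could identify another site
    drop="both"       removes both classes
    """
    if drop not in ("user", "other_site", "both"):
        raise ValueError("drop must be 'user', 'other_site' or 'both'")
    removed = set()
    if drop in ("user", "both"):
        removed |= USER_FIELDS
    if drop in ("other_site", "both"):
        removed |= OTHER_SITE_FIELDS
    return {k: v for k, v in entry.items() if k not in removed}


# An embeddable widget keeps its own view of the user, drops the embedding site.
widget_entry = {
    "time": "2012-04-06T21:48:03",
    "ip": "192.0.2.7",
    "referer": "http://bogville-chronicle.example/news",
    "path": "/badge.png",
}
widget_log = scrub(widget_entry, drop="other_site")

# A re-direction service keeps the URLs, so it drops what identifies the user.
redirect_entry = {
    "time": "2012-04-06T21:48:04",
    "ip": "192.0.2.7",
    "cookie": "abc123",
    "target_url": "http://shop.example/dishwashers",
    "status": 302,
}
redirect_log = scrub(redirect_entry, drop="user")
```

A site whose logs never contain other-site fields passes its entries through unchanged with drop="other_site", which is the 'nothing to do' case for ordinary sites.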



Part I: Parties

	A. A party is
            whatever we define it as.  This document doesn't address that question.

	B. A first party is a party.
	C. A third party is a party.

            There is NO first/third distinction.

Part II: Business uses /* or whatever we wind up calling this -- feel free to suggest something different */

	Note: unless you specifically document otherwise, this section is understood to ONLY APPLY TO THIRD PARTIES.
This section applies to everyone, as there is no 1st/3rd party distinction.

	For each of the seven potential business uses below, please indicate if:
		A. this particular use is never allowed under DNT
		B. this particular use is allowed with retention limits (describe)
		C. this particular use is allowed without retention limits (describe any other limitations)

For any permission in our specification to go beyond the definition, if the data ever gets used for a purpose other than the exception, that's a non-permitted use, and laws (e.g. liability) may apply.

Explicit standard permissions needed:
* outsourcing, as before;  if your records involve data that is about another site/party, the data is only available for use by that other site/party (and hence, not by you)
* user-granted exceptions: (e.g. "you may track my visits to other sites while I am logged on to TrackMyReading.com")


	1. Frequency Capping - A form of historical tracking to ensure the number of times a user sees the same ad is kept to a minimum. 
		C.  (As long as you only remember data about the ad-serving site, which is all you need.)
                     (also true for story-boarding)

	2. Financial Logging - Ad impressions and clicks (and sometimes conversions) events are tied to financial transactions (this is how online advertising is billed) and therefore must be collected and stored for billing and auditing purposes.
		C.  As long as you take care to lose either the user-identifying information (e.g. IP address, user ID, and so on), or the other-site information, or both.

	3. 3rd Party Auditing - Online advertising is a billed event and there are concerns with accuracy in impression counting and quality of placement so 3rd party auditors provide an independent reporting service to advertisers and agencies so they can compare reporting for accuracy.
		If user-information and other-site information are both recorded, then B, else C.  Retention until the audit has been performed (?).

	4. Security - From traditional security attacks to more elaborate fraudulent activity, ad networks must have the ability to log data about suspected bad actors to discern and filter their activities from legitimate transactions. This information is sometimes shared across 3rd parties in cooperatives to help reduce the daisy-chain effect of attacks across the ad ecosystem.
		B, I suspect, as both user-identifying and other-site information will be recorded.

	5. Contextual Content or Ad Serving: A third-party may collect and use information contained with the user agent string (including IP address and referrer URL) to deliver content customized to that information.
		C.  Real-time data in the transaction is all fair game. It's retention that's not.

	6. Research / Market Analytics
		C.  As long as you use data you learned directly from the user, and not about or derived from another site or elsewhere.

	7. Product Improvement, or, more narrowly, Debugging
		C.  As long as you use data you learned directly from the user, and not about or derived from another site or elsewhere.
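
To make the answer to item 1 (frequency capping) concrete, here is a rough sketch of a cap that remembers only data about the ad-serving site itself: which ad was shown to which user, and how often. The FrequencyCapper class and its parameters are hypothetical; the point is only that the publisher site is available during the transaction and is never written to the retained state.

```python
class FrequencyCapper:
    """Caps impressions per (user, ad) without remembering the publisher."""

    def __init__(self, max_impressions: int):
        self.max_impressions = max_impressions
        self.counts = {}  # (user_id, ad_id) -> impressions served so far

    def serve(self, user_id: str, ad_id: str, publisher_site: str) -> bool:
        """Return True if the ad is served, False if the cap is reached.

        publisher_site may inform the choice of ad during the transaction,
        but it is deliberately not stored anywhere.
        """
        key = (user_id, ad_id)
        if self.counts.get(key, 0) >= self.max_impressions:
            return False
        self.counts[key] = self.counts.get(key, 0) + 1
        return True


cap = FrequencyCapper(max_impressions=2)
results = [
    cap.serve("dave", "dishwasher-ad", "bogville-chronicle.example"),
    cap.serve("dave", "dishwasher-ad", "another-site.example"),
    cap.serve("dave", "dishwasher-ad", "bogville-chronicle.example"),
]
```

With a cap of two, the third request is refused, and the retained state holds only the pair ("dave", "dishwasher-ad") and its count.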



David Singer
Multimedia and Software Standards, Apple Inc.
Received on Saturday, 7 April 2012 04:48:53 UTC
