Re: [Issue-5] [Action-77] Defining Tunnel-Vision 'Do Not (Cross-Site) Track' from Roy T. Fielding on 2012-02-03 (public-tracking@w3.org from February 2012)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Thu, 2 Feb 2012 18:25:56 -0800
To: David Singer <singer@apple.com>
Cc: "public-tracking@w3.org (public-tracking@w3.org)" <public-tracking@w3.org>
Message-Id: <F9322842-158B-4E1D-9549-06D0C36D9476@gbiv.com>
On Jan 29, 2012, at 8:15 AM, David Singer wrote:

> This is a revision of my previous email, and a response to Action-77, which is one of 6 (?) actions related to Issue-5.  Please ask questions as needed to clarify, and I will write a composite revised definition, so we can close Action-77, and (once that's been done for the other formulations) Issue-5.
> 
> This is an alternative to restricting tracking via a 1st/3rd party distinction. I want to emphasize, I am doing this to explore and learn, not to 'promote' any particular direction.  I hope people find it helpful.
> 
> (All these definitions etc. rely on being able to define "site" or "party", by the way.  I don't see how to escape that, as many have pointed out, since it's within a 'party' that information flows, and so on.)
> 
> 
> RULE
> 
> Informally, we allow sites only to record what they do and learn *directly* about the interaction between themselves and the user. 
> 
> The formal rule is this:
> 
> When DNT is on (1):
> Data records that both identify, or could identify, a single USER, and also identify, or could identify, a single SITE (that is part of a Party),
> * MUST identify or be capable of identifying no other Party, or site that is part of any other Party;
> * MUST be derived only from transactions directly between the identified Party and the user, possibly combined with publicly available data, 
> * MUST be available/accessible only to/by the identified Party,
> * MUST NOT contain user-specific non-public information derived or passed, directly or indirectly, from any other Party,

I disagree with the way that this solution is being described.
I don't see why you've added Party all over it.

In particular, the notion that the data collected must not identify
any other site, in general, won't work very well because referrals
are essential and it is very difficult to control data that the
user might enter in a text dialog.  I think we should specifically
constrain referral data alone (as provided in URI, Referer, or Origin)
and have the constraint be about operational use limits rather than
collection, and that retention be limited to operational needs.
Hence, a frequency capping ad server that doesn't need referral data
cannot retain it other than in aggregate form, but an ad auditing
service can retain it in a form that might be associable with a user
for only a limited time (after which the record must be anonymized
or otherwise disassociated from any user).
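
To make the referral-data constraint concrete, here is a minimal Python
sketch of retention driven by operational need. Everything in it is an
assumption made for illustration: the purpose names, the 30-day window,
and the Record fields are hypothetical and are not taken from any
proposal or existing system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical retention windows in seconds, keyed by operational purpose.
# None means referral data may be kept only as an aggregate count.
RETENTION_SECONDS = {
    "frequency_capping": None,
    "ad_auditing": 30 * 24 * 3600,  # assumed 30-day window
}


@dataclass
class Record:
    user_id: Optional[str]   # None once disassociated from the user
    referrer: Optional[str]  # as provided in URI, Referer, or Origin
    created_at: float


referrer_counts = Counter()  # aggregate, non-identifying tally


def ingest(purpose, user_id, referrer, now):
    """Store a DNT:1 request according to the purpose's operational need."""
    referrer_counts[referrer] += 1
    if RETENTION_SECONDS[purpose] is None:
        # This purpose does not need referral data: keep it only in aggregate.
        return Record(user_id, None, now)
    return Record(user_id, referrer, now)


def expire(record, purpose, now):
    """Disassociate the record from the user once the window has passed."""
    window = RETENTION_SECONDS[purpose]
    if window is not None and now - record.created_at > window:
        return Record(None, record.referrer, record.created_at)
    return record
```

In this sketch the frequency-capping path never stores a per-user
referrer, while the auditing path stores one and drops the user
association after the assumed window.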

Likewise, there is no need to talk about Party.  There is a service
to which the user provided data.  The user has given consent to that
service to make use of that data.  Party is irrelevant.  Owner is
irrelevant.  Operator of the service is only relevant to the extent
that they are the ones responsible for adhering to the constraints.

If the operator controls more than one service, they must silo the
data between those services when DNT is enabled.  Note, however,
that a shared authentication service, when used to login to
multiple sites, might be the source of additional consent for
sharing data across sites if that *option* is provided to the user.
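
One way to picture the siloing requirement is the Python sketch below.
It assumes a single operator holding per-service silos and a per-user
opt-in flag standing in for the consent a shared authentication service
might collect; the class and method names are hypothetical.

```python
class OperatorData:
    """Per-service silos for one operator running several services."""

    def __init__(self):
        self._silos = {}             # service name -> {user_id: [values]}
        self._share_consent = set()  # user_ids who opted in to sharing

    def grant_sharing(self, user_id):
        """Record that the user chose the cross-service sharing option."""
        self._share_consent.add(user_id)

    def put(self, service, user_id, value):
        """Store a value in the silo of the service that collected it."""
        self._silos.setdefault(service, {}).setdefault(user_id, []).append(value)

    def get(self, requesting_service, owning_service, user_id, dnt_on):
        """Read a user's data, refusing cross-service reads under DNT."""
        cross_service = requesting_service != owning_service
        if cross_service and dnt_on and user_id not in self._share_consent:
            raise PermissionError("cross-service access blocked under DNT")
        return list(self._silos.get(owning_service, {}).get(user_id, []))
```

A service can always read its own silo; a read from a sibling service
succeeds under DNT only after the user has opted in.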

> If the data is held by another party on behalf of the identified party, that holding party MUST have no rights to use the data.

Too much partying.

Data collection may be contracted to some entity other than
the site operator (just as site operation may be contracted to some
entity other than the domain owner).  Such outsourced operations
are considered to be the *same site* if the data collected is siloed
to that site, is controlled by the same entity that controls the site,
and the data processor acting as that site's agent is contractually
bound to do all the things (as previously discussed under outsourcing)
that make them a data processor and not a data controller.

> Records derived when DNT is on (1), MUST be held separately from other data derived when DNT is not on (1).

That's not possible in general.

I'm not sure what that is trying to accomplish.  If it is just to
prevent re-identification, then simply "MUST NOT be combined with
records collected when DNT is not enabled" should be sufficient.
Or perhaps, reading on, what you mean is that "a site that retains
user-specific data MUST distinguish users with DNT enabled from
users with DNT not enabled, such that they are considered different
users and their associated records are never combined."  Note that
this will have an effect on user experience, though I think it
is a reasonable one.
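
As a rough illustration of that second reading, the Python sketch below
keys stored records on both the user identifier and the DNT state, so
the same browser with DNT on and with DNT off looks like two unrelated
users. The in-memory store and the function names are hypothetical, and
any header value other than "1" is assumed to mean DNT is not enabled.

```python
from collections import defaultdict

# Records are keyed by (user identifier, DNT state), so DNT-on and
# DNT-off activity from the same user never lands in the same list.
_store = defaultdict(list)


def record_event(user_id, dnt_header, event):
    """Append an event under a key that includes the DNT state."""
    dnt_on = dnt_header == "1"
    _store[(user_id, dnt_on)].append(event)


def history(user_id, dnt_header):
    """Return only the events recorded under the same DNT state."""
    return list(_store[(user_id, dnt_header == "1")])
```

The user-experience effect mentioned above shows up directly here: a
user who turns DNT on sees none of the history built up with DNT off.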

> EXCEPTIONS
> 
> not needed:
> 
> Outsourcing exception: not needed, it's part of the rule in the first place.
> 1st-party exception: not needed: all sites/parties are allowed to remember the user's interactions with them.
> Unidentifiable data exception: not needed, as the definition here only concerns user-identifiable data in the first place (which can probably be true for all rule sets)
> Operational exceptions:
>  frequency capping, story-boarding: not needed; the ad site is permitted to remember what IT served YOU, just not a lot of why (which 1st party you were on, etc.)
>  financial logging: separate un-identified records can be kept on the number of impressions on a 1st-party site (why is this not true for all proposals?)
>  3rd party auditing: again, is it necessary to keep a record that identifies a specific user?

That's a good question for the third-party auditing folks.
I wouldn't think so, though they might want to retain a site-specific
hash of the IP address.
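
A site-specific hash could look something like the Python sketch below,
which uses HMAC-SHA-256 from the standard library. The function name and
the idea of one secret key per audited site are assumptions made for
illustration, not a description of what any auditing service does.

```python
import hashlib
import hmac


def site_specific_ip_hash(ip_address: str, site_key: bytes) -> str:
    """Return a keyed SHA-256 digest of the IP address for one site."""
    return hmac.new(site_key, ip_address.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the key differs per site, the same address yields unrelated
digests on different sites, so the records cannot be joined across
sites by the hash alone. The key has to stay secret, since the IPv4
address space is small enough to enumerate.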

> potentially needed:
> 
> Operational exceptions:
>  security/fraud: an exception may be needed here, especially if cross-site fraud is to be detected

Yes, though the records kept are not intended to identify a user --
they are actually trying to identify non-users or fraud-intending users.
Hence, they look for patterns in the data that would be inconsistent
with a financially-disinterested human using a browser, such as that
of a bot farm, low-paid labor, or misdirected traffic.  It is very
hard to do that without retaining the same data that could identify
a user.  However, use of that data could be limited, as could the
retention period for user-identifiable data that has not been
flagged as fraudulent.

>  research/market-analytics: we don't have a current formulation, and the title is broad enough to allow almost anything, so I can't tell

AFAIK, that can be limited to non-identifying aggregates for
all of the tracking cases.

>  product improvement: this is an issue, again with a serious risk of slippery slope

Likewise, can be limited to non-identifying information, though
for this case we don't need an exception for same-site product
improvement.

>  debugging: yes, an exception may be needed for debugging

I don't know.  In practice, an error log is separate from an
access log, though the two can be combined.  I don't see why
we'd need such an exception since the user is not being tracked
across sites, but can be tracked within a site, and I don't see
the need for cross-site debugging of real DNT users.

> Legal exception: tracking to the extent required by law
> 
> Comments on TUNNEL-VISION 
> 
> If a user runs sometimes with DNT:0 and sometimes DNT:1, they will end up with two records at sites, one with a lot of other-site data, and the second record with tunnel-vision.  Correlation by the site would enable merging these; this is the weakest aspect of this strawman, IMHO.  Under the alternative 'cross-site' formulation, I think each site would keep N+1 records (1 for when DNT is off, and N for the number of 1st party sites 'seen' by this 3rd party for this user).
>
> Frequency capping and storyboarding by advertisers are now permitted; you ARE allowed to remember what ad you showed this (anonymous) user, since that was *your* transaction.  You're limited in remembering only site-generic 'why' -- you cannot remember 'they visited Sears and so I showed a dishwasher advert'.  
> 
> If the user starts interacting with *you*, you can remember that also; we don't need language to make this an exception, or 'promotion' from 3rd to 1st party.
> 
> Redirection services can remember basically only that the user was active on the web, since everything else they know (the original URL, the re-direct) either identifies or could be used to identify another site.
> 
> The attraction of this rule is that many fewer exceptions are needed.  The downside of this formulation is that it relies on sites not to re-correlate the records, though there is still a lot of data that cannot be recorded.

Note that this is true, in general, of all the suggested definitions
for tracking.  We have no ability to prevent bad actors.  We can only
state the constraints and hope that regulators and journalists deal
with those that fail to adhere to the constraints.  If one of the
constraints is that a site MUST NOT correlate DNT-on data with DNT-off
data, then that is just as effective as a constraint that says a
third party cannot collect the same data.  An evil party will, regardless.

....Roy
Received on Friday, 3 February 2012 02:26:22 UTC