Re: ISSUE-5: Consensus definition of "tracking" for the intro? from Roy T. Fielding on 2013-10-16 (public-tracking@w3.org from October 2013)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Wed, 16 Oct 2013 01:52:46 -0700
To: David Singer <singer@apple.com>
Cc: "public-tracking@w3.org (public-tracking@w3.org)" <public-tracking@w3.org>
Message-Id: <2CB50839-7865-45D3-89BA-D33481B42412@gbiv.com>

On Oct 15, 2013, at 2:30 PM, David Singer wrote:

> "Tracking is the retention or use by a site outside the first party, after a network transaction is complete, of data that is, or can be, associated with a specific user, user agent, or device."

Well, that is better, but again runs into the problems that I described
before:

  1) "first party" is itself a defined term, depends on the
     context of a user's action prior to the current request,
     and would have to include service providers.

  2) data is often retained "after a network transaction is complete"
     just for the sake of caching

  3) we need to include sharing, as suggested by the FTC commissioner
     in response to my question at the DC F2F.

In addition, "by a site outside" seems to assume that tracking is
done by a site (instead of by some party with access to data collected
in a given interaction).

BTW, your suggestion would be more readable as

  Tracking is the retention by a third party of data that can be
  associated with a particular user.

(i.e., use doesn't matter if we aren't constraining it for the
current interaction, and use isn't possible for prior interactions if
that data cannot be retained, and the only reason user agent and
device matter is because they can be associated with a user.)

I can see why you are suggesting this as a summary of what the
compliance spec is all about.  However, it does a poor job of
corresponding to what a user would think of as tracking.

In comparison, the short definition that I posted

  Tracking is the observation of a particular user's browsing activity
  across multiple distinct contexts and the retention, use, or sharing
  of data derived from that activity outside the context in which it
  occurred.

does not limit the definition to a particular software or role,
a specific number of protocol interactions or requests, or a form
of data that remains associated with a particular user.

...

On Oct 15, 2013, at 4:22 PM, David Singer wrote:
> On Oct 15, 2013, at 15:55 , "Roy T. Fielding" <fielding@gbiv.com> wrote:
>>> Yes, under some definitions they are:  if they (as likely) keep records, they are remembering data about you.  But they are a first party, so they get a big carve-out.
>> 
>> There are no carve outs in definitions.  The fact that a first party
>> has a big carve out in compliance is evidence that the definition is
>> wrong: users do not consider it to be tracking when a website they
>> intentionally use has retained data about their past use.
> 
> No, it *is* tracking data, the first party *does* have some restrictions on what they can do with it. A scope that establishes "this spec concerns broadly data in category X" and then says what restrictions there are for various people on that data is perfectly normal.  But I can maybe live with a definition that excludes it, if it makes it easier for you (as earlier offered).

That doesn't make it tracking data.  Yes, the compliance spec has those
restrictions, but they exist to prevent a third party from receiving data
that it might use for tracking the user if it receives the same kind of
data from multiple sites.  My proposal accounts for that in a more
straightforward way.

>>> No, it omits (a) data used to service transactions (within the interaction) and (b) data not associated with a specific user etc.  That's a lot of data.
>> 
>> No, what it omits isn't relevant because the data will be retained.
>> IP addresses get stuck at all layers.  Any application that involves
>> security has an audit trail.  Every first party website has an access log.
> 
> We have a security permission, and a first-party permission, and a raw data permission. It is tracking data, but you can keep it under some restrictions.

We have those permissions for third parties.  Here we are talking
about data collected by the first party.

>>> No, they are anonymous *to the organizations that they didn't choose to interact with, and for the most part are unaware of*.  We *have* a first-party carve-out, long-since agreed.  But even the first party has some restrictions on what it does with 'tracking data' (like, not sharing it around).
>> 
>> Right, so we need to define tracking in a way that corresponds to
>> what DNT intends to turn off. Other definitions would intentionally
>> mislead users.
> 
> Well, we also need to be careful not to mislead ourselves or site implementers.  They read specifications; users typically do not.

My intention is that the definition, once agreed, be presented
consistently by all implementations.  I don't expect a user to read
the specification.

>> You have not indicated that there is anything wrong with my proposal.
> 
> 1. Who is doing what 'across multiple distinct contexts'?  This is an undefined part of your definition.  Yes, that may be the aggregate effect, but what we need to know (the users, and the sites, need to know) is whether 'this single possible action by a site' is within the definition of tracking or not.

The user's browsing activity is observed across multiple distinct
contexts.  It means that observing the user's activity only within
a single context is not tracking.  The reason it is there is because
the verb tracking and the privacy concern we are trying to address
are both about identifying the trail of an individual as they
proceed from place to place.  Specifically, remembering that a
person was at a single place is not tracking unless that memory
is shared with someone else or combined with memories of other
places.

"who" is doing the tracking is not important with this definition,
as one would expect if they were looking up the term in a dictionary.

> If I leave it out, is the definition worse? If so, how?

It changes the scope of data that the reader will expect to
be subject to the constraint.

> "Tracking is the act of following a particular user's browsing activity, via the collection or retention of data that can associate a given request to a particular user, user agent, or device, and the retention, use, or sharing of data derived from that activity outside the context in which it occurred."

It means that the first party examples I gave earlier are considered
tracking, whereas my definition excludes them because the observations
are limited to a single party's context.

> 2. This definition does not exclude, as mine does, the use of data to answer the request, i.e. it doesn't have a clear idea of when "tracking" starts.  I think it starts after you've satisfied the request (HTTP request, and its response).  I would put back in "after a network transaction is complete".  Even if 'retain' has that meaning, it leaves it ambiguous whether I can use the data to respond to you (and we may as well be clear that you may).

Responses are always in the same context, so that isn't a concern
under my definition.

> 3. This leaves off the table conclusions that the site can draw about the user.  So, imagine I detach the actual request log from the user, somehow, so they are no longer connected, but I remember
> * Roy was in California, online, and visiting the web at 3pm pacific on Sept 25th
> * Roy is interested in recipes that use brown lentils
> * Roy is able to visit sites that offer alcohol for sale, and buy at them; he's probably an adult

That is all data derived from the observed activity.  If it is used only
within the context in which it occurred, then no problem; otherwise, it
becomes tracking when used, retained, or shared outside the context in
which that activity occurred (because that makes a track).

I've gone back and forth regarding whether the definition of tracking
ought to exclude data derived from the observed activity after it has
been de-identified (i.e., no longer associated with a particular user).
I'd like to say "retention, use, or sharing of XXX data derived
from that activity", but there is no good adjective for XXX given
historical concerns over the terms "personal data" and PII, conflicting
common meanings for the terms "identified" and "linkable", and a bad
taste in my mouth for something like "non-de-identified".

> There is a whole host of data you can remember about me that is not specifically tying me to a given request, under this definition. I don't think that is acceptable.

I don't see anything in my definition that is restricted to tying you
to a given request.

There is a whole host of data that I can remember about you that has
nothing to do with tracking.  It isn't our job to prevent Web sites
from knowing their own customers, for example, since we are not
working on a protocol for anonymous browsing.

....Roy
Received on Wednesday, 16 October 2013 08:53:10 UTC