- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Sat, 25 Feb 2012 00:23:04 -0800
- To: Matthias Schunter <mts@zurich.ibm.com>
- Cc: public-tracking@w3.org
This had the wrong subject -- changed to ACTION-133.
On Feb 24, 2012, at 3:15 AM, Matthias Schunter wrote:
> Hi Folks,
>
> I created a table in the W3C Wiki to start comparing both approaches:
> http://www.w3.org/wiki/DntResponseHeaderOrURI
>
> Feel free to correct, augment, improve my initial (likely to be
> subjective) assessment.
I would have to remove your assessment entirely, since it doesn't make
any sense to me. Adding one personal opinion on top of another, with
more personal opinions to be overlaid, doesn't work very well,
particularly with an imaginary ++/-- valuation. And a couple of
those entries seem reversed.
Let's get the personal opinions out here first and then use the wiki
(or the draft) to document what we actually agree are facts.
Here is my summary based on the criteria in the table:
Criteria:
Transmits tracking status
Both solutions are equally expressive. Both are dynamic when
they need to be. The resource can echo the client's DNT setting
without impacting normal request caching. The header cannot.
Enables enforcement by regulators
Both solutions enable enforcement. The headers tell the user the
tracking status of a request that was just made. The resource tells
the user the tracking status of all resources matching a specific
URI prefix for a specific time period (no less than 24hrs).
The resource status can be viewed, archived, and printed by any user
using any browser, crawled by spiders, and indexed by search engines
(custom or general), whereas the header field is only viewable with
tools that normal users don't use, and it requires a separate tool to
save it for archival purposes. The resource status could be further
extended with fields for digital signatures, though I doubt that
would be necessary.
Granularity
Whether it is per-request is not relevant. Per-resource is.
Both solutions can differentiate specific policies per specific
resource, if that is how the origin server wants to implement
their site. The header field informs the user after the request
has been made. The status resource defines a scope of applicability
that may result in two extra requests for an agent that is
actively verifying tracking status.
Simplicity of user agent
Reading a header after the request has been made is usually easier
than making a separate request. OTOH, finding out the status
after a request has been made is less useful than before. A JSON
response is less likely to be lost by intermediaries and easier
to process by javascript and extensions that might not have access
to the HTTP header fields.
Traffic generated
Response header: Roughly 8 bytes per response minimum on every
response to every request made over HTTP. Estimated traffic
generated is some number of terabytes per day. For example, if we
take www.google.com alone at 1 billion searches per day, with each
search invoking roughly 14 subrequests, we have a minimum of 120GB
per day of extra traffic generated at that site alone, regardless
of whether the user agents care to receive that information.
Status resource: Roughly 1 KB per site visited per day per actively
verifying user agent, excluding those sites that the user
agent has chosen to always-ban or always-accept. Estimated
traffic is some number of megabytes per day for all sites combined,
depending on how many users choose to enable active verification
and how many sites require a dynamic response (i.e., tracking).
Note that verification is *not* necessary to satisfy DNT, so the
traffic generated by the status resource for DNT enabled without
active verification is zero.
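
As a rough check of the response-header figure above, the arithmetic can be sketched as follows. This assumes the search itself plus its 14 subrequests each get one response (15 responses per search), 8 bytes per response, and decimal gigabytes; the function name is only illustrative.

```python
def header_overhead_bytes(searches_per_day: int,
                          subrequests_per_search: int,
                          bytes_per_response: int) -> int:
    """Extra bytes per day if every response carries the header field."""
    # Assumption: one response for the search plus one per subrequest.
    responses = searches_per_day * (1 + subrequests_per_search)
    return responses * bytes_per_response


daily = header_overhead_bytes(1_000_000_000, 14, 8)
print(daily / 1e9, "GB per day")  # prints: 120.0 GB per day
```

Counting only the 14 subrequests would give 112 GB instead, so the 120GB figure is consistent with either reading to within rounding.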
Robustness wrt caching
Response header: if it doesn't echo the user's request, then it has
no additional impact on caching -- tracking resources are typically
marked as non-cacheable or at least must-revalidate. Deployed
intermediaries might fail to forward the response header, though
I think that is unlikely (failing to forward a new request header
field is more common, but that will be fixed over time).
Status resource: it is a separate resource, so can be separately
cached, delivered by separate servers, redirected to common
locations, etc. In short, it is equivalent to favicon.ico
except that only a small number of user agents would make the
request.
Tracking protection on info resource [you reversed the +/- values here]
Response header: The header proposal uses a well-known resource for
supplemental information.
Status resource: The status resource is the info resource.
> Comments / Questions for Well-known URIs:
> o Is there a way to prevent that each URL needs to always be checked at the well-known location? E.g., retrieving foo.com/bar/one requires checking foo.com/.well-known/dnt/bar/one. If I now want to retrieve foo.com or foo.com/bar/one/sub, I need to re-check. Don't I? Wouldn't this double the web traffic (sort-of?)
First of all, nobody *needs* to make any checks. DNT is still enabled
without needing to check. Verification of status is an optional feature
that does nothing to ensure compliance -- it merely provides a means to
obtain that status if an agent wants to know what the server claims and
to record that claim for posterity.
When verification is desired, the first request is to "/.well-known/dnt".
The scope of its applicability is defined by the path member in the response.
If (and only if) that resource does not apply to the target URI, then a second
request is made on "/.well-known/dnt/target/path". It is extremely unlikely
that a third-party site is going to have more than one tracking policy per
site, but this mechanism allows for that case without adding overhead to
the common case of one policy per domain.
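
The lookup order described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch` is a hypothetical callable standing in for an HTTP GET on the origin server, the "path" member is treated as a simple prefix of the target path, and the in-memory `site` dictionary and its contents are made up for the example.

```python
import json
from typing import Callable, Optional

WELL_KNOWN = "/.well-known/dnt"


def path_applies(status: dict, target_path: str) -> bool:
    # Assumption: "path" is a prefix that scopes the status object.
    scope = status.get("path")
    return isinstance(scope, str) and target_path.startswith(scope)


def lookup_status(fetch: Callable[[str], Optional[str]],
                  target_path: str) -> Optional[dict]:
    # First request: the base status resource.
    body = fetch(WELL_KNOWN)
    if body is not None:
        status = json.loads(body)
        if path_applies(status, target_path):
            return status
    # Second request, made only when the first response does not apply.
    body = fetch(WELL_KNOWN + target_path)
    return json.loads(body) if body is not None else None


# Hypothetical in-memory "site" standing in for real HTTP requests.
site = {
    "/.well-known/dnt": '{"path":"/ads/","tracking":true}',
    "/.well-known/dnt/bar/one": '{"path":"/bar/","tracking":false}',
}
requests_made = []


def fake_fetch(path: str) -> Optional[str]:
    requests_made.append(path)
    return site.get(path)


print(lookup_status(fake_fetch, "/bar/one"))
print(requests_made)
```

With this made-up site, the target "/bar/one" is outside the scope of the base resource, so two requests are made. A site whose base resource declares the path "/" would be answered by the first request alone.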
Other criteria:
Deployability (how easy is it to add it to existing web sites)
Response header: Assuming the site owner knows what a header field is
and knows how to configure their server to send the header and
has permission to do so by the site operator, then this can be
configured via a SetHeader rule (if static) or a custom module
for those folks on Apache. For dynamic resources, some code
modification might be required depending on how they are implemented.
This proposal also requires a well-known address with the ability
to process query fields.
Status resource: For sites that don't track, add a single file with
the content
{"path":"/","tracking":false}
and assign it the application/json type. For sites that do track,
a dynamic response can be achieved with any common template language,
custom module, or CGI. More importantly, since the status resource is
implemented entirely separately from the existing resources, there
is no need to worry about breaking the existing site. Sites that
use many different domains with a single policy can redirect to one
location. Even complicated sites like Yahoo! could deploy this in a
single day, since it would involve no risk to their working apps.
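
For a site that cannot simply drop in a static file, the non-tracking case might be served by a few lines of code instead. The following is a minimal sketch using Python's standard library; the handler name, the `run` helper, and the port are assumptions for illustration, and it runs alongside the existing site rather than modifying it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# The status body for a site that does not track, as given above.
STATUS_BODY = json.dumps({"path": "/", "tracking": False},
                         separators=(",", ":")).encode("utf-8")


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves only the status resource."""

    def do_GET(self):
        if self.path == "/.well-known/dnt":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(STATUS_BODY)))
            self.end_headers()
            self.wfile.write(STATUS_BODY)
        else:
            self.send_error(404)


def run(port: int = 8000) -> None:
    # Blocks until interrupted; the port number is arbitrary.
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Calling `run()` starts the server locally. In most deployments a static file with the right media type would do the same job with no code at all.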
Request latency
Response header: Every response has to include a header field prior to the
content being sent, which means it adds a small latency to every response.
Status resource: If no active verification is needed, no latency is added.
If verification is on but done asynchronously (not prior to making the
actual request), then the only noticeable latency would be the general
overhead and use of connections on the user agent. If prior verification
is enabled, then substantial latency is added to the first request of a
site due to the additional one or two requests (if not already performed
for that site). However, prior verification isn't even possible with the
header proposal.
Third-party verification / Measuring deployment
Response header: A third party can crawl a site to see if every one of its
exposed resources is flagged as respecting DNT or not, assuming that
the sites don't mind a crawler that doesn't respect robots.txt and
adds false counts to its advertising counters. Right. That isn't going
to happen, and there is nothing to prevent the sites from sending a
different response to the crawler than they would to a user.
Status resource: A third party can crawl every domain on the web and safely
request the base well-known address, index the response, and make that
available to user agents (or regulators) for evaluation of deployment
or a curated list of claims-to-be-compliant sites. Each response has
a minimum TTL of 24 hours, longer if noted by expires or max-age.
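
A crawler or index that caches status responses might compute their lifetime along these lines. This is a sketch only: it looks at the max-age directive of a Cache-Control header value and ignores Expires, and the function and constant names are assumptions for illustration.

```python
import re

MIN_TTL_SECONDS = 24 * 60 * 60  # the 24-hour minimum described above


def status_ttl_seconds(cache_control: str = "") -> int:
    """Lifetime to assume for a cached status response, in seconds."""
    # Find a max-age directive at the start or after a comma or space.
    match = re.search(r"(?:^|[\s,])max-age\s*=\s*(\d+)",
                      cache_control, re.IGNORECASE)
    max_age = int(match.group(1)) if match else 0
    # Never go below the minimum; use max-age only when it is longer.
    return max(MIN_TTL_SECONDS, max_age)


print(status_ttl_seconds("public, max-age=604800"))  # prints: 604800
print(status_ttl_seconds("max-age=60"))              # prints: 86400
```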
Transparency
Response header: only indicates compliance or non-compliance, which means
the entire working group must agree to a single definition of tracking
that encompasses all necessary exceptions and somehow explain that to
users.
Status resource: indicates "tracking", "no tracking", or "tracking with
limitations", which allows the user agent to choose whether it wants
to distinguish tracking in general from tracking that is specifically
limited to categories-to-be-defined-later with an agreement to adhere
to data minimization. We can use the common definition of tracking if
each of the exceptions is defined as acceptable tracking with limitations.
Individual control over data stored
Response header: enabled via a separate well-known resource, though underspecified.
Status resource: enabled via links provided in the status response.
Applicable outside of HTTP
Response header: any protocol with response header fields.
Status resource: any protocol that has URIs with a path and the ability to get.
Cheers,
Roy T. Fielding <http://roy.gbiv.com/>
Principal Scientist, Adobe Systems <http://adobe.com/enterprise>
Received on Saturday, 25 February 2012 08:23:28 UTC