Re: Migrating some high-entropy HTTP headers to Client Hints. from Ronan Cremin on 2019-04-17 (ietf-http-wg@w3.org from April to June 2019)

From: Ronan Cremin <rcremin@afilias.info>
Date: Wed, 17 Apr 2019 17:42:24 +0100
To: Yoav Weiss <yoav@yoav.ws>
Cc: Thomas Peterson <hidinginthebbc@gmail.com>, Mike West <mkwst@google.com>, HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <5b492ae2-ea0b-c10b-f012-637413ef304a@afilias.info>

Yoav,

Thanks for the reply, comments below.

Regards,
Ronan

On 11/04/2019 22:47, Yoav Weiss wrote:
> Hey Ronan,
>
>
> On Thu, Apr 11, 2019 at 8:11 AM Ronan Cremin <rcremin@afilias.info> wrote:
>
>     Hi,
>
>     My name is Ronan Cremin and I help to build a device recognition
>     product widely used in the web analytics, publishing and
>     advertising industries.
>     Full disclosure: my employer profits from analysis of UA strings,
>     though moving the same information to client hints is not expected
>     to impact this materially.
>
>     One concern over moving UA string information to Client Hints is that
>     the information required to publish device-specific responses arrives
>     only in the second request from the client. This imposes a
>     performance penalty on publishers that serve a device-tailored HTML
>     document. As Mike mentioned, RWD notwithstanding, many publishers
>     employ device-specific responses as envisaged in RFC1945, usually to
>     tailor the experience to a class of device, e.g. smartphone, tablet,
>     desktop and so on.
>
>
> The viewport Client Hint can provide such distinction, but it exposes
> more bits than are actually needed, so not a great option to expose by
> default, without an opt-in.
> From your description, maybe exposing another tri-state hint by
> default will be enough to cover the use-case, and maybe it will not
> expose too many bits about the user in the process.
> Can you open an issue on
> https://github.com/WICG/ua-client-hints describing your use-case?
Sure, I'll create an issue.

For the record, I'm not sure if a tri-state will do it. In my statement 
above I perhaps over-simplified the description. Many publishers use 
content adaptation to a device-specific level, not just the broader 
device class.
>
> If we were to conclude that something like that is privacy-safe, I
> guess the main problem would be to define where the line is drawn
> between a phone and a tablet, and between a tablet and a laptop.
> I suspect a standard definition of those borders is likely to become
> stale fairly quickly...
True, any boundary definition along these lines would not last long
enough to be useful. It's already a continuum: the ‘phablet’ term
illustrates the problem, and folding screens will blur the lines further.
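
To illustrate the boundary problem, here is a minimal sketch of what a
tri-state classification might look like. The cut-off values and the
function name are hypothetical, chosen only for illustration; nothing in
the proposal defines them.

```python
def device_class(viewport_width, phone_max=480, tablet_max=1024):
    """Map a viewport width (CSS pixels) to one of three classes.

    phone_max and tablet_max are hypothetical cut-offs, not values taken
    from any specification or proposal.
    """
    if viewport_width <= phone_max:
        return "smartphone"
    if viewport_width <= tablet_max:
        return "tablet"
    return "desktop"
```

With these assumed defaults a 500-pixel-wide device is a "tablet", but
moving phone_max to 600 makes the same device a "smartphone". The class
depends entirely on where the line is drawn.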
>
>     Publishers endeavour to fit everything required for the first screen
>     of content into this first response, so a delay to this impacts
>     performance. The last time I checked more than 80% of the top 100
>     websites used this technique.
>
>
> When was that? Do you have data you can point us to?
It was several years ago, so I ran another quick analysis. My test does
a GET on the main site entry point with a few different user-agent 
strings. I'm not fetching any additional linked resources, nor storing 
cookies or any other state. Public data for Alexa rankings is now 
limited to the top 50 sites, so that's what I used.

In summary, more than 80% of the top 50 Alexa sites use device
adaptation at some level, with the majority offering more than two
adaptations.

I'm using the byte size of the first HTML response as an indicator of
device adaptation. I believe that this is crude, but workable. Some
sites were blocked by my corporate firewall, but when I checked them
afterwards via a VPN the percentage using UA adaptation only increased.

https://docs.google.com/spreadsheets/d/1ZPmpwsowowgLugVCEZuKaIeJUKw17jTjLN3iDzfK1AA/
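
For anyone who wants to repeat this, the test can be sketched roughly as
below. This is a minimal reconstruction, not the script that produced
the spreadsheet: the shortened UA strings, the function names and the 5%
size tolerance are all assumptions made for illustration.

```python
from urllib.request import Request, urlopen

# Hypothetical, shortened UA strings; a real run would use full strings.
USER_AGENTS = {
    "desktop": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "smartphone": "Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X)",
    "tablet": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X)",
}


def first_response_size(url, user_agent, timeout=10):
    """GET the entry point once with the given UA; return body size in bytes.

    No cookies are stored and no linked resources are fetched. Note that
    urlopen follows redirects, so the size is that of the final document.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        return len(response.read())


def count_adaptations(sizes, tolerance=0.05):
    """Count groups of byte sizes that lie within `tolerance` (relative).

    The tolerance is an assumed allowance for per-request noise. A result
    above 1 suggests the first HTML response varies by User-Agent.
    """
    groups = []  # representative (smallest) size of each group so far
    for size in sorted(sizes):
        if groups and size - groups[-1] <= tolerance * groups[-1]:
            continue  # close enough to the current group's representative
        groups.append(size)
    return len(groups)
```

A run over one site would collect first_response_size(url, ua) for each
entry in USER_AGENTS and pass the resulting sizes to count_adaptations.
Byte size is a crude proxy, so the tolerance would need tuning against
sites that embed per-request tokens or rotating content.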
>
>
>     Web analytics might also be impacted. Most web analytics solutions
>     support a JavaScript-free integration approach based on linking a
>     single pixel image hosted by the analytics platform. The ability to
>     do this is impacted for the same reason: the information required
>     for analytics becomes available only on the second request from the
>     client.
>
>
> I'm not sure that's a winning argument, as it sounds like those 
> analytics vendors exploit the current UA string to extract bits of 
> information from passive requests.
> The current proposal will enable them to do the same (with the same 
> number of RTTs), but only after an explicit opt-in to receive that 
> data from the browser. An opt-in that can be monitored by the browser, 
> extensions and privacy researchers.
Exploit is possibly a loaded word here. Sending a UA string is a SHOULD
requirement in the HTTP RFCs, not a MUST. User agents /voluntarily/ send
a UA string; they are not obliged to, just as they wouldn't be obliged
to send Hints under the proposal. Further, the contents of the UA string
are entirely up to the browser, so browsers are already empowered to
decide how many bits of information to include. The Safari team have
already elected to reduce the meaningful bits.
>
>
>     Has thought been given to the performance impact of the proposal?
>
>
> Yes.
>
>     Yoav mentions this issue in his Client Hints infrastructure document
>     (https://github.com/yoavweiss/client-hints-infrastructure) but I
>     haven't seen any attempt to quantify the impact.
>
>
> As indicated in the document you linked to, we currently don't have a
> great way to make fingerprinting-bits-exposing Client Hints an opt-in
> while still sending them on the very first request.
> That's unfortunate and we hope to improve on that in the future.
> At the same time, the User-Agent string is exposing many bits of
> entropy, so it is a privacy hole we're interested in blocking.
Improving privacy is a laudable goal; I can't imagine that anyone would
take issue with it.

But it's worth asking whether the goal justifies the cost of attaining
it. For the particular case of moving the user-agent string to
Hints (rather than Hints in general), there are 20+ years of
infrastructure based around RFC1945 and its successors. There are known
performance penalties in moving the UA string to Hints. Add to this the
fact that there isn't yet a consensus on whether this move would improve
or worsen privacy and I think it's fair to say that there is a debate
to be had around the overall merits of the proposal.
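
To make the performance penalty concrete, here is a toy simulation of
the opt-in flow as I understand it. It is a sketch under assumptions,
not an implementation of the drafts: the `Sec-CH-*` header names follow
the pattern of `Sec-CH-Lang` and the hint names in Mike's sketch, and
the class and function names are invented for illustration.

```python
# Hint names assumed from the proposal sketched in this thread.
REQUESTED_HINTS = ("UA", "Platform", "Model")


def handle_request(request_headers):
    """Return (response_headers, variant) for one request.

    The server can tailor the response only if the (assumed) Sec-CH-Model
    header is present; otherwise it serves a default document. It always
    advertises the hints it wants via Accept-CH.
    """
    response_headers = {"Accept-CH": ", ".join(REQUESTED_HINTS)}
    model = request_headers.get("Sec-CH-Model")
    if model is None:
        return response_headers, "default"
    return response_headers, "tailored:" + model


class Client:
    """Toy client that sends hints only after the origin has opted in."""

    def __init__(self, hints):
        self.hints = hints    # hint name -> value the client could send
        self.opted_in = ()    # hint names the origin has asked for

    def request_headers(self):
        return {
            "Sec-CH-" + name: self.hints[name]
            for name in self.opted_in
            if name in self.hints
        }

    def receive(self, response_headers):
        accept_ch = response_headers.get("Accept-CH", "")
        self.opted_in = tuple(
            token.strip() for token in accept_ch.split(",") if token.strip()
        )
```

In this model the first request carries no hints, so the first response
is the "default" variant. Only the second request, sent after the client
has seen Accept-CH, can receive a tailored document.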
>
>
>     Regards,
>     Ronan
>
>     On 29/11/2018 12:08, Thomas Peterson wrote:
>     > I would propose that all Accept* headers are included in Client
>     > Hints as all can be used for some level of fingerprinting, e.g.
>     > Accept can be used to distinguish between desktop browsers (which
>     > typically have html/xml MIME types) and cURL/wget which by default
>     > have '*/*'. Many user agents also do their own guesswork on
>     > response bodies anyway (such as looking at the magic number) to
>     > determine content type or encoding, so the impact of a "failed
>     > negotiation" of content can be limited.
>     >
>     > Also, is there a particular reason why Sec-CH-Lang omits Quality
>     > Values?
>     >
>     >
>     > Regards
>     >
>     >
>     > On 29/11/2018 10:22, Mike West wrote:
>     >> Hey folks,
>     >>
>     >> Section 9.7 of RFC7231
>     >> <https://tools.ietf.org/html/rfc7231#section-9.7> rightly notes
>     >> that some of the content negotiation headers user agents deliver
>     >> in HTTP requests create substantial fingerprinting surface. I
>     >> think it would be beneficial if we took steps to reduce their
>     >> prevalence on the wire, and Client Hints looks like a reasonable
>     >> infrastructure on top of which to build.
>     >>
>     >> `User-Agent` and `Accept-Language` seem like particularly tasty
>     >> and low-hanging fruit, and I've sketched out two proposals as
>     >> proofs of concept:
>     >>
>     >> *   `User-Agent` could be represented as ~four distinct hints:
>     >> `UA`, `Model`, `Platform`, and `Arch`:
>     >> https://github.com/mikewest/ua-client-hints is a high-level
>     >> explainer, and
>     >> https://tools.ietf.org/html/draft-west-ua-client-hints
>     >> a sketchy ID for the new headers.
>     >>
>     >> *   `Accept-Language` could be represented as a `Lang` hint:
>     >> https://github.com/mikewest/lang-client-hint is a high-level
>     >> explainer,
>     >> https://tools.ietf.org/html/draft-west-lang-client-hint an
>     >> equally sketchy ID for the new header.
>     >>
>     >> I'd appreciate y'all's feedback. Thanks!
>     >>
>     >> -mike
>     >
>
>
>
Received on Wednesday, 17 April 2019 16:43:01 UTC