Re: Draft finding - "Transitioning the Web to HTTPS" from Mark Watson on 2014-12-15 (www-tag@w3.org from December 2014)

From: Mark Watson <watsonm@netflix.com>
Date: Mon, 15 Dec 2014 08:39:36 -0800
To: Mark Nottingham <mnot@mnot.net>
Cc: Noah Mendelsohn <nrm@arcanedomain.com>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <CAEnTvdCcCA0PtOjikhj1jFbYc6vqR-t+7xSTe=7qKTvCNmdjqQ@mail.gmail.com>

On Mon, Dec 8, 2014 at 7:43 PM, Mark Nottingham <mnot@mnot.net> wrote:

> Hi Noah,
>
> > On 9 Dec 2014, at 11:57 am, Noah Mendelsohn <nrm@arcanedomain.com>
> wrote:
> >
> > I'm really delighted to see you undertaking this: a very important topic
> and just the sort of thing the TAG should be doing IMO.
>
> Thanks, I agree (obviously).
>
>
> > I didn't see an indication of where comments should go, so I'll make two
> here:
>
> On-list or in the repo's issues list are the natural places, I think.
>
>
> > I. Caching and proxies
> >
> > I would love to see a really balanced analysis of whatever you discover
> to be the key tradeoffs involving caching. E.g. where exactly will caching
> capability likely be lost and in which such places will the loss be
> painful? Will the continued need for caching lead to changes in deployment
> of keys, certs and endpoints, if those are the right terms? In other words,
> when will the need for caching result in a cache node acting as a
> decrypting "man in the middle" when it might not otherwise? How about
> things like deep packet inspection (which seems to have some clearly
> laudable uses, e.g. for routing incoming traffic, and some more
> controversial uses)?
> >
> > So many HTTP features and so much of the Web's early deployment focused
> on making proxies and caching effective. No doubt that's become somewhat
> less important as links have gotten cheaper and faster, but it would be
> great to see a balanced exploration of the tradeoffs as they stand. If the
> result of that analysis is that HTTPS is mostly practical and desirable,
> then all the better.
>
> Very much agreed. There's a lot of data here, and I was reluctant to
> overload the document with too much detail (yet). It might end up in a
> separate document.
>

It would be good to have some clearer discussion of caching in the main
document. Presently there is a reference to "content optimization", but
it's not very clear whether this includes transparent caching. I think the
impact of HTTPS on ISP transparent caching should be clearly acknowledged
and the TAG should explain their rationale for accepting this as a
consequence of the proposed transition.


>
> Some points that I find interesting, off the top of my head (apologies for
> the dump):
>
> * It's long been observed that many aspects of shared Web caching roughly
> follow a Zipf curve; there are a comparatively VERY small number of popular
> cacheable responses creating the bulk of traffic, followed by a very long
> tail. In the past ~two years, much of the "head" has already been
> encrypted, with things like Facebook, Twitter, Google, Yahoo!, etc. taking
> the lead. Anecdotal evidence suggests that shared cache hit rates have
> fallen at least partially as a result of this (other possible factors: more
> dynamic sites, decreasing trust in caches), since they're left with just
> "tail." If we assume that those sites aren't going to be going back to
> unencrypted connections (i.e., they're a dead loss), we're left with the
> remaining sites, many of which don't get great service from shared caching
> anyway (due to where they are on the curve).


> So, one question to ask is whether encrypting the tail is going to be any
> worse than what we've already seen in the head, from the standpoint of
> getting value out of shared proxy caching. My suspicion is "not even close."
>
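
The Zipf argument above can be made concrete with a small calculation. This
is only a sketch under assumed numbers: the catalogue size, cache size, Zipf
exponent and the size of the encrypted "head" are all illustrative, and the
cache is idealised as holding exactly the most popular objects it can still
see. The function names are hypothetical.

```python
from itertools import accumulate


def zipf_weights(n_objects: int, exponent: float = 1.0) -> list[float]:
    """Unnormalised Zipf popularity: rank r gets weight 1 / r**exponent."""
    return [1.0 / rank ** exponent for rank in range(1, n_objects + 1)]


def hit_rate(weights: list[float], cache_size: int, encrypted_head: int = 0) -> float:
    """Hit rate of an idealised shared cache over the traffic it can see.

    The `encrypted_head` most popular objects are assumed to be served over
    HTTPS and are therefore invisible to the cache. The cache then holds the
    `cache_size` most popular objects among those that remain.
    """
    prefix = [0.0, *accumulate(weights)]          # prefix[i] = sum of first i weights
    visible = prefix[-1] - prefix[encrypted_head]  # traffic the cache can still see
    if visible == 0:
        return 0.0
    cached_end = min(encrypted_head + cache_size, len(weights))
    return (prefix[cached_end] - prefix[encrypted_head]) / visible


if __name__ == "__main__":
    weights = zipf_weights(100_000)               # assumed catalogue size
    before = hit_rate(weights, cache_size=1_000)
    after = hit_rate(weights, cache_size=1_000, encrypted_head=100)
    print(f"hit rate with nothing encrypted:  {before:.1%}")
    print(f"hit rate once the head is HTTPS:  {after:.1%}")
```

Under these assumptions the cache's hit rate drops noticeably once the top
ranks disappear, which is the "left with just tail" effect described above.
It says nothing about real traffic mixes, which would need measured data.
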
> * Much of that "head" encrypted traffic is still being cached, but by
> reverse proxies (CDNs, "HTTP accelerators" and the like) rather than
> traditional "forward" proxies. This trend has been going on for a much
> longer time; content providers want to maintain control of their content,
> and want repeatable performance; an intermediary deployed by them (or on
> their behalf) does that, while an intermediary deployed on behalf of the
> network acts on behalf of the network (sometimes doing things like caching
> longer than the freshness lifetime, changing responses, etc.).
>

We have frequently observed transparent proxy caches which do not respect
the HTTP specification, sometimes even modifying message bodies. Some of
these behaviours are deliberate, some are bugs or misconfiguration, but
either way they cause service problems, customer service calls, etc. and
are very hard to debug: site owners are left having to reverse engineer a
black box in the ISP's network with only whatever diagnostics
their clients return to them.
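
As an illustration only (not a description of any deployed system), one kind
of client-side diagnostic is a digest check: the client hashes the body it
actually received and compares it with a digest the origin published. The
sketch assumes the expected digest reaches the client over a channel the
proxy cannot alter; the function names and sample bodies are hypothetical.

```python
import hashlib


def body_digest(body: bytes) -> str:
    """Hex-encoded SHA-256 digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def was_modified_in_transit(received_body: bytes, expected_digest: str) -> bool:
    """True if the received body does not match the digest the origin published."""
    return body_digest(received_body) != expected_digest.lower()


if __name__ == "__main__":
    original = b"<html><body>hello</body></html>"        # what the origin sent
    expected = body_digest(original)                      # published by the origin
    tampered = original.replace(b"</body>", b"<script src='x.js'></script></body>")

    print(was_modified_in_transit(original, expected))    # unmodified body
    print(was_modified_in_transit(tampered, expected))    # body rewritten by a proxy
```

A check like this only detects that something changed; it does not say which
box in the path changed it, so the debugging problem above remains.
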


...Mark



> In other words, I strongly suspect that the apparent loss of shared cache
> efficiency in proxies is more than made up for by shared cache efficiency
> in gateways (aka "reverse proxies" of various sorts) -- if you're just
> worried about load on the origin server, its Internet connectivity and the
> backhaul to wherever the reverse proxy is.
>
> * A major caveat here is locality to the end user. In the general case, a
> forward proxy will be closer to the end user than a reverse proxy (although
> there's a lot of variance on both sides), meaning it's saving stress on the
> user's provider network more often. On the other hand, hit rates in the
> former usually top out at about 30%, whereas the latter see upwards of
> 95% (or even 99%) in many cases.
>
> * Another caveat is locality in space+time; e.g., when everyone in an
> office visits a Web page, or downloads some software (again, assuming that
> the content is actually cacheable). However, in many cases this traffic
> isn't served out of a proxy cache today (because one isn't deployed, or the
> response isn't cacheable, or...).
>
> * After noticing the above, a natural thought is to consider schemes where
> data is encrypted / signed and cached, perhaps discovered through some p2p
> scheme. However, these invariably leak data about what's being browsed, and
> are therefore probably a non-starter; this sort of approach has roughly the
> same properties as SRI used for caching, in that you maintain integrity and
> authentication, but lose confidentiality (unless you go down the route of
> something like <
> https://en.wikipedia.org/wiki/Private_information_retrieval>, but AFAIK
> that's not anywhere near ready for production).
>
> It's attractive to consider introducing these with very limited scope
> (e.g., explicit buy-in to shared caching on the origin side as well as the
> client), but it makes things considerably more complex to do so (both
> because you need something like markup support, as well as making the
> security model more complex for the user). My gut feeling is that it'll be
> difficult to get real value / network effects here. Would still love to see
> an attempt.
>
> * The example of a village with poor access (e.g., in Africa) has
> regularly been brought up in the IETF as an example of a population who
> want shared caching, rather than encryption. The (very strong) response
> from folks who have actually worked with and surveyed such people has just
> as regularly been that many of these people value security and privacy more.
>
> * DPI and other proxy-ish (not cache) use cases are a completely different
> thing -- what you're really asking about is the value of intermediation,
> not just shared caching. One place to start here: <
> http://tools.ietf.org/html/draft-hildebrand-middlebox-erosion-01>. Note
> that the primary author is a member of the IAB, FWIW.
>
> * That leads pretty naturally to a discussion of the priority of
> constituencies, as defined by HTML5 <
> http://www.w3.org/TR/html-design-principles/#priority-of-constituencies>
> -- it'd be interesting to apply here and maybe make it a wider discussion
> among the W3C (we've already started putting our foot into this water in
> the IETF: <
> http://tools.ietf.org/html/draft-nottingham-stakeholder-rights-00>).
>
> * Finally, with all of that said - networks definitely have a role to
> play, and there has been a fair amount of discussion in the IETF and
> elsewhere as to how they can manage their costs and meet reasonable goals
> without impinging upon security. This discussion is very much in its
> infancy, and there are many tricky problems (e.g., setting sane defaults,
> security user experience (or the lack thereof)). There are a number of ways
> that such efforts might get traction, but I'm really reluctant to include
> anything along these lines in the finding, both because we've already seen
> a number of false starts, and because the process is turning out to be
> (surprise) quite political.
>
>
> > II. Privacy
> >
> > I also have the vague impression that there is a loss of privacy that
> indirectly results from the reduced practicality of proxies, but I'm not
> sure that intuition is correct. If there are privacy issues with the HTTPS
> transition, that would be worth exploring too.
>
> Love to hear more if you can triangulate.
>
>
> > Thank you. Good luck with this!
>
> Thanks!
>
>
> > Noah
> >
> > On 12/8/2014 6:28 PM, Mark Nottingham wrote:
> >> We've started work on a new Finding, to a) serve as a Web version of
> the IAB statement, and b) support the work on Secure Origins in WebAppSec.
> >>
> >> See: <https://w3ctag.github.io/web-https/>
> >>
> >> Repo w/ issues list at <https://github.com/w3ctag/web-https>.
> >>
> >> Cheers,
> >>
> --
> Mark Nottingham   https://www.mnot.net/
>
>
>
Received on Monday, 15 December 2014 16:40:04 UTC