Re: Draft finding - "Transitioning the Web to HTTPS"

Hi Noah,

> On 9 Dec 2014, at 11:57 am, Noah Mendelsohn <nrm@arcanedomain.com> wrote:
> 
> I'm really delighted to see you undertaking this: a very important topic and just the sort of thing the TAG should be doing IMO.

Thanks, I agree (obviously).


> I didn't see an indication of where comments should go, so I'll make two here:

On-list or in the repo's issues list are the natural places, I think.


> I. Caching and proxies
> 
> I would love to see a really balanced analysis of whatever you discover to be the key tradeoffs involving caching. E.g. where exactly will caching capability likely be lost, and in which such places will the loss be painful? Will the continued need for caching lead to changes in deployment of keys, certs and endpoints, if those are the right terms? In other words, when will the need for caching result in a cache node acting as a decrypting "man in the middle" when it might not otherwise? How about things like deep packet inspection (which seems to have some clearly laudable uses, e.g. for routing incoming traffic, and some more controversial uses)?
> 
> So many HTTP features and so much of the Web's early deployment focused on making proxies and caching effective. No doubt that's become somewhat less important as links have gotten cheaper and faster, but it would be great to see a balanced exploration of the tradeoffs as they stand. If the result of that analysis is that HTTPS is mostly practical and desirable, then all the better.

Very much agreed. There's a lot of data here, and I was reluctant to overload the document with too much detail (yet). It might end up in a separate document.

Some points that I find interesting, off the top of my head (apologies for the dump):

* It's long been observed that many aspects of shared Web caching roughly follow a Zipf curve: a comparatively VERY small number of popular cacheable responses accounts for the bulk of traffic, followed by a very long tail. In the past ~two years, much of the "head" has already been encrypted, with things like Facebook, Twitter, Google, Yahoo!, etc. taking the lead. Anecdotal evidence suggests that shared cache hit rates have fallen at least partially as a result (other possible factors: more dynamic sites, decreasing trust in caches), since caches are left with just "tail." If we assume that those sites aren't going to go back to unencrypted connections (i.e., they're a dead loss), we're left with the remaining sites, many of which don't get great service from shared caching anyway (due to where they are on the curve).

So, one question to ask is whether encrypting the tail is going to be any worse than what we've already seen in the head, from the standpoint of getting value out of shared proxy caching. My suspicion is "not even close."
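
To make that concrete, here's a rough back-of-the-envelope sketch in Python; the URL count, Zipf exponent and "head" size are illustrative assumptions on my part, not measurements from any real trace:

    # Back-of-the-envelope: how much traffic is left for a shared forward
    # cache once the "head" of a Zipf-like popularity curve goes HTTPS.
    # All parameters are illustrative assumptions, not measurements.

    N = 1_000_000     # distinct cacheable URLs
    S = 1.0           # assumed Zipf exponent (often cited as close to 1)
    HEAD = 1_000      # suppose the 1,000 most popular URLs are now encrypted

    weights = [1.0 / (rank ** S) for rank in range(1, N + 1)]
    total = sum(weights)

    head_share = sum(weights[:HEAD]) / total
    print(f"top {HEAD:,} URLs: {head_share:.0%} of requests")   # ~52%
    print(f"left for a forward cache: {1 - head_share:.0%}")    # ~48%

Even with made-up parameters the point holds: the head is roughly half of all requests, and once it's encrypted a shared forward cache only ever sees the other half -- the part it was never very good at caching in the first place.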

* Much of that "head" encrypted traffic is still being cached, but by reverse proxies (CDNs, "HTTP accelerators" and the like) rather than traditional "forward" proxies. This trend has been going on for much longer; content providers want to maintain control of their content, and want repeatable performance. An intermediary deployed by them (or on their behalf) provides that, while an intermediary deployed by the network acts in the network's interests (sometimes doing things like caching beyond the freshness lifetime, changing responses, etc.).

In other words, I strongly suspect that the apparent loss of shared cache efficiency in proxies is more than made up for by shared cache efficiency in gateways (aka "reverse proxies" of various sorts) -- if you're just worried about load on the origin server, its Internet connectivity and the backhaul to wherever the reverse proxy is.

* A major caveat here is locality to the end user. In the general case, a forward proxy will be closer to the end user than a reverse proxy (although there's a lot of variance on both sides), meaning it's saving stress on the user's provider network more often. On the other hand, hit rates in the former usually top out at about 30%, whereas the latter see upwards of 95% (or even 99%) in many cases.
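
If it's useful, here's the same tradeoff as crude arithmetic (Python again; the traffic volume is made up, the hit rates are the rough figures above, and I'm assuming the reverse proxy sits outside the user's provider network, per the "general case" caveat):

    # Crude illustration of the locality tradeoff. Traffic volume is made
    # up; hit rates are the rough figures quoted above.

    traffic_gb = 1000.0
    forward_hit = 0.30   # shared forward proxy inside the user's provider network
    reverse_hit = 0.95   # CDN / reverse proxy near the origin

    # Load on the origin and its backhaul:
    print(f"origin load, forward proxy: {traffic_gb * (1 - forward_hit):.0f} GB")   # 700 GB
    print(f"origin load, reverse proxy: {traffic_gb * (1 - reverse_hit):.0f} GB")   #  50 GB

    # Traffic the user's provider network has to haul in (assuming the
    # reverse proxy is outside it):
    print(f"provider ingress, forward proxy: {traffic_gb * (1 - forward_hit):.0f} GB")  # 700 GB
    print(f"provider ingress, reverse proxy: {traffic_gb:.0f} GB")                      # 1000 GB

In other words, the reverse proxy is far better for the origin, but does little for the access network unless it's deployed inside it (or peers very close to it).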

* Another caveat is locality in space+time; e.g., when everyone in an office visits a Web page, or downloads some software (again, assuming that the content is actually cacheable). However, in many cases this traffic isn't served out of a proxy cache today (because one isn't deployed, or the response isn't cacheable, or...).

* After noticing the above, a natural thought is to consider schemes where data is encrypted / signed and cached, perhaps discovered through some p2p scheme. However, these invariably leak data about what's being browsed, and are therefore probably a non-starter. This sort of approach has roughly the same properties as SRI used for caching: you maintain integrity and authentication, but lose confidentiality (unless you go down the route of something like <https://en.wikipedia.org/wiki/Private_information_retrieval>, but AFAIK that's nowhere near ready for production).

It's attractive to consider introducing these with very limited scope (e.g., explicit buy-in to shared caching on the origin side as well as the client), but doing so makes things considerably more complex (both because you need something like markup support, and because it complicates the security model for the user). My gut feeling is that it'll be difficult to get real value / network effects here. Would still love to see an attempt.
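
For concreteness, the "integrity and authentication without confidentiality" property is easy to sketch; this just mirrors what SRI-style digests give you today, it's not a proposal:

    # Sketch: an untrusted shared cache can serve the bytes, and the client
    # can verify they're the right bytes -- but the cache still sees exactly
    # what was fetched, so confidentiality is lost.
    import base64
    import hashlib

    def sri_digest(body: bytes) -> str:
        # "sha384-" + base64 of the SHA-384 digest, as SRI uses in markup.
        return "sha384-" + base64.b64encode(hashlib.sha384(body).digest()).decode("ascii")

    # The origin publishes the digest over an authenticated channel (e.g. in
    # markup served over HTTPS); the body itself can then come from anywhere.
    expected = sri_digest(b"console.log('hello');")

    print(sri_digest(b"console.log('hello');") == expected)   # True: integrity holds
    print(sri_digest(b"console.log('pwned');") == expected)   # False: tampering caught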

* The example of a village with poor access (e.g., in Africa) has regularly been brought up in the IETF as an example of a population who want shared caching, rather than encryption. The (very strong) response from folks who have actually worked with and surveyed such people has just as regularly been that many of these people value security and privacy more. 

* DPI and other proxy-ish (not cache) use cases are a completely different thing -- what you're really asking about is the value of intermediation, not just shared caching. One place to start here: <http://tools.ietf.org/html/draft-hildebrand-middlebox-erosion-01>. Note that the primary author is a member of the IAB, FWIW.

* That leads pretty naturally to a discussion of the priority of constituencies, as defined by HTML5 <http://www.w3.org/TR/html-design-principles/#priority-of-constituencies> -- it'd be interesting to apply it here and maybe make it a wider discussion within the W3C (we've already started putting our foot into this water in the IETF: <http://tools.ietf.org/html/draft-nottingham-stakeholder-rights-00>).

* Finally, with all of that said - networks definitely have a role to play, and there has been a fair amount of discussion in the IETF and elsewhere as to how they can manage their costs and meet reasonable goals without impinging upon security. This discussion is very much in its infancy, and there are many tricky problems (e.g., setting sane defaults, security user experience (or the lack thereof)). There are a number of ways that such efforts might get traction, but I'm really reluctant to include anything along these lines in the finding, both because we've already seen a number of false starts, and because the process is turning out to be (surprise) quite political.


> II. Privacy
> 
> I also have the vague impression that there is a loss of privacy that indirectly results from the reduced practicality of proxies, but I'm not sure that intuition is correct. If there are privacy issues with the HTTPS transition, that would be worth exploring too.

Love to hear more if you can triangulate.


> Thank you. Good luck with this!

Thanks!


> Noah
> 
> On 12/8/2014 6:28 PM, Mark Nottingham wrote:
>> We've started work on a new Finding, to a) serve as a Web version of the IAB statement, and b) support the work on Secure Origins in WebAppSec.
>> 
>> See: <https://w3ctag.github.io/web-https/>
>> 
>> Repo w/ issues list at <https://github.com/w3ctag/web-https>.
>> 
>> Cheers,
>> 
--
Mark Nottingham   https://www.mnot.net/

Received on Tuesday, 9 December 2014 03:43:47 UTC