Re: why not multiple, short-lived HTTP/2 connections? from bizzbyster@gmail.com on 2014-07-01 (ietf-http-wg@w3.org from July to September 2014)

From: <bizzbyster@gmail.com>
Date: Tue, 1 Jul 2014 16:56:19 -0400
To: William Chan (陈智昌) <willchan@chromium.org>
Cc: Mike Belshe <mike@belshe.com>, HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <796F2122-F12A-4716-9028-27220D3A962C@gmail.com>
Comments inline.

On Jul 1, 2014, at 3:21 PM, William Chan (陈智昌) <willchan@chromium.org> wrote:

> On Tue, Jul 1, 2014 at 11:52 AM, <bizzbyster@gmail.com> wrote:
> Comments inline.
> 
> On Jul 1, 2014, at 12:53 AM, William Chan (陈智昌) <willchan@chromium.org> wrote:
> 
>> On Mon, Jun 30, 2014 at 11:58 AM, <bizzbyster@gmail.com> wrote:
>> I created this page to simulate the effect of multiple HTTP/2 connections that are never idle. It was just synthesized to prove a point. 
>> 
>> And my point is it doesn't mirror real webpages :)
>> 
>> 
>> "poor prioritization, poor header compression, bypassing slow start (sometimes massively so), increased buffer bloat (you're just going to hurt the responsiveness of all the other real-time applications on the shared links), increased server costs, etc etc."
>> 
>> This is a long list but if examined closely I can make a case or at least question the importance of each one.
>> 
>> Poor prioritization: the HOL blocking of a single connection means that the good prioritization decision made by the sending server is defeated by the poor prioritization in the network, which always has to deliver pieces of objects in the order they are sent, regardless of the priority of the object.
>> 
>> Defeats is the wrong way to characterize it. HOL blocking doesn't affect the prioritization. The priority ordering remains the same. HOL blocking "just" stalls the entire connection :P And putting more things on a single connection that can get stalled increases fragility due to fate sharing, I'll totally grant you that. Putting things on separate connections on the other hand definitely *does* defeat prioritization. TCP congestion control tries to be "fair" between connections, whereas we want to prioritize instead of being fair. There's definitely a tradeoff here. But saying prioritization is "defeated" is wrong.
> 
> Okay I'll put it another way. Over high bandwidth connections, I'd much rather put a high priority object on its own connection so that a dropped packet on a lower priority object in front of it cannot stall my high priority object.
> 
> So, there are a few problems here. How do you know that you have a high bandwidth connection? It's easier to know you have a low bandwidth connection, but even if you have Google Fiber, the bottleneck link is probably somewhere else then.  Especially when the content provider doesn't pay for the internet fast lanes :P So for that path, you need to spend roundtrips detecting available bandwidth. And then you assert that, for high bandwidth links, you'd rather have contention than risk HOL blocking due to dropped packets. This is definitely a tradeoff we've tested, and we don't like that tradeoff in general. That's why all modern browsers will actually choose not to request low priority resources in order to reduce contention before first paint: https://insouciant.org/tech/throttling-subresources-before-first-paint/. Yes, this can be worse for the PLT metric, but it's substantially better for the user experience metrics. So we do it.

Not requesting low priority resources is not a tradeoff between risk of contention and risk of HOL blocking. It decreases the risk for both because you have fewer object transfers on the wire at any given time. This makes sense for low bandwidth paths like for instance the 1.5 Mbps link you show in your blog post. In general, in the low bandwidth case, congestion on the link is a bigger problem than concurrency and so it makes sense to hold back that low priority resource over the 1.5 Mbps connection in your test. However if I'm talking to a US server from Australia over FIOS, or using a 20 Mbps satellite broadband connection while flying JetBlue, where my bandwidth and latency are both very high, and therefore the price of waiting to issue requests is very high, then I'm going to want to get all those requests out there as soon as I can.

I agree that it is easier to know when you have a low bandwidth connection. And when that is the case it's important to use that information to optimize the rendering of the page accordingly by minimizing congestion risk. But there are also times when we can be reasonably confident that our last mile access link is very fast and, though we risk bottlenecks elsewhere on the path, my tests show that it's well worth it to prioritize concurrency over congestion in this case.

> 
> 
>> 
>> And take a look at how huge this prioritization win can be: https://plus.google.com/+ShubhiePanicker/posts/Uw87yxQFCfY. This is something that doesn't get captured in a PLT metric, but in the application specific user experience metrics, can be enormous.
>> 
>> 
>> Poor header compression: the most important aspect of header compression from a PLT perspective as I understand it is the ability to issue a large number of GETs in the first round trip out of a new HTTP/2 connection. But if I open a second connection, I get 2x the initcwnd in the upload direction. So even though my overall compression may suffer, more connections enables me to offset the compression loss by a larger effective initcwnd.
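To make the first-flight argument concrete, here is a minimal sketch of how many GETs fit in the initial upload window. The MSS of 1460 bytes, the client initcwnd of 10 segments, and the average request sizes are all illustrative assumptions, not measurements from this thread. The larger per-request size used for two connections stands in for the compression lost by splitting requests across two separate header compression contexts.

```python
def requests_in_first_flight(initcwnd_segments: int,
                             mss_bytes: int,
                             avg_request_bytes: int,
                             connections: int = 1) -> int:
    """Whole requests that fit in the first round trip across all connections.

    Each connection gets its own initial congestion window, so the
    per-connection budget is initcwnd_segments * mss_bytes.
    """
    per_connection = (initcwnd_segments * mss_bytes) // avg_request_bytes
    return connections * per_connection


# Assumed sizes: 60 bytes per compressed GET on one connection,
# 80 bytes per GET when compression context is split across two.
one_connection = requests_in_first_flight(10, 1460, 60)
two_connections = requests_in_first_flight(10, 1460, 80, connections=2)
print(one_connection, two_connections)
```

Under these assumed numbers the second connection more than offsets the weaker compression; with a much larger compression penalty the comparison could go the other way.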
>> 
>> I think this is a fair point. But the bytes saved here are important too. Client upstream links are often more congested, and sending more bytes is suboptimal from a congestion and bufferbloat perspective, not to mention mobile metered data.
>> 
>> 
>> Bypassing slow start: for high latency connections, slow start is too slow for tiny web files. Why else are Google servers caching CWNDs to start at 30+ as you note in your blog? Another way to achieve this is to open multiple connections.
>> 
>> Caching CWNDs is a more dynamic approach that tries to guess at a better starting value for slow start. Having the client simply open up multiple connections in order to hog more of the available bandwidth is less friendly to the ecosystem.
> 
> Does caching CWND detect different access network characteristics and adjust accordingly?
> 
> Caching means storing a value for reuse. There's no adjustment. When you switch connections, one would flush the cached value and hopefully get a new value from the server for use on a subsequent visit.
>  
> 
>> 
>> 
>> Buffer bloat: HTTP/2 will cause even performance optimized browsers to use fewer connections so this issue will be reduced with HTTP/2. I admit I don't understand this issue well enough to comment on it further.
>> 
>> Increased server costs: This is a difficult thing to argue but hardware costs are always decreasing and improved PLT can arguably offset these increased costs. Again, HTTP/2 multiplexing definitely should result in FEWER connections. Just not 1 if we are trying to optimize PLT.
>>  
>> 
>> Peter
>> 
>> On Jun 30, 2014, at 2:22 PM, William Chan (陈智昌) <willchan@chromium.org> wrote:
>> 
>>> Peter, this test page is too synthetic. It has no script, nor stylesheets. It doesn't have resource discovery chains (e.g. script loading other resources). This test is purely about PLT, which is well documented now to be a suboptimal metric (http://www.stevesouders.com/blog/2013/05/13/moving-beyond-window-onload/). Prioritization is useless if you only choose resources that all receive the same priority.
>>> 
>>> I think you're right that if all you're measuring is PLT, and you have significant transport bottlenecks, that SPDY & HTTP/2 can be slower. Opening multiple connections has its own set of significant issues though: poor prioritization, poor header compression, bypassing slow start (sometimes massively so), increased buffer bloat (you're just going to hurt the responsiveness of all the other real-time applications on the shared links), increased server costs, etc etc.
>>> 
>>> Pushing for a single connection is the _right_ thing to do in so many ways. There might be specific exceptional cases where a single TCP connection is still slower. We should fix TCP, and more generally the transport, so these issues mostly go away.
>>> 
>>> On Mon, Jun 30, 2014 at 10:55 AM, <bizzbyster@gmail.com> wrote:
>>> Comments inline.
>>> 
>>> On Jun 30, 2014, at 1:00 PM, Mike Belshe <mike@belshe.com> wrote:
>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jun 30, 2014 at 9:04 AM, <bizzbyster@gmail.com> wrote:
>>>> All,
>>>> 
>>>> Another huge issue is that for some reason I still see many TCP connections that do not advertise support for window scaling in the SYN packet. I'm really not sure why this is but for instance WPT test instances are running Windows 7 and yet they do not advertise window scaling and so TCP connections max out at a send window of 64 KB. I've seen this in tests run out of multiple different WPT test locations.
>>>> 
>>>> The impact of this is that high latency connections max out at very low throughputs. Here's an example (with tcpdump output so you can examine the TCP flow on the wire) where I download data from a SPDY-enabled web server in Virginia from a WPT test instance running in Sydney: http://www.webpagetest.org/result/140629_XG_1JC/1/details/. Average throughput is not even 3 Mbps despite the fact that I chose a 20 Mbps FIOS connection for my test. Note that when I disable SPDY on this web server, I render the page almost twice as fast because I am using multiple connections and therefore overcoming the per connection throughput limitation: http://www.webpagetest.org/result/140629_YB_1K5/1/details/.
>>>> 
>>>> I don't know the root cause (Windows 7 definitely sends the window scaling option in the SYN in other tests) and have sent a note to the webpagetest.org admin, but in general there are reasons why even Windows 7 machines sometimes appear not to use window scaling, causing single connection SPDY to perform really badly even beyond the slow start phase.
>>>> 
>>>> I believe your test is invalid.
>>>> 
>>>> The time-to-first-byte (single connection) in the first case is 1136ms.  The time-to-first-byte in the second case is 768ms.  They should have been identical, right?  What this really means is that you're testing over the public net, and your variance is somewhere in the 50% range.  The key to successful benchmarking is eliminating variance and *lots* of iterations.
>>> 
>>> I've re-run this 5 times with SPDY on and a few times with SPDY off and the result is always about 2x slower with SPDY.  Here are the 5 with SPDY on:
>>> 
>>> http://www.webpagetest.org/result/140629_XG_1JC/
>>> http://www.webpagetest.org/result/140630_54_T4T/
>>> http://www.webpagetest.org/result/140630_B0_T9N/
>>> http://www.webpagetest.org/result/140630_97_TAN/
>>> http://www.webpagetest.org/result/140630_XM_TCF/
>>> 
>>> The reason is that no window scaling is happening on the TCP connection and so the SPDY case can only run at 64KB / 230 ms or about 2.2 Mbps. In the non-SPDY case I download the 6 images over 6 connections, each operating at 2.2 Mbps. In the SPDY disabled case, max throughput is 6 connections X 2.2 Mbps, which is about 13 Mbps. The webpagetest throughput curves illustrate this difference.
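The per-connection ceiling described here is just window size divided by round-trip time. The following is a rough sketch of that arithmetic using the 64 KB window, 230 ms RTT, and 20 Mbps link quoted in this thread; the function names are made up for illustration and the model ignores slow start, loss, and protocol overhead.

```python
def window_limited_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Throughput ceiling of one TCP connection that can keep at most
    window_bytes in flight per round trip."""
    bits_per_rtt = window_bytes * 8
    return bits_per_rtt / (rtt_ms / 1000) / 1_000_000


def aggregate_mbps(connections: int, window_bytes: int,
                   rtt_ms: float, link_mbps: float) -> float:
    """Combined ceiling of several window-limited connections,
    capped by the access link rate."""
    uncapped = connections * window_limited_mbps(window_bytes, rtt_ms)
    return min(uncapped, link_mbps)


WINDOW = 64 * 1024  # bytes, no window scaling
single = window_limited_mbps(WINDOW, 230)
six = aggregate_mbps(6, WINDOW, 230, link_mbps=20)
print(f"1 connection: {single:.2f} Mbps, 6 connections: {six:.2f} Mbps")
```

With these inputs the sketch gives roughly 2.3 Mbps for one connection and roughly 13.7 Mbps for six, still below the 20 Mbps link rate.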
>>> 
>>> 
>>>> 
>>>> I'm not saying you aren't seeing a trend.  You might be, but this test doesn't show it.  But variance is exacerbating the issue, the benchmark transport is suspect (for reasons you've already identified), and I further suspect that you're using a funky server transport connection (what is your initial CWND?  what is your initial server recv buffer?)
>>>> 
>>>> I also looked at the TCP dumps.  Here are some random notes:
>>>>    * Looks like your SSL is using 16KB record sizes.  That will cause delays.
>>>>    * I'm not seeing any packet loss (haven't used cloudshark before, so maybe I'm not looking carefully enough).  If there is no packet loss, and we're not bandwidth constrained, then this is just test-environment related, right?
>>>>    * Notice that the time-to-first render (which is the important metric for a page which is all images) makes the case you call the slow case faster (yes, it's 1.383s instead of 1.026s, but remember it had a 368ms handicap (1136-768ms) due to variance mentioned above)
>>>>     * there is noise in this test.  the first case sent 88KB/109pkts to 74.125.237.183 which the second test didn't send.  I believe this is the browser doing something behind your back but interfering with your test.  Not sure.
>>>>     * what is the cwnd on your server?
>>>> 
>>>> Overall, I can't explain why you're seeing a different result in this test, except to point out all the variances which make the test look suspect.
>>> 
>>> Thanks for taking a look. In the TCP dumps, look at the ACK packets during the data transfer and you'll see "no window scaling used" and an advertised window of 64 KB. This is the problem.
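As a rough illustration of why the missing window scale option matters on this path: the TCP window field is 16 bits, and the window scale option left-shifts it by up to 14 bits. The sketch below estimates the smallest shift whose maximum window covers the bandwidth-delay product. The 20 Mbps and 230 ms figures come from this thread; the function name is hypothetical.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit window field
MAX_SHIFT = 14                # largest shift the window scale option allows


def min_window_shift(bandwidth_mbps: float, rtt_ms: float) -> int:
    """Smallest window scale shift whose maximum window is at least the
    bandwidth-delay product of the path."""
    bdp_bytes = bandwidth_mbps * 1_000_000 * (rtt_ms / 1000) / 8
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= bdp_bytes:
            return shift
    raise ValueError("bandwidth-delay product exceeds the largest scaled window")


# 20 Mbps link at 230 ms RTT: about 575,000 bytes must be in flight.
print(min_window_shift(20, 230))
```

A result greater than zero means an unscaled 64 KB window cannot fill the path with a single connection.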
>>> 
>>>> 
>>>> 
>>>>  
>>>> 
>>>> SPDY/HTTP2 is supposed to be about faster. Let's encourage browser developers to make use of the new protocol to make web browsing as fast as possible -- and not limit them to one connection and therefore essentially ask them to do so with one hand tied behind their backs.
>>>> 
>>>> Again, after all the testing I did, it was clear that single connection is the best route.
>>> 
>>> I suspect this is because you were not able to keep multiple connections from being idle and they fell back to initcwnd due to slow start when idle. It's the only explanation I can think of.
>>> 
>>>> 
>>>> I didn't reply earlier about packet loss, but the packet loss simulators generally issue "random" packet loss.  This is different from "correlated" packet loss like a buffer tail drop in a router.  With the former, a single-connection protocol gets screwed, because the cwnd gets cut in half on a single connection, while a 2-connection load would only have its cwnd cut by 1/4 with a single loss.  But in *correlated* loss modeling, we see *all* connections get packet loss at the same time, because the entire buffer was lost in the router.  The net result is that if you simulate tail drops, you tend to see the cwnd collapse due to packet loss be much more similar in single and multi connection policies.
>>> 
>>> True. Sometimes packets from all connections are dropped, in which case multiple connections act like a single connection. But otherwise multiple connections are more robust to packet loss, as you point out.
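The random versus correlated loss distinction can be sketched numerically. This is a simplified model that assumes each affected connection halves its cwnd on loss, as in the description above; the cwnd values of 10 and 20 segments are arbitrary.

```python
def fraction_of_cwnd_lost(cwnds: list[float], hit: set[int]) -> float:
    """Fraction of the aggregate congestion window lost when the
    connections whose indexes are in `hit` each halve their cwnd."""
    before = sum(cwnds)
    after = sum(c / 2 if i in hit else c for i, c in enumerate(cwnds))
    return (before - after) / before


# Random loss: a single packet is dropped on one connection.
print(fraction_of_cwnd_lost([20], {0}))         # single connection
print(fraction_of_cwnd_lost([10, 10], {0}))     # two connections, one hit

# Correlated loss (tail drop): every connection loses a packet.
print(fraction_of_cwnd_lost([10, 10], {0, 1}))
```

In this model a single random loss costs the two-connection case a quarter of its aggregate window instead of half, while a tail drop that hits both connections costs the same half as the single connection case.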
>>> 
>>>> 
>>>> Mike
>>>> 
>>>> 
>>>>  
>>>> 
>>>> Thanks,
>>>> 
>>>> Peter
>>>> 
>>>> 
>>>> On Jun 25, 2014, at 5:27 PM, bizzbyster@gmail.com wrote:
>>>> 
>>>>> Responses inline.
>>>>> 
>>>>> On Jun 25, 2014, at 11:47 AM, Mike Belshe <mike@belshe.com> wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jun 25, 2014 at 7:56 AM, <bizzbyster@gmail.com> wrote:
>>>>>> Thanks for all the feedback. I'm going to try to reply to Mike, Greg, Willy, and Guille in one post since a few of you made the same or similar points. My apologies in advance for the very long post.
>>>>>> 
>>>>>> First, you should understand that I am building a browser and web server that use the feedback loop described here (http://caffeinatetheweb.com/baking-acceleration-into-the-web-itself/) to provide the browser with a set of hints inserted into the html that allow it to load the page much faster. I prefer subresource hints to server push because A) it works in coordination with the browser cache state and B) hints can be supplied for resources found on third party domains. But my hints also go beyond just supplying the browser with a list of URLs to fetch: http://lists.w3.org/Archives/Public/public-web-perf/2014Jun/0044.html. They also include an estimate of the size of the objects for instance.
>>>>>> 
>>>>>> Okay so this is all relevant because it means that I often know the large number of objects (sometimes 50+) I need to fetch from a given server up front and therefore have to figure out the optimal way to retrieve these objects. Unlike Mike's tests, my tests have shown that a pool with multiple connections is faster than a single one, perhaps because my server hints allow me to know about a much larger number of URLs up front and because I often have expected object sizes.
>>>>>> 
>>>>>> With appropriate hand-crafted and customized server hints, I'm not surprised that you can outpace a single connection in some scenarios.
>>>>>> 
>>>>>> But, the answer to "which is faster" will not be a boolean yes/no - you have to look at all sorts of network conditions, including link speed, RTT, and packet loss. 
>>>>>> 
>>>>>> The way I chose how to optimize was based on studying how networks are evolving over time:
>>>>>>    a) we know that bandwidth is going up fairly rapidly to end users
>>>>>>    b) we know that RTT is not changing much, and in some links going up
>>>>>>    c) packet loss is getting better, but is very difficult to pin down & even harder to appropriately model (tail drops vs random, etc)
>>>>>> 
>>>>>> So I chose to optimize assuming that (b) will continue to hold true.  In your tests, you should try jacking the RTT up to 100ms+.  Average RTT to Google is ~100ms (my data slightly old here).  If you're on a super fast link, then sure, initcwnd will be your bottleneck, because RTT is not a factor.  What RTTs did you simulate?  I'm guessing you were using a high-speed local network?
>>>>> 
>>>>> We are testing at many different latencies (including satellite) but also the benefit of multiple connections actually increases as RTT increases because connections are in a slow start bottlenecked state for longer for higher latency links. After 4 RTTs a single connection will be able to transfer roughly initcwnd*(2)^4 in the next round trip whereas 3 connections will be able to transfer 3*initcwnd*(2)^4, assuming we can keep all three connections full and we are not bandwidth limited. 
>>>>> 
>>>>> In addition, multiple connections are more robust to packet loss. If we drop a packet after 4 RTTs, the overall throughput for the single connection case drops back down to its initial rate before exponentially growing again whereas only one connection in the 3 connection case will fall back to initcwnd. 
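A small sketch of the two claims above, under idealized assumptions: every connection doubles its cwnd each round trip, nothing is bandwidth limited, and a connection that loses a packet falls all the way back to initcwnd. The initcwnd of 10 segments is an example value.

```python
def next_flight_segments(initcwnd: int, rtts_elapsed: int,
                         connections: int = 1,
                         reset_connections: int = 0) -> int:
    """Segments that can be sent in the next round trip.

    `reset_connections` of the connections have just lost a packet and
    restarted from initcwnd; the rest have doubled every round trip.
    """
    grown = initcwnd * 2 ** rtts_elapsed
    healthy = connections - reset_connections
    return healthy * grown + reset_connections * initcwnd


INITCWND = 10
print(next_flight_segments(INITCWND, 4))                    # one connection
print(next_flight_segments(INITCWND, 4, connections=3))     # three connections
print(next_flight_segments(INITCWND, 4, reset_connections=1))
print(next_flight_segments(INITCWND, 4, connections=3, reset_connections=1))
```

After a loss at round trip 4 the single connection is back to 10 segments, while the three-connection pool in this model still has 330 of its 480.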
>>>>>> 
>>>>>> Overall, the single connection has some drawbacks on some networks, but by and large it works better while also providing real server efficiencies and finally giving the transport the opportunity to do its job better.  When we split onto zillions of connections, we basically sidestep all of the transport layer's goodness.  (This is a complex topic too, however).
>>>>>>  
>>>>>> If I need to fetch 6 small objects (each the size of a single full packet) from a server that has an initcwnd of 3,
>>>>>> 
>>>>>> Why use a server with cwnd of 3?  Default linux distros ship with 10 today (and have done so for like 2 years).
>>>>> 
>>>>> Understood. 3 is the value mentioned in Ops Guide so I just used that for my example. But the argument applies to 10 as well. The main thing is that if we know the size of the file we can do really clever things to make connection pools download objects really fast.
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  
>>>>>> I can request 3 objects on each of two connections and download those objects in a single round trip. This is not a theoretical idea -- I have tested this and I get the expected performance. In general, a pool of cold HTTP/2 connections is much faster than a single cold connection for fetching a large number of small objects, especially when you know the size up front. I will share the data and demo as soon as I'm able to.
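The six-object example can be checked with a short sketch. It assumes single-packet objects spread evenly over cold connections, cwnd doubling every round trip, and no loss; it counts only response round trips, not connection setup.

```python
import math


def response_round_trips(objects: int, initcwnd: int, connections: int) -> int:
    """Round trips needed to deliver `objects` single-packet objects
    when they are split evenly across cold connections."""
    per_connection = math.ceil(objects / connections)
    trips, cwnd, delivered = 0, initcwnd, 0
    while delivered < per_connection:
        delivered += cwnd   # one full window per round trip
        cwnd *= 2           # idealized slow start
        trips += 1
    return trips


print(response_round_trips(6, initcwnd=3, connections=1))
print(response_round_trips(6, initcwnd=3, connections=2))
print(response_round_trips(50, initcwnd=10, connections=1))
print(response_round_trips(50, initcwnd=10, connections=3))
```

The last two calls apply the same model to 50 objects with an initcwnd of 10, the default value mentioned earlier in the thread.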
>>>>>> 
>>>>>> Many have tested this heavily too, so I believe your results.  My own test data fed into https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
>>>>>> 
>>>>>> 
>>>>>> Since I know that multiple connections are faster, I can imagine a solution that web performance optimizers will resort to if browsers only support one connection per host: domain sharding! Let's avoid this by removing the SHOULD NOT from the spec.
>>>>>> 
>>>>>> I think we need to get into the nitty gritty benchmarking details if we want to claim that a single connection is faster.  I highly doubt this is true for all network types.
>>>>> 
>>>>> For non-bandwidth limited links, the improvement for multiple connections should increase with increasing latency and increasing packet loss. So I have trouble understanding how a single connection could ever win. I wonder if in your tests you were not able to keep your multiple connections full b/c that could result in slow start on idle kicking in for those connections. Is that possible?
>>>>> 
>>>>>> 
>>>>>>  
>>>>>> 
>>>>>> "Servers must keep open idle connections, making load balancing more complex and creating DOS vulnerability." A few of you pointed out that the server can close them. That's true. I should not have said "must". But Mark's Ops Guide suggests that browsers will aggressively keep open idle connections for performance reasons, and that servers should support this by not closing these connections. And also servers should keep those connections fast by disabling slow start after idle. In my opinion, browsers should keep connections open only as long as they have the expectation of imminent requests to issue on those connections, which is essentially the way that mainstream browsers handle connection lifetimes for HTTP/1.1 connections today. We should not create an incentive for browsers to hold on to connections for longer than this and to encourage servers to support longer lived idle connections than they already do today.
>>>>>> 
>>>>>> What we really need is just a better transport.  We should have 'forever connections'.  The idea that the endpoints need to maintain state to keep connections open is so silly; session resumption without a round-trip is very doable.   I believe QUIC does this :-)
>>>>> Agreed -- I can't wait to see QUIC in action.
>>>>> 
>>>>>>  
>>>>>> 
>>>>>> Some of you pointed out that a single connection allows us to get back to fair congestion control. But TCP slow start and congestion control are designed for transferring large objects. They unfairly penalize applications that need to fetch a large number of small objects. Are we overflowing router buffers today b/c we are using 6 connections per host? I agree that reducing that number is a good thing, which HTTP/2 will naturally enable. But I don't see any reason to throttle web browsers down to a single slow started connection. Also again, web browser and site developers will work around this artificial limit. In the future we will see 50+ Mbps last mile networks as the norm. This makes extremely fast page load times possible, if only we can mitigate the impact of latency by increasing the concurrency of object requests. I realize that QUIC may eventually solve this issue but in the meantime we need to be able to use multiple TCP connections to squeeze the most performance out of today's web.
>>>>>> 
>>>>>> A tremendous amount of research has gone into this, and you're asking good questions to which nobody knows the exact answers.  It's not that we don't know the answers for lack of trying - it's because there are so many combinations of network equipment, speeds, configs, etc, in the real world that all real world data is a mix of errors.  Given the research that has gone into it, I wouldn't expect these answers to come crisply or quickly.
>>>>>> 
>>>>>> I agree we'll have 50+Mbps in the not-distant future.  But so far, there is no evidence that we're figuring out how to bring RTT's down.  Hence, more bandwidth doesn't matter much:  https://docs.google.com/a/chromium.org/viewer?a=v&pid=sites&srcid=Y2hyb21pdW0ub3JnfGRldnxneDoxMzcyOWI1N2I4YzI3NzE2
>>>>> 
>>>>> If we can increase concurrency, via dynamically generated server hints, then we can start seeing increases in bandwidth show up as improvements in page load time again -- we can make bandwidth matter again. Server hints also allow us to keep our multiple connections full so we can avoid the slow start on idle issue without requiring a config change on servers.
>>>>> 
>>>>>> 
>>>>>> Mike
>>>>>> 
>>>>>> 
>>>>>>  
>>>>>> 
>>>>>> Thanks for reading through all this,
>>>>>> 
>>>>>> Peter
>>>>>> 
>>>>>> On Jun 24, 2014, at 3:55 PM, Mike Belshe <mike@belshe.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jun 24, 2014 at 10:50 AM, <bizzbyster@gmail.com> wrote:
>>>>>>> I've raised this issue before on the list but it's been a while and reading Mark's ops guide doc (https://github.com/http2/http2-spec/wiki/Ops) I'm reminded that requiring the use of a single connection for HTTP/2 ("Clients SHOULD NOT open more than one HTTP/2 connection") still makes no sense to me. Due to multiplexing, HTTP/2 will naturally use FEWER connections than HTTP/1, which is a good thing, but requiring a single connection has the following drawbacks:
>>>>>>> 
>>>>>>> Servers must keep open idle connections, making load balancing more complex and creating DOS vulnerability.
>>>>>>> 
>>>>>>> As others have mentioned, you don't have to do this. 
>>>>>>> Servers must turn off tcp_slow_start_after_idle in order for browsers to get good performance, again creating DOS vulnerability.
>>>>>>> You also don't have to do this; it will drop back to init cwnd levels if you do, just as though you had opened a fresh connection.
>>>>>>>  
>>>>>>> The number of simultaneous GET requests I'm able to upload in the first round trip is limited to the compressed amount that can fit in a single initcwnd. Yes compression helps with this but if I use multiple connections I will get the benefit of compression for the requests on the same connection, in addition to having multiple initcwnds!
>>>>>>> It turns out that a larger initcwnd just works better anyway - there was a tremendous amount of evidence supporting going up to 10, and that was accepted at the transport level already.
>>>>>>>  
>>>>>>> The amount of data I'm able to download in the first round trip is limited to the amount that can fit in a single initcwnd.
>>>>>>> It turns out the browser doesn't really know how many connections to open until that first resource is downloaded anyway.  Many out-of-band tricks exist.
>>>>>>>  
>>>>>>> Head of line blocking is exacerbated by putting all objects on a single connection.
>>>>>>> Yeah, this is true.  But overall, it's still faster and more efficient.
>>>>>>> 
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> Multiple short-lived HTTP/2 connections give us all the performance benefits of multiplexing without any of the operational or performance drawbacks. As a proxy and a browser implementor, I plan to use multiple HTTP/2 connections when talking to HTTP/2 servers because it seems like the right thing to do from a performance, security, and operational perspective.
>>>>>>> 
>>>>>>> When I tested the multi-connection scenarios they were all slower for me.  In cases of severe packet loss, it was difficult to discern as expected.  But overall, the reduced server resource use and the efficiency outweighed the negatives.
>>>>>>> 
>>>>>>> Mike
>>>>>>> 
>>>>>>>  
>>>>>>> 
>>>>>>> I know it's very late to ask this but can we remove the "SHOULD NOT" statement from the spec? Or, maybe soften it a little for those of us who cannot understand why it's there?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Peter
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>
Received on Tuesday, 1 July 2014 20:56:50 UTC