Re: why not multiple, short-lived HTTP/2 connections?

On Mon, Jun 30, 2014 at 11:58 AM, <bizzbyster@gmail.com> wrote:

> I created this page to simulate the effect of multiple HTTP/2 connections
> that are never idle. It was just synthesized to prove a point.
>

And my point is that it doesn't mirror real web pages :)


> "poor prioritization, poor header compression, bypassing slow start
> (sometimes massively so), increased buffer bloat (you're just going to hurt
> the responsiveness of all the other real-time applications on the shared
> links), increased server costs, etc etc."
>
> This is a long list, but if each item is examined closely I can make a case
> against it, or at least question its importance.
>
> Poor prioritization: the HOL blocking of a single connection means that
> the good prioritization decision made by the server doing the sending is
> defeated by the poor prioritization in the network, which will always
> deliver pieces of objects in the order they are sent, regardless of the
> priority of the object.
>

"Defeats" is the wrong way to characterize it. HOL blocking doesn't affect
the prioritization. The priority ordering remains the same. HOL blocking
"just" stalls the entire connection :P And putting more things on a single
connection that can get stalled increases fragility due to fate sharing;
I'll totally grant you that. Putting things on separate connections, on the
other hand, definitely *does* defeat prioritization. TCP congestion control
tries to be "fair" between connections, whereas we want to prioritize
instead of being fair. There's definitely a tradeoff here. But saying
prioritization is "defeated" is wrong.

And take a look at how huge this prioritization win can be:
https://plus.google.com/+ShubhiePanicker/posts/Uw87yxQFCfY. This is
something that doesn't get captured in a PLT metric, but in
application-specific user experience metrics the win can be enormous.
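
For intuition, here's a rough back-of-envelope sketch of that tradeoff (my
own illustrative numbers, not measurements from any of these tests): a
single high-priority resource finishes much later when it has to share the
link "fairly" with other transfers than when one connection sends it first.

def finish_time_ms(size_kb, link_mbps, share):
    # Time to deliver size_kb when the resource gets `share` of the link.
    return (size_kb * 8) / (link_mbps * 1000 * share) * 1000

critical_kb, link_mbps = 50, 5.0  # assumed: a 50 KB blocking script on a 5 Mbps link
print(finish_time_ms(critical_kb, link_mbps, share=1.0))      # sent first: ~80 ms
print(finish_time_ms(critical_kb, link_mbps, share=1.0 / 6))  # fair share of 6 connections: ~480 ms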


> Poor header compression: the most important aspect of header compression
> from a PLT perspective as I understand it is the ability to issue a large
> number of GETs in the first round trip out of a new HTTP/2 connection. But
> if I open a second connection, I get 2x the initcwnd in the upload
> direction. So even though my overall compression may suffer, more
> connections enable me to offset the compression loss with a larger effective
> initcwnd.
>

I think this is a fair point. But the bytes saved here are important too.
Client upstream links are often more congested, and sending more bytes is
suboptimal from a congestion and bufferbloat perspective, not to mention
mobile metered data.
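
To put rough numbers on that tradeoff (these are my own assumed header
sizes and initcwnd, not figures from this thread), compare how many GETs
fit in the first upstream flight:

MSS = 1460                  # assumed bytes per segment
INITCWND = 10               # assumed segments
flight = MSS * INITCWND     # bytes sendable before the first upstream ACK

raw_hdr = 500               # assumed uncompressed request header size
warm_hdr = 60               # assumed size once the compression context is warm

no_compression = flight // raw_hdr
one_conn_compressed = 1 + (flight - raw_hdr) // warm_hdr
two_conn_compressed = 2 * (1 + (flight - raw_hdr) // warm_hdr)

print(no_compression, one_conn_compressed, two_conn_compressed)  # ~29, ~236, ~472

A second connection does double that first flight, but good compression on
one connection already covers far more requests than most pages issue, and
the doubled header bytes still cost something on a congested upstream.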


> Bypassing slow start: for high latency connections, slow start is too slow
> for tiny web files. Why else are Google servers caching CWNDs to start at
> 30+ as you note in your blog? Another way to achieve this is to open
> multiple connections.
>

Caching CWNDs is a more dynamic approach that tries to guess at a better
starting value for slow start. Having the client simply open up multiple
connections in order to hog more of the available bandwidth is less
friendly to the ecosystem.
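
A rough sketch of the difference (numbers assumed for illustration): both
approaches can yield a similar first flight, but the cached value is an
estimate of what the path actually sustained, while N connections just
multiply a static default regardless of the path.

MSS = 1460                       # assumed segment size in bytes
cached_cwnd = 30                 # assumed cached per-client cwnd, in segments
n_conns, initcwnd = 3, 10        # assumed client-side workaround

print(cached_cwnd * MSS)         # ~43.8 KB in the first RTT on one connection
print(n_conns * initcwnd * MSS)  # ~43.8 KB in the first RTT across three connections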


> Buffer bloat: HTTP/2 will cause even performance-optimized browsers to use
> fewer connections, so this issue will be reduced with HTTP/2. I admit I
> don't understand this issue well enough to comment on it further.
>
> Increased server costs: This is a difficult thing to argue, but hardware
> costs are always decreasing and improved PLT can arguably offset these
> increased costs. Again, HTTP/2 multiplexing definitely should result in
> FEWER connections. Just not one if we are trying to optimize PLT.
>

>

> Peter
>
> On Jun 30, 2014, at 2:22 PM, William Chan (陈智昌) <willchan@chromium.org>
> wrote:
>
> Peter, this test page is too synthetic. It has no scripts or stylesheets.
> It doesn't have resource discovery chains (e.g. script loading other
> resources). This test is purely about PLT, which is well documented now to
> be a suboptimal metric (
> http://www.stevesouders.com/blog/2013/05/13/moving-beyond-window-onload/).
> Prioritization is useless if you only choose resources that all receive the
> same priority.
>
> I think you're right that if all you're measuring is PLT, and you have
> significant transport bottlenecks, SPDY & HTTP/2 can be slower.
> Opening multiple connections has its own set of significant issues though:
> poor prioritization, poor header compression, bypassing slow start
> (sometimes massively so), increased buffer bloat (you're just going to hurt
> the responsiveness of all the other real-time applications on the shared
> links), increased server costs, etc etc.
>
> Pushing for a single connection is the _right_ thing to do in so many
> ways. There might be specific exceptional cases where a single TCP
> connection is still slower. We should fix TCP, and more generally the
> transport, so these issues mostly go away.
>
> On Mon, Jun 30, 2014 at 10:55 AM, <bizzbyster@gmail.com> wrote:
>
>> Comments inline.
>>
>> On Jun 30, 2014, at 1:00 PM, Mike Belshe <mike@belshe.com> wrote:
>>
>>
>>
>>
>> On Mon, Jun 30, 2014 at 9:04 AM, <bizzbyster@gmail.com> wrote:
>>
>>> All,
>>>
>>> Another huge issue is that for some reason I still see many TCP
>>> connections that do not advertise support for window scaling in the SYN
>>> packet. I'm really not sure why this is, but, for instance, WPT test instances
>>> are running Windows 7 and yet do not advertise window scaling, so
>>> TCP connections max out at a send window of 64 KB. I've seen this in tests
>>> run out of multiple different WPT test locations.
>>>
>>> The impact of this is that high latency connections max out at very low
>>> throughputs. Here's an example (with tcpdump output so you can examine the
>>> TCP flow on the wire) where I download data from a SPDY-enabled web server
>>> in Virginia from a WPT test instance running in Sydney:
>>> http://www.webpagetest.org/result/140629_XG_1JC/1/details/. Average
>>> throughput is not even 3 Mbps despite the fact that I chose a 20 Mbps FIOS
>>> connection for my test. Note that when I disable SPDY on this web server, I
>>> render the page almost twice as fast because I am using multiple
>>> connections and therefore overcoming the per connection throughput
>>> limitation: http://www.webpagetest.org/result/140629_YB_1K5/1/details/.
>>>
>>> I don't know the root cause (Windows 7 definitely sends the window scaling
>>> option in the SYN in other tests) and have sent a note to the
>>> webpagetest.org admin, but in general there are reasons why even Windows 7
>>> machines sometimes appear not to use window scaling, causing
>>> single-connection SPDY to perform really badly even beyond the slow-start phase.
>>>
>>
>> I believe your test is invalid.
>>
>> The time-to-first-byte (single connection) in the first case is 1136ms.
>>  The time-to-first-byte in the second case is 768ms.  They should have been identical, right?
>>  What this really means is that you're testing over the public net, and
>> your variance is somewhere in the 50% range.  The key to successful
>> benchmarking is eliminating variance and *lots* of iterations.
>>
>>
>> I've re-run this 5 times with SPDY on and a few times with SPDY off and
>> the result is always about 2x slower with SPDY.  Here are the 5 with SPDY
>> on:
>>
>> http://www.webpagetest.org/result/140629_XG_1JC/
>> http://www.webpagetest.org/result/140630_54_T4T/
>> http://www.webpagetest.org/result/140630_B0_T9N/
>> http://www.webpagetest.org/result/140630_97_TAN/
>> http://www.webpagetest.org/result/140630_XM_TCF/
>>
>> The reason is that no window scaling is happening on the TCP
>> connection, so the SPDY case can only run at 64 KB / 230 ms, or about 2.2
>> Mbps. In the non-SPDY case I download the 6 images over 6 connections, each
>> operating at 2.2 Mbps, so max throughput is 6 connections x 2.2 Mbps, which
>> is about 13 Mbps. The webpagetest throughput curves illustrate this
>> difference.
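>>
>> As a sanity check, here's the window-limited arithmetic as a quick sketch
>> (the ~230 ms RTT comes from the trace; the rest is just the relation
>> throughput <= rwnd / RTT when window scaling is absent):
>>
>> rwnd = 64 * 1024              # bytes: the largest window advertisable without scaling
>> rtt = 0.230                   # seconds, Sydney to Virginia (from the trace)
>> per_conn_mbps = rwnd * 8 / rtt / 1e6
>> print(per_conn_mbps)          # ~2.3 Mbps per connection
>> print(6 * per_conn_mbps)      # ~13.7 Mbps across 6 HTTP/1.1 connections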
>>
>>
>>
>> I'm not saying you aren't seeing a trend.  You might be, but this test
>> doesn't show it.  But variance is exacerbating the issue, the benchmark
>> transport is suspect (for reasons you've already identified), and I further
>> suspect that you're using a funky server transport connection (what is your
>> initial CWND?  what is your initial server recv buffer?)
>>
>> I also looked at the TCP dumps.  Here are some random notes:
>>    * Looks like your SSL is using 16KB record sizes.  That will cause
>> delays.
>>    * I'm not seeing any packet loss (haven't used cloudshark before, so
>> maybe I'm not looking carefully enough).  If there is no packet loss, and
>> we're not bandwidth constrained, then this is just test-environment
>> related, right?
>>    * Notice that the time-to-first-render (which is the important metric
>> for a page that is all images) actually favors the case you call the slow
>> case (yes, it's 1.383s instead of 1.026s, but remember it had a 368ms
>> handicap, 1136ms vs 768ms, due to the variance mentioned above).
>>     * There is noise in this test.  The first case sent 88KB/109pkts to 74.125.237.183
>> which the second test didn't send.  I believe this is the browser doing
>> something behind your back but interfering with your test.  Not sure.
>>     * what is the cwnd on your server?
>>
>> Overall, I can't explain why you're seeing a different result in this
>> test, except to point out all the variances which make the test look
>> suspect.
>>
>>
>> Thanks for taking a look. In the TCP dumps, look at the Ack packets
>> during the data transfer and you'll see "no window scaling used" and an
>> advertised window of 64 KB. This is the problem.
>>
>>
>>
>>
>>
>>>
>>> SPDY/HTTP2 is supposed to be about faster. Let's encourage browser
>>> developers to make use of the new protocol to make web browsing as fast as
>>> possible -- and not limit them to one connection and therefore essentially
>>> ask them to do so with one hand tied behind their backs.
>>>
>>
>> Again, after all the testing I did, it was clear that a single connection
>> is the best route.
>>
>> I suspect this is because you were not able to keep multiple connections
>> from being idle and they fell back to initcwnd due to slow start when idle.
>> It's the only explanation I can think of.
>>
>>
>> I didn't reply earlier about packet loss, but the packet loss simulators
>> generally issue "random" packet loss.  This is different from "correlated"
>> packet loss like a buffer tail drop in a router.  With the former, a
>> single-connection protocol gets screwed, because the cwnd gets cut in half
>> on a single connection, while a 2-connection load would only have its cwnd
>> cut by 1/4 with a single loss.  But in *correlated* loss modeling, we see
>> *all* connections get packet loss at the same time, because the entire
>> buffer was lost in the router.  The net result is that if you simulate tail
>> drops, you tend to see that the cwnd collapse due to packet loss is much
>> more similar in single- and multi-connection policies.
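>>
>> A toy model of that difference (assumptions, not simulation output): a
>> random single drop halves one connection's cwnd, while a correlated tail
>> drop halves every connection's cwnd, so the aggregate hit converges.
>>
>> def aggregate_after_loss(n_conns, cwnd, correlated):
>>     hit = n_conns if correlated else 1
>>     return (n_conns - hit) * cwnd + hit * (cwnd / 2)
>>
>> print(aggregate_after_loss(1, 16, correlated=False))  # 8.0: single connection loses 50%
>> print(aggregate_after_loss(2, 16, correlated=False))  # 24.0: two connections, random drop, -25%
>> print(aggregate_after_loss(2, 16, correlated=True))   # 16.0: two connections, tail drop, -50%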
>>
>> True. Sometimes packets are dropped from all connections at once, in which
>> case multiple connections act like a single connection. But otherwise multiple
>> connections are more robust to packet loss, as you point out.
>>
>>
>> Mike
>>
>>
>>
>>
>>>
>>> Thanks,
>>>
>>> Peter
>>>
>>>
>>> On Jun 25, 2014, at 5:27 PM, bizzbyster@gmail.com wrote:
>>>
>>> Responses inline.
>>>
>>> On Jun 25, 2014, at 11:47 AM, Mike Belshe <mike@belshe.com> wrote:
>>>
>>>
>>>
>>>
>>> On Wed, Jun 25, 2014 at 7:56 AM, <bizzbyster@gmail.com> wrote:
>>>
>>>> Thanks for all the feedback. I'm going to try to reply to Mike, Greg,
>>>> Willy, and Guille in one post since a few of you made the same or similar
>>>> points. My apologies in advance for the very long post.
>>>>
>>>> First, you should understand that I am building a browser and web
>>>> server that use the feedback loop described here (
>>>> http://caffeinatetheweb.com/baking-acceleration-into-the-web-itself/)
>>>> to provide the browser with a set of hints inserted into the html that
>>>> allow it to load the page much faster. I prefer subresource hints to server
>>>> push because A) it works in coordination with the browser cache state and
>>>> B) hints can be supplied for resources found on third party domains. But my
>>>> hints also go beyond just supplying the browser with a list of URLs to
>>>> fetch:
>>>> http://lists.w3.org/Archives/Public/public-web-perf/2014Jun/0044.html.
>>>> They also include an estimate of the size of the objects for instance.
>>>>
>>>> Okay so this is all relevant because it means that I often know the
>>>> large number of objects (sometimes 50+) I need to fetch from a given server
>>>> up front and therefore have to figure out the optimal way to retrieve these
>>>> objects. Unlike Mike's tests, my tests have shown that a pool with multiple
>>>> connections is faster than a single one, perhaps because my server hints
>>>> allow me to know about a much larger number of URLs up front and because I
>>>> often have expected object sizes.
>>>>
>>>
>>> With appropriate hand-crafted and customized server hints, I'm not
>>> surprised that you can outpace a single connection in some scenarios.
>>>
>>> But, the answer to "which is faster" will not be a boolean yes/no - you
>>> have to look at all sorts of network conditions, including link speed, RTT,
>>> and packet loss.
>>>
>>> The way I chose how to optimize was based on studying how networks are
>>> evolving over time:
>>>    a) we know that bandwidth is going up fairly rapidly to end users
>>>    b) we know that RTT is not changing much, and in some links going up
>>>    c) packet loss is getting better, but is very difficult to pin down &
>>> even harder to appropriately model (tail drops vs random, etc)
>>>
>>> So I chose to optimize assuming that (b) will continue to hold true.  In
>>> your tests, you should try jacking the RTT up to 100ms+.  Average RTT to
>>> Google is ~100ms (my data slightly old here).  If you're on a super fast
>>> link, then sure, initcwnd will be your bottleneck, because RTT is not a
>>> factor.  What RTTs did you simulate?  I'm guessing you were using a
>>> high-speed local network?
>>>
>>>
>>> We are testing at many different latencies (including satellite), but the
>>> benefit of multiple connections actually increases as RTT increases,
>>> because connections stay in a slow-start-bottlenecked state for longer on
>>> higher-latency links. After 4 RTTs a single connection will be able to
>>> transfer roughly initcwnd*(2)^4 in the next round trip, whereas 3
>>> connections will be able to transfer 3*initcwnd*(2)^4, assuming we can keep
>>> all three connections full and we are not bandwidth limited.
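>>>
>>> A quick sketch of that arithmetic (idealized slow start, no loss, every
>>> connection kept full -- all assumptions):
>>>
>>> def segments_in_next_rtt(initcwnd, rtts_elapsed, n_conns=1):
>>>     # cwnd doubles each round trip during slow start
>>>     return n_conns * initcwnd * 2 ** rtts_elapsed
>>>
>>> print(segments_in_next_rtt(10, 4))             # 160 segments on one connection
>>> print(segments_in_next_rtt(10, 4, n_conns=3))  # 480 segments across three connections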
>>>
>>> In addition, multiple connections are more robust to packet loss. If we
>>> drop a packet after 4 RTTs, the overall throughput for the single-connection
>>> case drops back down to its initial rate before exponentially growing again,
>>> whereas only one connection in the 3-connection case will fall back to
>>> initcwnd.
>>>
>>>
>>> Overall, the single connection has some drawbacks on some networks, but
>>> by and large it works better while also providing real server efficiencies
>>> and finally giving the transport the opportunity to do its job better.
>>>  When we split onto zillions of connections, we basically sidestep all of
>>> the transport layer's goodness.  (This is a complex topic too, however).
>>>
>>>
>>>> If I need to fetch 6 small objects (each the size of a single full
>>>> packet) from a server that has an initcwnd of 3,
>>>>
>>>
>>> Why use a server with cwnd of 3?  Default Linux distros ship with 10
>>> today (and have done so for like 2 years).
>>>
>>>
>>> Understood. 3 is the value mentioned in the Ops Guide so I just used that
>>> for my example. But the argument applies to 10 as well. The main thing is
>>> that if we know the size of the file we can do really clever things to make
>>> connection pools download objects really fast.
>>>
>>>
>>>
>>>
>>>
>>>>  I can request 3 objects on each of two connections and download those
>>>> objects in a single round trip. This is not a theoretical idea -- I have
>>>> tested this and I get the expected performance. In general, a pool of cold
>>>> HTTP/2 connections is much faster than a single cold connection for
>>>> fetching a large number of small objects, especially when you know the size
>>>> up front. I will share the data and demo as soon as I'm able to.
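>>>>
>>>> A minimal sketch of that scheduling idea (object sizes known from hints;
>>>> the initcwnd and object counts are just the example's assumptions):
>>>>
>>>> import math
>>>>
>>>> def first_flight_rtts(n_objects, initcwnd, n_conns):
>>>>     # each object fits in one segment, so the pool can return
>>>>     # n_conns * initcwnd objects per round trip (ignoring cwnd growth,
>>>>     # which doesn't change these two cases)
>>>>     return math.ceil(n_objects / (n_conns * initcwnd))
>>>>
>>>> print(first_flight_rtts(6, initcwnd=3, n_conns=1))  # 2 round trips on one cold connection
>>>> print(first_flight_rtts(6, initcwnd=3, n_conns=2))  # 1 round trip on two cold connections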
>>>>
>>>
>>> Many have tested this heavily too, so I believe your results.  My own
>>> test data fed into
>>> https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
>>>
>>>
>>>> Since I know that multiple connections is faster, I can imagine a
>>>> solution that web performance optimizers will resort to if browsers only
>>>> support one connection per host: domain sharding! Let's avoid this by
>>>> removing the SHOULD NOT from the spec.
>>>>
>>>
>>> I think we need to get into the nitty gritty benchmarking details if we
>>> want to claim that a single connection is faster.  I highly doubt this is
>>> true for all network types.
>>>
>>>
>>> For non-bandwidth-limited links, the improvement from multiple
>>> connections should increase with increasing latency and increasing packet
>>> loss. So I have trouble understanding how a single connection could ever
>>> win. I wonder if in your tests you were not able to keep your multiple
>>> connections full, because that could result in slow start on idle kicking in
>>> for those connections. Is that possible?
>>>
>>>
>>>
>>>
>>>>
>>>> "Servers must keep open idle connections, making load balancing more
>>>> complex and creating DOS vulnerability." A few of you pointed out that the
>>>> server can close them. That's true. I should not have said "must". But
>>>> Mark's Ops Guide suggests that browsers will aggressively keep open idle
>>>> connections for performance reasons, and that servers should support this
>>>> by not closing these connections. And also servers should keep those
>>>> connections fast by disabling slow start after idle. In my opinion,
>>>> browsers should keep connections open only as long as they have the
>>>> expectation of imminent requests to issue on those connections, which is
>>>> essentially the way that mainstream browsers handle connection lifetimes
>>>> for HTTP/1.1 connections today. We should not create an incentive for
>>>> browsers to hold on to connections for longer than this and to encourage
>>>> servers to support longer lived idle connections than they already do today.
>>>>
>>>
>>> What we really need is just a better transport.  We should have 'forever
>>> connections'.  The idea that the endpoints need to maintain state to keep
>>> connections open is so silly; session resumption without a round-trip is
>>> very doable.   I believe QUIC does this :-)
>>>
>>> Agreed -- I can't wait to see QUIC in action.
>>>
>>>
>>>
>>>>
>>>> Some of you pointed out that a single connection allows us to get back
>>>> to fair congestion control. But TCP slow start and congestion control are
>>>> designed for transferring large objects. They unfairly penalize
>>>> applications that need to fetch a large number of small objects. Are we
>>>> overflowing router buffers today because we are using 6 connections per host? I
>>>> agree that reducing that number is a good thing, which HTTP/2 will
>>>> naturally enable. But I don't see any reason to throttle web browsers down
>>>> to a single slow started connection. Also again, web browser and site
>>>> developers will work around this artificial limit. In the future we will
>>>> see 50+ Mbps last mile networks as the norm. This makes extremely fast page
>>>> load times possible, if only we can mitigate the impact of latency by
>>>> increasing the concurrency of object requests. I realize that QUIC may
>>>> eventually solve this issue but in the meantime we need to be able to use
>>>> multiple TCP connections to squeeze the most performance out of today's web.
>>>>
>>>
>>> A tremendous amount of research has gone into this, and you're asking
>>> good questions to which nobody knows the exact answers.  It's not that we
>>> don't know the answers for lack of trying; it's that there are so many
>>> combinations of network equipment, speeds, configs, etc, in the real world
>>> that all real-world data is a mix of errors.  Given the research that has
>>> gone into it, I wouldn't expect these answers to come crisply or quickly.
>>>
>>> I agree we'll have 50+Mbps in the not-distant future.  But so far, there
>>> is no evidence that we're figuring out how to bring RTT's down.  Hence,
>>> more bandwidth doesn't matter much:
>>> https://docs.google.com/a/chromium.org/viewer?a=v&pid=sites&srcid=Y2hyb21pdW0ub3JnfGRldnxneDoxMzcyOWI1N2I4YzI3NzE2
>>>
>>>
>>> If we can increase concurrency, via dynamically generated server hints,
>>> then we can start seeing increases in bandwidth show up as improvements in
>>> page load time again -- we can make bandwidth matter again. Server hints
>>> also allows us to keep our multiple connections full so we can avoid the
>>> slow start on idle issue without requiring a config change on servers.
>>>
>>>
>>> Mike
>>>
>>>
>>>
>>>
>>>>
>>>> Thanks for reading through all this,
>>>>
>>>> Peter
>>>>
>>>> On Jun 24, 2014, at 3:55 PM, Mike Belshe <mike@belshe.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 10:50 AM, <bizzbyster@gmail.com> wrote:
>>>>
>>>>> I've raised this issue before on the list but it's been a while and
>>>>> reading Mark's ops guide doc (
>>>>> https://github.com/http2/http2-spec/wiki/Ops) I'm reminded that
>>>>> requiring the use of a single connection for HTTP/2 ("Clients SHOULD
>>>>> NOT open more than one HTTP/2 connection") still makes no sense to me. Due
>>>>> to multiplexing, HTTP/2 will naturally use FEWER connections than HTTP/1,
>>>>> which is a good thing, but requiring a single connection has the following
>>>>> drawbacks:
>>>>>
>>>>>
>>>>>    1. Servers must keep open idle connections, making load balancing
>>>>>    more complex and creating a DOS vulnerability.
>>>>>
>>>>>
>>>> As others have mentioned, you don't have to do this.
>>>>
>>>>>
>>>>>    2. Servers must turn off *tcp_slow_start_after_idle* in order for
>>>>>    browsers to get good performance, again creating a DOS vulnerability.
>>>>>
>>>> You also don't have to do this; if you leave slow start after idle enabled,
>>>> the connection just drops back to init cwnd levels, as though you had opened
>>>> a fresh connection.
>>>>
>>>>
>>>>>
>>>>>    3. The number of simultaneous GET requests I'm able to upload in
>>>>>    the first round trip is limited to the compressed amount that can fit in a
>>>>>    single initcwnd. Yes compression helps with this but if I use multiple
>>>>>    connections I will get the benefit of compression for the requests on the
>>>>>    same connection, in addition to having multiple initcwnds!
>>>>>
>>>> It turns out that a larger initcwnd just works better anyway - there
>>>> was a tremendous amount of evidence supporting going up to 10, and that was
>>>> accepted at the transport level already.
>>>>
>>>>
>>>>>
>>>>>    4. The amount of data I'm able to download in the first round trip
>>>>>    is limited to the amount that can fit in a single initcwnd.
>>>>>
>>>>> It turns out the browser doesn't really know how many connections to
>>>> open until that first resource is downloaded anyway.  Many out-of-band
>>>> tricks exist.
>>>>
>>>>
>>>>>
>>>>>    5. Head of line blocking is exacerbated by putting all objects on
>>>>>    a single connection.
>>>>>
>>>> Yeah, this is true.  But overall, it's still faster and more efficient.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Multiple short-lived HTTP/2 connections give us all the performance
>>>>> benefits of multiplexing without any of the operational or performance
>>>>> drawbacks. As a proxy and a browser implementor, I plan to use
>>>>> multiple HTTP/2 connections when talking to HTTP/2 servers because it seems
>>>>> like the right thing to do from a performance, security, and operational
>>>>> perspective.
>>>>>
>>>>
>>>> When I tested the multi-connection scenarios they were all slower for
>>>> me.  In cases of severe packet loss, it was difficult to discern as
>>>> expected.  But overall, the reduced server resource use and the efficiency
>>>> outweighed the negatives.
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>>>
>>>>> I know it's very late to ask this but can we remove the "SHOULD NOT"
>>>>> statement from the spec? Or, maybe soften it a little for those of us who
>>>>> cannot understand why it's there?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Peter
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>

Received on Tuesday, 1 July 2014 04:54:21 UTC