Re: why not multiple, short-lived HTTP/2 connections? from Mike Belshe on 2014-06-30 (ietf-http-wg@w3.org from April to June 2014)

From: Mike Belshe <mike@belshe.com>
Date: Mon, 30 Jun 2014 10:00:09 -0700
To: Peter Lepeska <bizzbyster@gmail.com>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <CABaLYCte2=rz19ZB-qUC-6Q=Ya0Rh0yD+mPRG7-knu9vDQfyKA@mail.gmail.com>
On Mon, Jun 30, 2014 at 9:04 AM, <bizzbyster@gmail.com> wrote:

> All,
>
> Another huge issue is that for some reason I still see many TCP
> connections that do not advertise support for window scaling in the SYN
> packet. I'm really not sure why this is but for instance WPT test instances
> are running Windows 7 and yet they do not advertise window scaling and so
> TCP connections max out at a send window of 64 KB. I've seen this in tests
> run out of multiple different WPT test locations.
>
> The impact of this is that high latency connections max out at very low
> throughputs. Here's an example (with tcpdump output so you can examine the
> TCP flow on the wire) where I download data from a SPDY-enabled web server
> in Virginia from a WPT test instance running in Sydney:
> http://www.webpagetest.org/result/140629_XG_1JC/1/details/. Average
> throughput is not even 3 Mbps despite the fact that I chose a 20 Mbps FIOS
> connection for my test. Note that when I disable SPDY on this web server, I
> render the page almost twice as fast because I am using multiple
> connections and therefore overcoming the per connection throughput
> limitation: http://www.webpagetest.org/result/140629_YB_1K5/1/details/.
>
> I don't know the root cause (Windows 7 definitely sends the window scaling
> option in SYN in other tests) and have sent a note to the webpagetest.org
> admin but in general there are reasons why even Windows 7 machines
> sometimes appear to not use window scaling, causing single connection SPDY
> to perform really badly even beyond the slow start phase.
>
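A rough sketch of the arithmetic behind the 64 KB ceiling described above: without window scaling, at most one window of data can be in flight per round trip, so throughput is bounded by window / RTT. The 200 ms round trip below is an assumption for a Sydney-to-Virginia path; the thread does not state the measured RTT.

```python
def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput when the window is the bottleneck."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


UNSCALED_WINDOW = 65_535  # largest window expressible without window scaling
ASSUMED_RTT = 0.200       # assumption: not a figure taken from the thread

print(f"{max_throughput_mbps(UNSCALED_WINDOW, ASSUMED_RTT):.2f} Mbps")
```

Under that assumed RTT the bound comes out below 3 Mbps regardless of link speed, which is in line with the throughput reported for the single-connection test.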

I believe your test is invalid.

The time-to-first-byte (single connection) in the first case is 1136ms.
In the second case, the time-to-first-byte is 768ms.  They should have been
identical, right?  What this really means is that you're testing over the
public net, and your variance is somewhere in the 50% range.  The key to
successful benchmarking is eliminating variance and running *lots* of
iterations.

I'm not saying you aren't seeing a trend.  You might be, but this test
doesn't show it.  Variance is exacerbating the issue, the benchmark
transport is suspect (for reasons you've already identified), and I further
suspect that you're using a funky server transport connection (what is your
initial CWND?  What is your initial server recv buffer?).

I also looked at the TCP dumps.  Here are some random notes:
   * Looks like your SSL is using 16KB record sizes.  That will cause
delays.
   * I'm not seeing any packet loss (haven't used cloudshark before, so
maybe I'm not looking carefully enough).  If there is no packet loss, and
we're not bandwidth constrained, then this is just test-environment
related, right?
   * Notice that the time-to-first-render (which is the important metric
for a page which is all images) makes the case you call the slow case
faster.  Yes, it's 1.383s instead of 1.026s, but remember it had a 368ms
handicap (1136ms - 768ms, due to the variance mentioned above).
   * There is noise in this test.  The first case sent 88KB/109pkts
to 74.125.237.183, which the second test didn't send.  I believe this is the
browser doing something behind your back that interferes with your test.
Not sure.
   * What is the cwnd on your server?
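
The time-to-first-render note above can be checked with a few lines of arithmetic. This is only a sketch of that adjustment, using the four timings quoted in this message; the variable names are made up for illustration, and subtracting the time-to-first-byte gap is a crude correction, not a proper statistical treatment.

```python
# Timings quoted in this message, in milliseconds.
ttfb_slow_case_ms = 1136     # first test (single connection)
ttfb_fast_case_ms = 768      # second test
render_slow_case_ms = 1383
render_fast_case_ms = 1026

# The time-to-first-byte gap is treated as a handicap caused by variance.
handicap_ms = ttfb_slow_case_ms - ttfb_fast_case_ms
adjusted_slow_render_ms = render_slow_case_ms - handicap_ms

print(handicap_ms, adjusted_slow_render_ms)
print(adjusted_slow_render_ms < render_fast_case_ms)
```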

Overall, I can't explain why you're seeing a different result in this test,
except to point out all the variances which make the test look suspect.




>
> SPDY/HTTP2 is supposed to be about faster. Let's encourage browser
> developers to make use of the new protocol to make web browsing as fast as
> possible -- and not limit them to one connection and therefore essentially
> ask them to do so with one hand tied behind their backs.
>

Again, after all the testing I did, it was clear that single connection is
the best route.

I didn't reply earlier about packet loss, but the packet loss simulators
generally issue "random" packet loss.  This is different from "correlated"
packet loss like a buffer tail drop in a router.  With the former, a
single-connection protocol gets screwed, because the cwnd gets cut in half
on a single connection, while a 2-connection load would only have its
aggregate cwnd cut by 1/4 with a single loss.  But in *correlated* loss
modeling, we see *all* connections get packet loss at the same time, because
the entire buffer was lost in the router.  The net result is that if you
simulate tail drops, you tend to see the cwnd collapse due to packet loss be
much more similar in single and multi connection policies.
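
The random-versus-correlated distinction can be sketched numerically. The toy model below assumes each connection that sees a loss halves its cwnd (a Reno-style multiplicative decrease) and uses an arbitrary total of 40 segments; both are illustrative assumptions, not measurements from any test discussed here.

```python
def fraction_of_cwnd_lost(cwnds, hit):
    """Fraction of aggregate cwnd lost when connections in `hit` halve."""
    before = sum(cwnds)
    after = sum(c / 2 if i in hit else c for i, c in enumerate(cwnds))
    return 1 - after / before


single = [40]    # one connection holding the whole window
pair = [20, 20]  # the same window split over two connections

# Random loss: one dropped packet lands on one connection.
print(fraction_of_cwnd_lost(single, {0}))   # single connection
print(fraction_of_cwnd_lost(pair, {0}))     # two connections

# Correlated loss (tail drop): every connection sees a drop at once.
print(fraction_of_cwnd_lost(pair, {0, 1}))
```

With random loss the pair gives up a quarter of its aggregate window versus half for the single connection; with correlated loss both give up half, which is the convergence described above.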

Mike




>
> Thanks,
>
> Peter
>
>
> On Jun 25, 2014, at 5:27 PM, bizzbyster@gmail.com wrote:
>
> Responses inline.
>
> On Jun 25, 2014, at 11:47 AM, Mike Belshe <mike@belshe.com> wrote:
>
>
>
>
> On Wed, Jun 25, 2014 at 7:56 AM, <bizzbyster@gmail.com> wrote:
>
>> Thanks for all the feedback. I'm going to try to reply to Mike, Greg,
>> Willy, and Guille in one post since a few of you made the same or similar
>> points. My apologies in advance for the very long post.
>>
>> First, you should understand that I am building a browser and web server
>> that use the feedback loop described here (
>> http://caffeinatetheweb.com/baking-acceleration-into-the-web-itself/) to
>> provide the browser with a set of hints inserted into the html that allow
>> it to load the page much faster. I prefer subresource hints to server push
>> because A) it works in coordination with the browser cache state and B)
>> hints can be supplied for resources found on third party domains. But my
>> hints also go beyond just supplying the browser with a list of URLs to
>> fetch:
>> http://lists.w3.org/Archives/Public/public-web-perf/2014Jun/0044.html.
>> They also include an estimate of the size of the objects for instance.
>>
>> Okay so this is all relevant because it means that I often know the large
>> number of objects (sometimes 50+) I need to fetch from a given server up
>> front and therefore have to figure out the optimal way to retrieve these
>> objects. Unlike Mike's tests, my tests have shown that a pool with multiple
>> connections is faster than a single one, perhaps because my server hints
>> allow me to know about a much larger number of URLs up front and because I
>> often have expected object sizes.
>>
>
> With appropriate hand-crafted and customized server hints, I'm not
> surprised that you can outpace a single connection in some scenarios.
>
> But, the answer to "which is faster" will not be a boolean yes/no - you
> have to look at all sorts of network conditions, including link speed, RTT,
> and packet loss.
>
> The way I chose how to optimize was based on studying how networks are
> evolving over time:
>    a) we know that bandwidth is going up fairly rapidly to end users
>    b) we know that RTT is not changing much, and in some links going up
>    c) packet loss is getting better, but is very difficult to pin down &
> even harder to appropriately model (tail drops vs random, etc)
>
> So I chose to optimize assuming that (b) will continue to hold true.  In
> your tests, you should try jacking the RTT up to 100ms+.  Average RTT to
> Google is ~100ms (my data slightly old here).  If you're on a super fast
> link, then sure, initcwnd will be your bottleneck, because RTT is not a
> factor.  What RTTs did you simulate?  I'm guessing you were using a
> high-speed local network?
>
>
> We are testing at many different latencies (including satellite) but also
> the benefit of multiple connections actually increases as RTT increases
> because connections are in a slow start bottlenecked state for longer for
> higher latency links. After 4 RTTs a single connection will be able to
> transfer roughly initcwnd*(2)^4 in the next round trip whereas 3
> connections will be able to transfer 3*initcwnd*(2)^4, assuming we can keep
> all three connections full and we are not bandwidth limited.
>
> In addition, multiple connections are more robust to packet loss. If we
> drop a packet after 4 RTTs, the overall throughput for the single
> connection case drops back down to its initial rate before exponentially
> growing again whereas only one connection in the 3 connection case will
> fall back to initcwnd.
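
The slow-start argument in the two paragraphs above can be written out as a short sketch. It follows the idealized model used in the text (cwnd doubles every round trip, a loss resets one connection to initcwnd, no bandwidth limit, all connections kept full); the initcwnd of 10 is just an example value.

```python
def per_rtt_capacity(initcwnd: int, rtts_elapsed: int, connections: int = 1) -> int:
    """Segments sendable in the next round trip under ideal slow start."""
    return connections * initcwnd * 2 ** rtts_elapsed


def capacity_after_one_loss(initcwnd: int, rtts_elapsed: int, connections: int) -> int:
    """Same, but one connection has just fallen back to initcwnd."""
    grown = initcwnd * 2 ** rtts_elapsed
    return (connections - 1) * grown + initcwnd


INITCWND = 10  # example value only

print(per_rtt_capacity(INITCWND, 4))                  # one connection
print(per_rtt_capacity(INITCWND, 4, connections=3))   # three connections
print(capacity_after_one_loss(INITCWND, 4, 1))        # one connection, one loss
print(capacity_after_one_loss(INITCWND, 4, 3))        # three connections, one loss
```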
>
>
> Overall, the single connection has some drawbacks on some networks, but by
> and large it works better while also providing real server efficiencies and
> finally giving the transport the opportunity to do its job better.  When we
> split onto zillions of connections, we basically sidestep all of the
> transport layer's goodness.  (This is a complex topic too, however).
>
>
>> If I need to fetch 6 small objects (each the size of a single full
>> packet) from a server that has an initcwnd of 3,
>>
>
> Why use a server with cwnd of 3?  Default linux distros ship with 10 today
> (and have done so for like 2 years).
>
>
> Understood. 3 is the value mentioned in Ops Guide so I just used that for
> my example. But the argument applies to 10 as well. The main thing is that
> if we know the size of the file we can do really clever things to make
> connection pools download objects really fast.
>
>
>
>
>
>>  I can request 3 objects on each of two connections and download those
>> objects in a single round trip. This is not a theoretical idea -- I have
>> tested this and I get the expected performance. In general, a pool of cold
>> HTTP/2 connections is much faster than a single cold connection for
>> fetching a large number of small objects, especially when you know the size
>> up front. I will share the data and demo as soon as I'm able to.
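
A minimal sketch of the round-trip counting behind the six-object example above. It assumes every object fits in one full-size packet, objects are split evenly across connections, cwnd doubles each round trip, and handshakes and request upload are ignored; `round_trips_needed` is a hypothetical helper, not code from either party's tests.

```python
def round_trips_needed(packets: int, initcwnd: int, connections: int = 1) -> int:
    """Round trips to deliver `packets` under ideal slow start."""
    per_connection = -(-packets // connections)  # ceiling division
    delivered, cwnd, rtts = 0, initcwnd, 0
    while delivered < per_connection:
        delivered += cwnd  # one window of packets per round trip
        cwnd *= 2          # slow start doubles the window
        rtts += 1
    return rtts


print(round_trips_needed(6, initcwnd=3, connections=1))
print(round_trips_needed(6, initcwnd=3, connections=2))
print(round_trips_needed(6, initcwnd=10, connections=1))
```

Under these assumptions the two-connection pool finishes in one round trip at initcwnd 3, while a single connection needs two; at initcwnd 10 the single connection also finishes in one.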
>>
>
> Many have tested this heavily too, so I believe your results.  My own test
> data fed into
> https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
>
>
>> Since I know that multiple connections are faster, I can imagine a
>> solution that web performance optimizers will resort to if browsers only
>> support one connection per host: domain sharding! Let's avoid this by
>> removing the SHOULD NOT from the spec.
>>
>
> I think we need to get into the nitty gritty benchmarking details if we
> want to claim that a single connection is faster.  I highly doubt this is
> true for all network types.
>
>
> For non-bandwidth limited links, the improvement for multiple connections
> should increase with increasing latency and increasing packet loss. So I
> have trouble understanding how a single connection could ever win. I wonder
> if in your tests you were not able to keep your multiple connections full
> b/c that could result in slow start on idle kicking in for those
> connections. Is that possible?
>
>
>
>
>>
>> "Servers must keep open idle connections, making load balancing more
>> complex and creating DOS vulnerability." A few of you pointed out that the
>> server can close them. That's true. I should not have said "must". But
>> Mark's Ops Guide suggests that browsers will aggressively keep open idle
>> connections for performance reasons, and that servers should support this
>> by not closing these connections. And also servers should keep those
>> connections fast by disabling slow start after idle. In my opinion,
>> browsers should keep connections open only as long as they have the
>> expectation of imminent requests to issue on those connections, which is
>> essentially the way that mainstream browsers handle connection lifetimes
>> for HTTP/1.1 connections today. We should not create an incentive for
>> browsers to hold on to connections for longer than this and to encourage
>> servers to support longer lived idle connections than they already do today.
>>
>
> What we really need is just a better transport.  We should have 'forever
> connections'.  The idea that the endpoints need to maintain state to keep
> connections open is so silly; session resumption without a round-trip is
> very doable.   I believe QUIC does this :-)
>
> Agreed -- I can't wait to see QUIC in action.
>
>
>
>>
>> Some of you pointed out that a single connection allows us to get back to
>> fair congestion control. But TCP slow start and congestion control are
>> designed for transferring large objects. They unfairly penalize
>> applications that need to fetch a large number of small objects. Are we
>> overflowing router buffers today b/c we are using 6 connections per host? I
>> agree that reducing that number is a good thing, which HTTP/2 will
>> naturally enable. But I don't see any reason to throttle web browsers down
>> to a single slow started connection. Also again, web browser and site
>> developers will work around this artificial limit. In the future we will
>> see 50+ Mbps last mile networks as the norm. This makes extremely fast page
>> load times possible, if only we can mitigate the impact of latency by
>> increasing the concurrency of object requests. I realize that QUIC may
>> eventually solve this issue but in the meantime we need to be able to use
>> multiple TCP connections to squeeze the most performance out of today's web.
>>
>
> A tremendous amount of research has gone into this, and you're asking good
> questions to which nobody knows the exact answers.  It's not that we don't
> know the answers for lack of trying - it's because there are so many
> combinations of network equipment, speeds, configs, etc, in the real world
> that all real world data is a mix of errors.  Given the research that has
> gone into it, I wouldn't expect these answers to come crisply or quickly.
>
> I agree we'll have 50+Mbps in the not-distant future.  But so far, there
> is no evidence that we're figuring out how to bring RTT's down.  Hence,
> more bandwidth doesn't matter much:
> https://docs.google.com/a/chromium.org/viewer?a=v&pid=sites&srcid=Y2hyb21pdW0ub3JnfGRldnxneDoxMzcyOWI1N2I4YzI3NzE2
>
>
> If we can increase concurrency, via dynamically generated server hints,
> then we can start seeing increases in bandwidth show up as improvements in
> page load time again -- we can make bandwidth matter again. Server hints
> also allow us to keep our multiple connections full so we can avoid the
> slow start on idle issue without requiring a config change on servers.
>
>
> Mike
>
>
>
>
>>
>> Thanks for reading through all this,
>>
>> Peter
>>
>> On Jun 24, 2014, at 3:55 PM, Mike Belshe <mike@belshe.com> wrote:
>>
>>
>>
>>
>> On Tue, Jun 24, 2014 at 10:50 AM, <bizzbyster@gmail.com> wrote:
>>
>>> I've raised this issue before on the list but it's been a while and
>>> reading Mark's ops guide doc (
>>> https://github.com/http2/http2-spec/wiki/Ops) I'm reminded that
>>> requiring the use of a single connection for HTTP/2 ("Clients SHOULD
>>> NOT open more than one HTTP/2 connection") still makes no sense to me. Due
>>> to multiplexing, HTTP/2 will naturally use FEWER connections than HTTP/1,
>>> which is a good thing, but requiring a single connection has the following
>>> drawbacks:
>>>
>>>
>>>    1. Servers must keep open idle connections, making load balancing
>>>    more complex and creating DOS vulnerability.
>>>
>>>
>> As others have mentioned, you don't have to do this.
>>
>>>
>>>    1. Servers must turn off *tcp_slow_start_after_idle* in order for
>>>    browsers to get good performance, again creating DOS vulnerability.
>>>
>> You also don't have to do this; it will drop back to init cwnd levels if
>> you do, just as though you had opened a fresh connection.
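
For anyone wanting to check the setting under discussion, here is a small sketch that reads it on Linux. It assumes the sysctl is exposed at the usual procfs path for `net.ipv4.tcp_slow_start_after_idle`; on other systems the function simply returns `None`. The function name is made up for illustration.

```python
from pathlib import Path
from typing import Optional

SYSCTL_PATH = "/proc/sys/net/ipv4/tcp_slow_start_after_idle"


def slow_start_after_idle(path: str = SYSCTL_PATH) -> Optional[int]:
    """Return the sysctl value (1 or 0), or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


setting = slow_start_after_idle()
if setting is None:
    print("sysctl not available on this system")
elif setting == 1:
    print("enabled: cwnd is reduced after an idle period")
else:
    print("disabled: idle connections keep their cwnd")
```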
>>
>>
>>>
>>>    1. The number of simultaneous GET requests I'm able to upload in the
>>>    first round trip is limited to the compressed amount that can fit in a
>>>    single initcwnd. Yes compression helps with this but if I use multiple
>>>    connections I will get the benefit of compression for the requests on the
>>>    same connection, in addition to having multiple initcwnds!
>>>
>> It turns out that a larger initcwnd just works better anyway - there was
>> a tremendous amount of evidence supporting going up to 10, and that was
>> accepted at the transport level already.
>>
>>
>>>
>>>    1. The amount of data I'm able to download in the first round trip
>>>    is limited to the amount that can fit in a single initcwnd.
>>>
>> It turns out the browser doesn't really know how many connections to
>> open until that first resource is downloaded anyway.  Many out-of-band
>> tricks exist.
>>
>>
>>>
>>>    1. Head of line blocking is exacerbated by putting all objects on a
>>>    single connection.
>>>
>> Yeah, this is true.  But overall, it's still faster and more efficient.
>>
>>
>>
>>
>>>
>>> Multiple short-lived HTTP/2 connections gives us all the performance
>>> benefits of multiplexing without any of the operational or performance
>>> drawbacks. As a proxy and a browser implementor, I plan to use multiple
>>> HTTP/2 connections when talking to HTTP/2 servers because it seems like the
>>> right thing to do from a performance, security, and operational perspective.
>>>
>>
>> When I tested the multi-connection scenarios they were all slower for me.
>>  In cases of severe packet loss, it was, as expected, difficult to discern
>> a difference.  But overall, the reduced server resource use and the
>> efficiency outweighed the negatives.
>>
>> Mike
>>
>>
>>
>>>
>>> I know it's very late to ask this but can we remove the "SHOULD NOT"
>>> statement from the spec? Or, maybe soften it a little for those of us who
>>> cannot understand why it's there?
>>>
>>> Thanks,
>>>
>>> Peter
>>>
>>
>>
>>
>
>
>
Received on Monday, 30 June 2014 17:00:41 UTC