Re: Fwd: Re: [tcpm] FW: Call for Adoption: TCP Tuning for HTTP from Willy Tarreau on 2016-03-03 (ietf-http-wg@w3.org from January to March 2016)

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 3 Mar 2016 23:01:08 +0100
To: Patrick McManus <mcmanus@ducksong.com>
Cc: Joe Touch <touch@isi.edu>, HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <20160303220108.GB23875@1wt.eu>

Hi Pat,

On Thu, Mar 03, 2016 at 03:43:38PM -0500, Patrick McManus wrote:
> Hi Wily, Joe,
> 
> This message is a bit of a diversion from the discussion so far. sorry bout
> that.

No problem, anyway we're discussing the issues our various implementations
have to face in the field.

> On Thu, Mar 3, 2016 at 1:44 PM, Willy Tarreau <w@1wt.eu> wrote:
> 
> > I've seen people
> > patch their kernels to lower the TIME_WAIT down to 2 seconds to address
> > such shortcomings! Quite frankly, this workaround *is* causing trouble!
> >
> 
> really? That's fascinating to me. Can you provide background or citations
> on what kind of trouble has been attributed to this and the scenario where
> it was done?

Yes that's very simple. Someone runs a load balancer that has to close
server connections itself and that does not abuse SO_LINGER to force an
RST to be emitted. At one point this person notices that the load starts
to look like spikes approaching 1000 connections per second, is very
irregular and that a lot of connection errors are reported to the clients. 
The person runs netstat and sees thousands of established connections
from hosts everywhere on the net, and very few established connections
to the servers. Looking closer, it appears that the total number of
connections per server is 64512, all states included. That number sounds
familiar to all those who already extend their source port range from
1024 to 65535. The problem is that almost all connections are in
TIME_WAIT state, very few are established on this side. The reason is
that the point where you create connections faster than they expire has
been reached. The person then searches on the net and finds that this
problem is well known and various people use different methods depending
on the OS. On Solaris it was common to reduce the time_wait timeout which
was configurable (I believe it was the same as the close_wait but I could
be wrong, it was 10 years ago at least). On Linux people often suggest
enabling tcp_tw_recycle (which you must never do).
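
To check whether a machine is in that situation, one can count sockets per
TCP state. This is a minimal sketch, assuming a Linux host where
/proc/net/tcp is readable. The name count_tcp_states is made up for this
illustration, and the hex codes are the Linux kernel's TCP state numbers:

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp (Linux).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts

# Typical use on a live box (IPv4 only; /proc/net/tcp6 holds the IPv6 sockets):
#     with open("/proc/net/tcp") as f:
#         print(count_tcp_states(f.read()).most_common())
```

A TIME_WAIT count close to the size of the source port range, next to a
handful of ESTABLISHED entries, is the pattern described above.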

In both cases, that appears to solve the issue but then the person starts
noticing that the connection times are very erratic and that an abnormal
number of TCP resets are reported by firewalls. And worse, since the symptoms
are the same as at the beginning of the problem, the person further reduces
the timeouts.

What happens is that from time to time under high load you'll experience
lost packets, especially in networks involving many connection opening/
closing because you can't count on the flow control imposed by the congestion
control algorithm for connections that are independent of each other, so there
are some packet storms that can exceed some switch buffers or even the
destination NIC's buffers.

When the lost packet is the last one (the ACK that the side in FIN_WAIT
sends in response to the FIN before entering the TIME_WAIT state), the
other one will retransmit the FIN. Please keep in mind we were considering
that the TCP client closed first so this TIME_WAIT side is the client. At
low enough loads, the first bad actor is a firewall between the LB and the
server (more and more often this firewall is implemented on the server
itself or worse, on the underlying hypervisor which nobody controls nor
monitors). Many firewalls are tuned with aggressive FIN_WAIT/TIME_WAIT
timeouts which cause their session to expire before the other side dares
to retransmit. The server then remains for a long time in LAST_ACK state,
resending its FIN for some time before giving up.

But by this time, our nice client has already used all other ports and
needs to reuse this port. Since its TIME_WAIT timeout was reduced to
something lower than the server's LAST_ACK, it believes the port is
free and reuses it. It sends a SYN which passes through the firewall,
this SYN reaches the server which disagrees and sends an RST back
(when the client picked a new initial sequence number above the end of the
previous window)
or an ACK which will generally be blocked by the firewall, or if the
firewall accepts it, will be transmitted to the client which will then
send an RST, wait one second and send the SYN again. As you can imagine,
this dance is quite counter-productive for the performance since you
convert hundreds of microseconds to multiples of seconds to establish
certain connections.

And above a certain connection rate, you don't need a firewall anymore
to magnify the problem. If the client wraps its ports once a second
or so, often it's not possible anymore to set a TIME_WAIT timeout to
a lower value (due to the resolution of the settings), so they have
to use tcp_tw_recycle to bypass the TIME_WAIT. You get the same effect as
above, with some connections being responded to with an RST and other
ones with an ACK instead of a SYN-ACK, to which the client sends an RST,
waits one second or so and tries again.

And guess what? I've seen people rush haproxy into production with
"option nolinger" set on both sides to send RSTs to both sides just to
work around this, without even realizing that they were only hiding the
problem, transforming long connect times into truncated responses :-(
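
For readers who have never seen what this workaround does on the wire, here
is a small loopback sketch. It assumes the usual Unix layout of struct linger
(two ints), and the helper names close_with_rst and peer_sees_reset are
invented for the example. It only illustrates the mechanism (SO_LINGER
enabled with a zero timeout makes close() emit an RST instead of a FIN); it
is not a recommendation:

```python
import socket
import struct

def close_with_rst(sock):
    """Close a TCP socket abortively: an RST is sent and no TIME_WAIT is kept."""
    # struct linger { int l_onoff; int l_linger; } -> enabled, zero timeout.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()

def peer_sees_reset():
    """Loopback demo: True if the peer gets a reset rather than a clean EOF."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(2.0)
    close_with_rst(client)
    try:
        server_side.recv(1)              # b"" here would mean an orderly FIN
        return False
    except ConnectionResetError:
        return True
    finally:
        server_side.close()
        listener.close()
```

The reset seen by the peer is exactly what turns into a truncated response
when unread or unsent data is still pending on the connection.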

> You don't need to go through the theoretical - I know what TW could
> conceptually catch - but the assertion about the shorter timeout causing
> field problems is something I'd love to understand better.

The TW is a very important state if you want to guarantee correctness
on your sessions in a situation where ports wrap quickly (and that's
the only case we're interested in BTW).

> For TW to be useful protection it also has to be paired with
> re-transmission and some application states that will be impacted by the
> screwup which reduces its utility, particularly for HTTP. I read the above
> statement as saying that TW is indeed useful in the field at the
> application layer - but maybe it is referencing some side effect I'm not
> thinking of rather than the vulnerability of not using it.

No, the application doesn't even know about it (until it fails to connect
because of it of course). However when the application is aware of how it
works, what protection it provides and what issues it can cause, then it
can try to do its best not to have to cheat on it nor to be impacted by
it. For example, I remember on Solaris 8 you had 32768 source ports and
240s time_wait timeout by default. That's a limit of only 136 connections
per second before being blocked if the client closes first. I remember it
even caused an Apache benchmark to fail at a customer's, they said that
"Apache couldn't go beyond 100 connections per second and it had a bug
because it couldn't use all the available CPU"!
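
The arithmetic behind that limit is simply the number of usable source ports
divided by the TIME_WAIT duration. A small sketch follows. The function name
is made up, and the 60 s figure in the second call is an assumption (it is
the fixed TIME_WAIT length on Linux), not something measured in the cases
described in this mail:

```python
def max_connection_rate(source_ports, time_wait_seconds):
    """Highest sustained rate (connections/s) towards one server address and
    port before every source port is stuck in TIME_WAIT, when the client
    closes first."""
    return source_ports / time_wait_seconds

# Solaris 8 defaults quoted above: 32768 ports, 240 s TIME_WAIT.
print(int(max_connection_rate(32768, 240)))            # 136

# Source port range extended to 1024..65535 (64512 ports), assumed 60 s.
print(int(max_connection_rate(65535 - 1024 + 1, 60)))  # 1075
```

The second figure is in line with the load balancer case earlier in this
mail, where trouble started around 1000 connections per second.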

> Those kinds of post mortem war stories where it is seen in the field are
> pretty interesting and help inform the discussion about whether the cure is
> worse than the disease.

Definitely. And I remember your sentence "pcap or it didn't happen" :-)

> My inclination has generally been that TW doesn't
> help a lot in practice and has some limitations and causes pain (as well
> documented in this thread.). So it would be interesting to look at the
> fallout of a situation it could have helped with.

The real issue we have is that TCP stacks continue to use large timeouts
by default, and these ones multiplied by a number of retries quickly end
up being fairly large. It still makes sense over congested links to wait
a few seconds, but we just want to wait for a few *retransmits*, not seconds.
Mobile operating systems seem to retransmit very fast. I've seen delays
around 150 ms between two SYNs on a smartphone in 2010, and the exponential
back-off didn't immediately start. That contributes to the RAN congestion
of course.
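
To put numbers on "timeouts multiplied by a number of retries", here is a
rough sketch assuming plain exponential back-off, where the timeout doubles
after each unanswered SYN. The function name is invented. The 1 s / 6 retries
pair corresponds to commonly documented Linux defaults, and the 150 ms /
3 retries pair is purely hypothetical:

```python
def time_until_giving_up(initial_rto, retries):
    """Seconds spent before a connect attempt fails, assuming the timeout
    doubles after every unanswered SYN (plain exponential back-off)."""
    return sum(initial_rto * 2 ** attempt for attempt in range(retries + 1))

# 1 s initial timeout, 6 retries: 1 + 2 + 4 + 8 + 16 + 32 + 64
print(time_until_giving_up(1.0, 6))             # 127.0

# Hypothetical faster stack: 150 ms initial timeout, only 3 retries
print(round(time_until_giving_up(0.15, 3), 2))  # 2.25
```

Counting in retransmits rather than in seconds is what keeps the second case
short while still giving the network several chances.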

> This feels a bit like the musing over the subpar utility of the tcp
> checksum on high bandwidth networks. For that one, the answer at least in
> the http space is 'use https for integrity and sort out the rare error at
> the application level'. I'm wondering if that's the right advice in the
> time_wait space for http as well.. we're really still talking about
> integrity. Go ahead and turn it off - just make sure you're running a
> higher level protocol that won't confuse old data with new data.

It doesn't work well for the connection setup as explained above. I could
agree with you that regarding integrity, multiple solutions are offered.
We even have MD5 checksums in the standard if needed. But you need to be able
to quickly recover from an unclear situation, and TIME_WAIT helps you
never be in that unclear situation.

> (Fun paper: http://conferences2.sigcomm.org/imc/2015/papers/p303.pdf showed
> that at the tail of a ping survey, 1 % of replies from 1% of addresses
> needed >= 145 seconds to arrive. And that's just delay - not
> retransmission. A truly protective TW is a very large number.)

Not surprised. I've seen the same order of magnitude in 2010 on a mobile
operator's RAN. The issue comes from deep buffers. Since the network link
can transparently handover between different speeds and the bandwidth is
huge, they need to store megabytes of data, literally. Well, at some point
you have to wait for these megabytes to reach the LAN over a shared link...
And there TCP's retransmits can hurt because at the output of the pipe,
you see a lot of duplicates, sometimes indicating that 20-50% of the data
were retransmits already.

Due to all this I'm not surprised at all about the ongoing work to port
the transport to userland that we saw last summer in Münster! There's
an amazing room for improvement that's left unexploited and that cannot
realistically be exploited by just tuning TCP without breaking
compatibility with everything already deployed!

Cheers,
Willy
Received on Thursday, 3 March 2016 22:01:47 UTC