Re: #540: "jumbo" frames from Willy Tarreau on 2014-06-25 (ietf-http-wg@w3.org from April to June 2014)

From: Willy Tarreau <w@1wt.eu>
Date: Wed, 25 Jun 2014 18:12:18 +0200
To: Johnny Graettinger <jgraettinger@chromium.org>
Cc: Patrick McManus <pmcmanus@mozilla.com>, Mark Nottingham <mnot@mnot.net>, K.Morgan@iaea.org, Poul-Henning Kamp <phk@phk.freebsd.dk>, Greg Wilkins <gregw@intalio.com>, HTTP Working Group <ietf-http-wg@w3.org>, Martin Dürst <duerst@it.aoyama.ac.jp>
Message-ID: <20140625161218.GN5531@1wt.eu>

On Wed, Jun 25, 2014 at 11:48:08AM -0400, Johnny Graettinger wrote:
> FWIW, I don't buy the premise that the current framing mechanism requires
> more frequent system calls, or must imply "lower performance" for large
> sends. One frame != one system call to write or read that frame.
> 
> Vectored IO API's are available to write multiple frames with a single
> call.

... and they result in a data copy which is often even more expensive than
the syscall it tried to save.

When you're forwarding data between two TCP sockets and can make use of
TCP splicing, you have to work in whole pages (4kB). Anything that is not
a full page results in a copy, and recv()+send() results in two copies.
That's why splice() on small sizes (typically 16kB) offers no benefit: the
first and/or the last page are often incomplete or unaligned, so only the
two middle ones are spliced from the CPU's L3 cache without ever hitting
memory. Also, with a 14-bit length encoding we cannot transfer 16kB; we
can transfer at most 16kB-1, so you're always guaranteed to *copy* 4095
bytes at the end of each transfer and to misalign every data block.
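
To put numbers on the 16kB-1 case, here is a minimal sketch of the page
arithmetic. It assumes 4096-byte pages and a payload that starts on a page
boundary (the best case); the helper name page_split is made up for
illustration only.

```python
PAGE_SIZE = 4096            # assumed page size (the 4kB mentioned above)
MAX_FRAME = (1 << 14) - 1   # largest length a 14-bit field can encode


def page_split(length, page_size=PAGE_SIZE):
    """Return (full_pages, leftover_bytes) for a payload starting on a page boundary."""
    return divmod(length, page_size)


if __name__ == "__main__":
    full_pages, leftover = page_split(MAX_FRAME)
    # Full pages are candidates for splicing; the leftover bytes must be copied.
    print(MAX_FRAME, full_pages, leftover)
```

Even with perfect alignment on the first frame, the leftover tail means the
next frame's payload no longer starts on a page boundary.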

16kB is *really* suboptimal for large transfers. I just got a report of
a company reaching 58 Gbps of forwarded traffic with haproxy, using a
splice size of 512kB. There *are* definitely a lot of losses to expect
from running at 16kB-1. I'd give 10-15 Gbps for the setup above at that
size, not more.

I'm not advocating for making things complex nor for breaking the protocol
either, I'm just reporting real world scenarios which work very well with
HTTP/1.1 and which are 2.0-unfriendly. It's not too late to try to fix
these corner cases. I think that the ability to use just one bit to change
the size unit is reasonable. It's the same idea TCP uses with Window
Scaling, and it works pretty well. We can easily accept that the unit only
changes after the first round trip if needed; what matters is that the
whole stream does not have to be retrieved in tiny 16kB chunks.
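
As a rough illustration of the one-bit idea, here is a sketch of a 14-bit
length field with a flag that switches the unit. Everything in it is an
assumption made for the example: the 4096-byte scaled unit, the flag
layout, and the names encode_length and decode_length. None of it comes
from a draft.

```python
FIELD_BITS = 14
FIELD_MAX = (1 << FIELD_BITS) - 1   # 16383, the largest 14-bit value
SCALED_UNIT = 4096                  # hypothetical unit when the flag is set


def encode_length(nbytes):
    """Return (scaled_flag, field) for a payload of nbytes bytes."""
    if nbytes < 0:
        raise ValueError("length must not be negative")
    if nbytes <= FIELD_MAX:
        return (0, nbytes)           # small payloads keep byte granularity
    units, remainder = divmod(nbytes, SCALED_UNIT)
    if remainder or units > FIELD_MAX:
        raise ValueError("length not representable in scaled units")
    return (1, units)                # large payloads count whole units


def decode_length(scaled_flag, field):
    """Inverse of encode_length: return the payload size in bytes."""
    return field * SCALED_UNIT if scaled_flag else field


if __name__ == "__main__":
    flag, field = encode_length(512 * 1024)
    print(flag, field, decode_length(flag, field))
```

With a page-sized unit as assumed here, a scaled frame always carries whole
pages, which is what the splicing case needs.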

Regards,
Willy

Received on Wednesday, 25 June 2014 16:13:42 UTC