- From: Willy Tarreau <w@1wt.eu>
- Date: Wed, 25 Jun 2014 22:22:19 +0200
- To: Roberto Peon <grmocg@gmail.com>
- Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Martin Thomson <martin.thomson@gmail.com>, Jason Greene <jason.greene@redhat.com>, Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
On Wed, Jun 25, 2014 at 01:02:14PM -0700, Roberto Peon wrote:
> > Especially if moved to silicon, because the round-trip to hardware
> > is particularly expensive, which is why some chip makers have moved
> > the crypto accelerators into the CPU's instruction set for example.
>
> We have examples of this kind of thing in hardware and working to reduce
> cost today:
> TCP Segment Offload (TSO) offloads making TCP segments to the NIC's
> hardware.

TSO works if you feed it with enough data. That's precisely one point
where 16kB makes it totally useless, because the cost of preparing the
fragment descriptors for the NIC outweighs the savings of cutting the
data into a few packets in the TCP stack.

> If we're talking about non-TLS stuff, then doing this kind of simple thing
> on the NIC doesn't seem that hard. It is roughly the same thing.

Not for an intermediary. I have to receive muxed streams from multiple
servers and deliver them over muxed connections to multiple clients.
Please consider this simple use case:

  - requests for /img /css /js /static go to server 1
  - requests for /video go to server 2
  - requests for other paths go to server 3

Clients send their requests over the same connection. The load balancer
has several connections to servers 1 and 3 behind it and forwards clients'
requests over these connections to retrieve objects. In practice, a client
will go first to server 3 (GET /), then to server 1 (retrieve page
components), then to server 2 over a fresh connection, and stay there for
a long time.

There's no place for NIC-based acceleration here because this MUX pulls
data from one side and transfers it to the other side in small chunks.
When the video starts, if we had the ability to splice large chunks from
server 2 to the client, there would be a real benefit. With the small
chunks, the benefits disappear and we're back to doing the same
recv+memcpy()+send job as for the other servers (double to triple copy
instead of zero).
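To make the use case above concrete, here is a minimal Python sketch of the routing table and of how many forwarding operations a capped chunk size implies. It is only an illustration under stated assumptions: the server names, the prefix-matching rule, the helper names `pick_server` and `chunks_needed`, and the 1 MiB payload are all hypothetical, and the 16 * 1024 cap simply stands in for the "16kB" figure mentioned above.

```python
# Hypothetical route table for the use case described above.
# Server names and prefix matching are assumptions made for illustration.
ROUTES = [
    ("/img", "server1"),
    ("/css", "server1"),
    ("/js", "server1"),
    ("/static", "server1"),
    ("/video", "server2"),
]
DEFAULT_SERVER = "server3"


def pick_server(path: str) -> str:
    """Return the backend for a request path using prefix matching."""
    for prefix, server in ROUTES:
        # Match the prefix itself or anything below it ("/img", "/img/a.png").
        if path == prefix or path.startswith(prefix + "/"):
            return server
    return DEFAULT_SERVER


def chunks_needed(payload_bytes: int, max_chunk: int = 16 * 1024) -> int:
    """Number of forwarding operations when each moves at most max_chunk bytes."""
    return -(-payload_bytes // max_chunk)  # ceiling division


if __name__ == "__main__":
    # The client sequence from the text: page, page components, then video.
    for path in ("/", "/css/site.css", "/video/movie.mp4"):
        print(path, "->", pick_server(path))
    print("chunks for 1 MiB:", chunks_needed(1024 * 1024))
```

With a 16 kB cap, a 1 MiB response from server 2 takes 64 separate forwarding operations, and each one pays the per-chunk overhead.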
Sure, this is not something a common server or user-agent is even able to
detect at regular loads. But you (and you particularly) know as well as I
do that intermediaries need to shave off everything possible to avoid
wasting time doing dumb things such as copying small data or visiting the
same byte twice, etc.

To be clear, this is not the end of the world. It will probably only lead
to a significant part of the internet not deploying what was designed
here, waiting for H3 to appear, when we could easily make it worthwhile
for them to consider the option.

Regards,
Willy
Received on Wednesday, 25 June 2014 20:22:49 UTC