- From: Roberto Peon <grmocg@gmail.com>
- Date: Wed, 25 Jun 2014 13:56:32 -0700
- To: Willy Tarreau <w@1wt.eu>
- Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Martin Thomson <martin.thomson@gmail.com>, Jason Greene <jason.greene@redhat.com>, Mark Nottingham <mnot@mnot.net>, HTTP Working Group <ietf-http-wg@w3.org>
- Message-ID: <CAP+FsNcOuSYHw=OS2guJPQdOimF08hRBQ+-qTbkxueAHJ=MO9g@mail.gmail.com>
On Wed, Jun 25, 2014 at 1:22 PM, Willy Tarreau <w@1wt.eu> wrote:

> On Wed, Jun 25, 2014 at 01:02:14PM -0700, Roberto Peon wrote:
> > > Especially if moved to silicon, because the round-trip to hardware
> > > is particularly expensive, which is why some chip makers have moved
> > > the crypto accelerators into the CPU's instruction set, for example.
> >
> > We have examples of this kind of thing in hardware, working to reduce
> > its cost, today: TCP Segmentation Offload (TSO) offloads the building
> > of TCP segments to the NIC's hardware.
>
> TSO works if you feed it with enough data. That's precisely one point
> where 16kB makes it totally useless, because the cost of preparing the
> fragment descriptors for the NICs outweighs the savings of cutting them
> into a few packets in the TCP stack.

Frame size != write() size, unless you're attempting to use splice()
(which often isn't cheaper, sadly).

> > If we're talking about non-TLS stuff, then doing this kind of simple
> > thing on the NIC doesn't seem that hard. It is roughly the same thing.
>
> Not for an intermediary. I have to receive muxed streams from multiple
> servers and deliver them over muxed connections to multiple clients.
>
> Please consider this simple use case:
>
> - requests for /img /css /js /static go to server 1
> - requests for /video go to server 2
> - requests for other paths go to server 3
>
> Clients send their requests over the same connection. The load balancer
> has several connections to servers 1 and 3 behind it and forwards the
> clients' requests over these connections to retrieve objects. In
> practice, a client will go first to server 3 (GET /), then to server 1
> (to retrieve the page components), then to server 2 over a fresh
> connection, and stay there for a long time. There's no place for
> NIC-based acceleration here, because this MUX pulls data from one side
> and transfers it to the other side in small chunks. When the video
> starts, if we had the ability to splice large chunks from server 2 to
> the client, there would be a real benefit. With the small chunks, the
> benefits disappear and we're back to doing the same
> recv()+memcpy()+send() job as for the other servers (double to triple
> copy instead of zero).

If the hardware were similar to TSO hardware, that case would work just
as it does for TSO: you send data to the NIC via the kernel, potentially
with a different API (e.g. a socket per stream), and the NIC (or kernel)
fragments it into frames. It'd work just fine.

> Sure, this is not something a common server or user-agent is even able
> to detect at regular loads. But you (and you particularly) know as well
> as I do that intermediaries need to shave off everything possible to
> avoid wasting time doing dumb things such as copying small data or
> visiting the same byte twice, etc.

True (you've seen me push to reduce memory requirements), though my
motivation w.r.t. CPU consumption is to balance any change in resource
requirements against the change in the user's experience.

> To be clear, this is not the end of the world; it will most probably
> just lead to a significant part of the internet not deploying what was
> designed here and waiting for H3 to appear, when we could easily make
> it worth their while to consider the option.
>
> Regards,
> Willy

-=R
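For readers following the splice() point in the exchange above, here is a minimal sketch, assuming Linux, of the two forwarding paths being contrasted. `forward_copy()` is the recv()+memcpy()+send() path, where every byte crosses the kernel/user boundary twice; `forward_splice()` moves page references between the sockets through a pipe without ever copying the data into user space. The function names, the descriptors `src_fd`/`dst_fd`, and the 16kB chunk size are illustrative, not from the thread.

```c
#define _GNU_SOURCE           /* for splice() */
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Copying path: each byte is read into a user-space buffer and written
 * back out, so it crosses the kernel/user boundary twice. */
static ssize_t forward_copy(int src_fd, int dst_fd)
{
    char buf[16384];                      /* one 16kB frame's worth */
    ssize_t n = recv(src_fd, buf, sizeof(buf), 0);
    if (n > 0 && send(dst_fd, buf, n, 0) != n)
        return -1;
    return n;
}

/* Zero-copy path: splice() cannot join two sockets directly, so the
 * pages are moved through a pipe, never mapped into user space. The
 * setup cost only pays off when the chunks are large. */
static ssize_t forward_splice(int src_fd, int dst_fd, size_t chunk)
{
    int p[2];
    ssize_t in, out;

    if (pipe(p) < 0)
        return -1;
    in = splice(src_fd, NULL, p[1], NULL, chunk,
                SPLICE_F_MOVE | SPLICE_F_MORE);
    if (in > 0)
        out = splice(p[0], NULL, dst_fd, NULL, (size_t)in,
                     SPLICE_F_MOVE | SPLICE_F_MORE);
    else
        out = in;
    close(p[0]);
    close(p[1]);
    return out;
}
```

With frames capped at 16kB, the per-chunk setup (the pipe, the extra syscalls, the fragment descriptors) is paid every few packets, which is the overhead being described; a real proxy would also keep the pipe around across calls rather than recreating it per chunk.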
Received on Wednesday, 25 June 2014 20:57:00 UTC