Re: Stuck in a train -- reading HTTP/2 draft.

On Wed, Jun 25, 2014 at 1:22 PM, Willy Tarreau <w@1wt.eu> wrote:

> On Wed, Jun 25, 2014 at 01:02:14PM -0700, Roberto Peon wrote:
> > > Especially if moved to silicon, because the round-trip to hardware
> > > is particularly expensive, which is why some chip makers have moved
> > > the crypto accelerators into the CPU's instruction set for example.
> > >
> > >
> > We have examples of this kind of thing in hardware and working to reduce
> > cost today:
> > TCP Segmentation Offload (TSO) offloads the cutting of data into TCP
> > segments to the NIC's hardware.
>
> TSO works if you feed it with enough data. That's precisely one point
> where 16kB makes it totally useless, because the cost of preparing the
> fragment descriptors for the NIC outweighs what is saved by not cutting
> the data into a few packets in the TCP stack.
>

Frame size != write() size, unless you're attempting to use splice() (which
often isn't cheaper, sadly).
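
To illustrate the distinction, here is a minimal sketch of several 16kB
frames being handed to the kernel in one gathered write. The names
(build_iovec, frame_header) are made up for this example, and the 9-octet
header layout is assumed to follow the one later published in RFC 7540,
which may differ from the draft under discussion.

```python
import struct

MAX_FRAME_PAYLOAD = 16384  # the 16kB frame limit discussed in this thread
DATA_FRAME_TYPE = 0x0      # assumption: DATA frame type as in RFC 7540


def frame_header(length, frame_type, flags, stream_id):
    """Pack an assumed 9-octet header: 24-bit length, type, flags, stream id."""
    length_24 = struct.pack(">I", length)[1:]  # keep the low 3 bytes
    return length_24 + struct.pack(">BBI", frame_type, flags,
                                   stream_id & 0x7FFFFFFF)


def build_iovec(payload, stream_id):
    """Split payload into frames, returning header/chunk buffers in order.

    The chunks are memoryview slices, so the payload itself is not copied.
    """
    view = memoryview(payload)
    iov = []
    for off in range(0, len(view), MAX_FRAME_PAYLOAD):
        chunk = view[off:off + MAX_FRAME_PAYLOAD]
        iov.append(frame_header(len(chunk), DATA_FRAME_TYPE, 0, stream_id))
        iov.append(chunk)
    return iov


# Usage (not run here): one syscall carries all frames, e.g.
#   sock.sendmsg(build_iovec(body, stream_id=1))
```

With a 64kB body this yields four frames but still a single sendmsg()
call, so the amount of data offered to the stack (and to TSO) is not
bounded by the frame size.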


>
> > If we're talking about non-TLS stuff, then doing this kind of simple
> > thing on the NIC doesn't seem that hard. It is roughly the same thing.
>
> Not for an intermediary. I have to receive muxed streams from multiple
> servers and deliver them over muxed connections to multiple clients.
>
> Please consider this simple use case :
>
>    - requests for /img /css /js /static go to server 1
>    - requests for /video go to server 2
>    - requests for other paths go to server 3
>
> Clients send their requests over the same connection. The load balancer
> has several connections to servers 1 and 3 behind and forwards clients'
> requests over these connections to retrieve objects. In practice, a
> client will go first to server 3 (GET /) then to server 1 (retrieve
> page components) then to server 2 over a fresh connection and stay
> there for a long time. There's no place for NIC-based acceleration
> here because this MUX pulls data from one side and transfers it to
> the other side in small chunks. When the video starts, if we had the
> ability to splice large chunks from server 2 to client, there would
> be a real benefit. With the small chunks, the benefits disappear and
> we're back to doing the same recv+memcpy()+send job as for the other
> servers (double to triple copy instead of zero).
>
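
For concreteness, the routing in that use case amounts to a prefix match.
A minimal sketch follows; the backend names and the rule that a prefix
only matches on a path-segment boundary are assumptions made for the
example, not something stated above.

```python
# Prefix table taken from the use case above; backend names are placeholders.
ROUTES = [
    ("/img", "server1"),
    ("/css", "server1"),
    ("/js", "server1"),
    ("/static", "server1"),
    ("/video", "server2"),
]
DEFAULT_BACKEND = "server3"  # "requests for other paths go to server 3"


def pick_backend(path):
    """Return the backend for a request path, falling back to the default."""
    for prefix, backend in ROUTES:
        # Assumption: match whole path segments, so "/jsfoo" is not "/js".
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return DEFAULT_BACKEND
```

A client fetching "/", then "/css/site.css", then "/video/a.mp4" would
touch server 3, server 1, and server 2 in turn, which is the sequence
described in the quoted text.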

If the hardware were similar to TSO hardware, that case would work just as
it does for TSO.
You send data to the NIC via the kernel, potentially with a different API
(e.g. socket per stream), and the NIC (or kernel) fragments into frames.
It'd work just fine.
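
A rough sketch of the fragmentation step such a NIC or kernel would
perform, assuming one buffer per stream and simple round-robin
scheduling. No such API exists as far as this thread states; the function
name and the scheduling policy are purely illustrative.

```python
from collections import deque

MAX_FRAME_PAYLOAD = 16384  # assumed frame limit, as discussed in this thread


def fragment_round_robin(streams, max_payload=MAX_FRAME_PAYLOAD):
    """Yield (stream_id, chunk) pairs, one frame's worth per stream in turn.

    `streams` maps a stream id to the bytes the application queued on that
    stream's (hypothetical) socket.
    """
    pending = deque((sid, memoryview(data))
                    for sid, data in streams.items() if data)
    while pending:
        sid, view = pending.popleft()
        yield sid, bytes(view[:max_payload])
        rest = view[max_payload:]
        if len(rest):
            # Stream still has data: send it to the back of the queue.
            pending.append((sid, rest))
```

The application writes large buffers per stream, and only this lower
layer sees the frame boundaries, in the same way that TSO hides segment
boundaries from the sender.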


> Sure, this is not something a common server or user-agent is even able
> to detect under regular load. But you (and you particularly) know as well
> as I do that intermediaries need to shave off everything possible to avoid
> wasting time doing dumb things such as copying small data or visiting
> the same byte twice, etc...
>
>
True (you've seen me push to reduce memory requirements), though my
motivation w.r.t. CPU consumption is to balance any change in resource
requirements against the change in the user's experience.


> To be clear, this is not the end of the world; it will probably only
> lead to a significant part of the internet not deploying what was designed
> here, waiting for H3 to appear, when we could easily make it worthwhile
> for them to consider the option.
>
> Regards,
> Willy
>
>

-=R

Received on Wednesday, 25 June 2014 20:57:00 UTC