Re: Stuck in a train -- reading HTTP/2 draft.

On Wed, Jun 25, 2014 at 01:56:32PM -0700, Roberto Peon wrote:
> On Wed, Jun 25, 2014 at 1:22 PM, Willy Tarreau <w@1wt.eu> wrote:
> > TSO works if you feed it with enough data. That's precisely one point
> > where 16kB makes it totally useless, because the cost of preparing the
> > fragment descriptors for the NICs overrides the savings of cutting them
> > into a few packets in the TCP stack.
> >
> 
> Frame size != write() size, unless you're attempting to use Splice() (which
> often isn't cheaper, sadly).

That's clearly my intent (sorry if I didn't make it clear initially) and
for it to be cheaper, it needs to operate on large blocks.

> > Please consider this simple use case :
> >
> >    - requests for /img /css /js /static go to server 1
> >    - requests for /video go to server 2
> >    - requests for other paths go to server 3
> >
> > Clients send their requests over the same connection. The load balancer
> > has several connections to servers 1 and 3 behind and forwards clients'
> > requests over these connections to retrieve objects. In practice, a
> > client will go first to server 3 (GET /) then to server 1 (retrieve
> > page components) then to server 2 over a fresh connection and stay
> > there for a long time. There's no place for NIC-based acceleration
> > here because this MUX pulls data from one side and transfers it to
> > the other side in small chunks. When the video starts, if we had the
> > ability to splice large chunks from server 2 to client, there would
> > be a real benefit. With the small chunks, the benefits disappear and
> > we're back to doing the same recv+memcpy()+send job as for the other
> > servers (double to triple copy instead of zero).
> >
> 
> If the hardware was similar to TSO hardware, that case would work just as
> it does for TSO.
> You send data to the NIC via the kernel, potentially with a different API
> (e.g. socket per stream), and the NIC (or kernel) fragments into frames.
> It'd work just fine.

I'm not saying it does not work, I'm saying that there are two ways to
do the operations :
  - triple copy to make use of TSO and benefit from hardware acceleration,
    but at 100Gbps, the fewer copies the better ;

  - zero copy and many very small splice() calls that are worthless at
    these sizes.
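
To put rough numbers on the second option, here is a back-of-the-envelope
sketch. It assumes one splice() call per chunk and ignores frame header
overhead; the 1 MB "large chunk" size and the function name are illustrative
assumptions only, not something taken from the draft. At 100Gbps, 16kB chunks
mean roughly 760,000 calls per second, 64 times more than with 1 MB chunks.

```python
def chunks_per_second(link_bits_per_second: float, chunk_bytes: int) -> float:
    """Chunks needed per second to fill the link (payload only, no headers)."""
    return link_bits_per_second / 8 / chunk_bytes


LINK_100G = 100e9            # 100Gbps link rate, in bits per second
SMALL_CHUNK = 16 * 1024      # the 16kB size discussed above
LARGE_CHUNK = 1024 * 1024    # hypothetical 1 MB chunk, for comparison only

small = chunks_per_second(LINK_100G, SMALL_CHUNK)
large = chunks_per_second(LINK_100G, LARGE_CHUNK)

print(f"16kB chunks: {small:,.0f} per second")
print(f"1MB chunks:  {large:,.0f} per second")
print(f"ratio:       {small / large:.0f}x")
```

Each of those chunks carries a fixed per-call cost (syscall entry, descriptor
setup), which is why the chunk size matters more than the total byte count.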

> > Sure this is not something a common server or user-agent is even able
> > to detect at regular loads. But you (and you particularly) know like
> > me that intermediaries need to shave off everything possible to avoid
> > wasting time doing dumb things such as copying small data or visiting
> > the same byte twice, etc...
> >
> >
> True (you've seen me push to reduce memory requirements), though my
> motivation w.r.t. CPU consumption is to balance any change in resource
> requirements against the change in the user's experience.

I know, we've discussed this a lot in Paris. I'm not suggesting that we harm
user experience, just that we adjust the balance point so that, when it makes
sense, applications (and thus the user's experience) do not suffer too
much from some limited initial choices.

I do expect a lot of savings from mux on the front side for small objects
for normal sites. But they're not the only ones using my product and users
pushing for larger numbers in all dimensions are becoming more common every
day. I want to pass the 100G barrier because there's increasing pressure
for it from people deploying now for the next few years; it's not just for
the fun of exhibiting large numbers (which is really enjoyable, I confess).
And I really don't want H2 to force a step back into the 10G era :-/

Best regards,
Willy

Received on Wednesday, 25 June 2014 21:10:38 UTC