RE: multiplexing -- don't do it from Peter L on 2012-03-30 (ietf-http-wg@w3.org from January to March 2012)

From: Peter L <bizzbyster@gmail.com>
Date: Fri, 30 Mar 2012 16:36:47 -0400
To: <ietf-http-wg@w3.org>
Message-ID: <03a001cd0eb4$d4d0ac80$7e720580$@gmail.com>
 

"Prioritization is key in efficiently utilizing any form of
parallel-requests. It happens to be much, much more difficult if you're
using separate connections because there is no guarantee that they go to the
same machine. As a result, your background image gets to clog the pipe
instead of your browser getting the HTML and JS it needs to do initial
layout, rendering, and resource discovery."

The fact that SPDY enables the browser to tag a request as high priority
when a resource is in fact blocking is awesome. This should be built into
HTTP 2.0 with or without the multiplexing piece. It allows either the SPDY
server or other intermediaries in the network that are doing shaping
(assuming the HTTP is not opaque) to make better decisions, and only the
browser has this information.

 

1 connection means that it becomes trivial to do prioritization properly.

I know that it was argued that 1 connection makes it more difficult because
there is buffering and you can't revoke a write() to a socket. Experience so
far hasn't borne this fear out, and even if it were true, if you can get
some idea about the depth of your buffer, you can pace your output to ensure
that you're never adding too much buffer-depth at any point in time.
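
As an illustration of the pacing idea above, here is a minimal sketch in Python. It is not taken from SPDY or from any server implementation; the class name, the byte cap, and the on_drained() callback are all hypothetical. It assumes the sender can get some estimate of how many bytes are still sitting in the buffer, which is the "idea about the depth of your buffer" mentioned above.

```python
import heapq
from itertools import count


class PacedPriorityWriter:
    """Queue frames by priority and release them only while the estimated
    buffer depth stays under a cap. All names here are hypothetical."""

    def __init__(self, max_buffered_bytes):
        self.max_buffered_bytes = max_buffered_bytes
        self.buffered_bytes = 0   # estimate of bytes already handed to the socket
        self._heap = []
        self._seq = count()       # tie-breaker: FIFO order within one priority

    def enqueue(self, priority, frame):
        # Lower number = more urgent (e.g. 0 for HTML/JS, 7 for a background image).
        heapq.heappush(self._heap, (priority, next(self._seq), frame))

    def next_frames(self):
        """Return the frames that may be written now, most urgent first."""
        out = []
        while self._heap:
            frame = self._heap[0][2]
            if self.buffered_bytes + len(frame) > self.max_buffered_bytes:
                break             # writing more would deepen the buffer too far
            heapq.heappop(self._heap)
            self.buffered_bytes += len(frame)
            out.append(frame)
        return out

    def on_drained(self, nbytes):
        # Call when the transport reports that nbytes have left the buffer.
        self.buffered_bytes = max(0, self.buffered_bytes - nbytes)
```

Because only a small amount is ever committed to the socket, a high-priority frame that arrives late waits behind at most max_buffered_bytes of lower-priority data, rather than behind everything that was queued.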

Have you done much testing with packet loss? As expressed earlier, my
biggest concern is head-of-line blocking when low priority objects are at
the head of the line. But it makes sense that you could mitigate this
somewhat by making sure you don't buffer too much in the network.  Also, the
approach I like of allowing a larger pool of persistent connections per
domain increases the likelihood that a SYN or SYN-ACK will be dropped, which
results in the disastrous 3 second reconnect delay.

Re: no decoder plugin for Wireshark.

"Seems like a fine tradeoff for the latency savings that we get on low-BW
links, though."

I agree if SPDY could be applied to problematic links only. Otherwise it's a
lot to give up if you spend your days troubleshooting networks and
application performance.

"Again, multiplexing as it has been done with SPDY, since it is on one
connection, requires less synchronization than it does for HTTP (the kernel
has to do synchronization of various interactions as you increase the FD
count) to handle the increased parallelism. Even in the case where you
decide to use more than a constant number of threads per core (anything else
will suffer in throughput compared to that design on current kernels and
hardware, in my experience), you will still have less contention because
you can manage it yourself with domain knowledge about the connection, user,
problem, method, etc. that the kernel isn't privy to and should probably
never be privy to."

Are you saying a SPDY-enabled web server outperforms one without SPDY for a
given web page b/c it converts many TCP connections into a single connection
and this actually decreases kernel level thread contention more than the
contention added by SPDY's user mode serialization logic? I wonder if this
is what the adopters (Google and Twitter) have found so far.

Thanks,

Peter

 

 

On Fri, Mar 30, 2012 at 6:17 PM, Brian Pane <brianp@brianp.net> wrote:

On Friday, March 30, 2012, Peter L wrote:

Responding to Ross and Brian's posts mainly here...

 

I agree that increasing concurrent connections will increase the burden on
web servers and that is a serious issue for sure but since so many sites are
already working around the 6 per domain limit via sharding, most site owners
are willing to accept higher numbers of TCP connections if it results in
faster page loads. Prevalence of domain sharding is a kind of vote in the
direction of increasing the per domain limit.

 

What I've found empirically is that most sites suffer from request
serialization--i.e., insufficient parallelism--despite all the investment in
domain sharding and image spriting. My article in last December's PerfPlanet
calendar presents the data.

 

Prioritization is key in efficiently utilizing any form of parallel-requests.
It happens to be much, much more difficult if you're using separate
connections because there is no guarantee that they go to the same machine.
As a result, your background image gets to clog the pipe instead of your
browser getting the HTML and JS it needs to do initial layout, rendering,
and resource discovery.

 

1 connection means that it becomes trivial to do prioritization properly.

I know that it was argued that 1 connection makes it more difficult because
there is buffering and you can't revoke a write() to a socket. Experience so
far hasn't borne this fear out, and even if it were true, if you can get
some idea about the depth of your buffer, you can pace your output to ensure
that you're never adding too much buffer-depth at any point in time.

 

 

Transparency:

* SPDY compresses HTTP headers using an LZ history based algorithm,
which means that previous bytes are used to compress subsequent bytes. So
any packet capture that does not include all the traffic sent over that
connection will be completely opaque -- no mathematical way to decode the
HTTP. Even with all the traffic, a stream decoder will be a tricky thing to
build b/c packets depend on each other.
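
To make the history dependence concrete, here is a small sketch using Python's zlib module. It is a simplification and an assumption on my part: it uses a plain raw-deflate stream with a sync flush after each header block, and it ignores SPDY's actual framing and preset dictionary. The function names are hypothetical.

```python
import zlib


def compress_header_blocks(blocks):
    """Compress each block with ONE shared deflate stream, flushing after each."""
    comp = zlib.compressobj(wbits=-15)  # raw deflate, no container header
    return [comp.compress(b) + comp.flush(zlib.Z_SYNC_FLUSH) for b in blocks]


def decode_in_order(chunks):
    """Inflate the chunks with one shared decompressor, as a full capture allows."""
    decomp = zlib.decompressobj(wbits=-15)
    return [decomp.decompress(c) for c in chunks]


def try_decode_alone(chunk):
    """Try to inflate a single chunk with no earlier history; None on failure."""
    try:
        return zlib.decompressobj(wbits=-15).decompress(chunk)
    except zlib.error:
        return None
```

With two similar header blocks, the second compressed chunk is much smaller than the first because it mostly consists of back-references into the first. That is also why try_decode_alone() cannot recover it: the bytes those references point at were never seen by the fresh decompressor.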

 

I know there's a SPDY decoder plugin for Wireshark, but I'll defer to people
more knowledgeable about packet analysis tools to cover that area.

 

The OP is right about this, btw. Technically it is possible that you've
flushed the window after 2k of completely new data, but there is no
guarantee and so interpreting a stream in the middle may be extremely
difficult.

 

Seems like a fine tradeoff for the latency savings that we get on low-BW
links, though.

 

 

* Loss of transparency impacts intermediary devices (reverse
proxies, caches, layer 7 switches, load balancers) as much as it does packet
capture analysis. For load balancing, multiplexing requires maintaining
state from one request to the next so individual object requests from a
given user will need to be handled by the same de-multiplexing server.

 

For load balancing, you just have to ensure that all packets from the same
TCP connection go to the same place for L6-7 decoding. But that's already
required for HTTP/1.x.  A L7 proxy or load balancer that terminates either
HTTP or SPDY is then free to dispatch successive requests from the same
client to different backend servers.

 

Note that 'the same place' probably means the same IP, but there is no
assurance that the same IP will mean the same machine or network adapter.
With multiplexing over TCP (or any equivalent like SCTP), you're either
guaranteed or at least much more likely to get locality for that user on one
loadbalancer or machine.

 

Using fewer connections decreases vastly the amount of state necessary to do
proper demux to the right server, and, as noted before, allows the LB or
server to trivially do prioritization.

 

 

 In general, increasing session orientation reduces the scalability of the
overall service. Also, failover is less graceful as a load balancer will
want to be more sure that the previously used server is in fact unavailable
before routing to a new server.

* SSL kills transparency at the network level completely, but I also think
that SSL should be considered orthogonal to performance, so that site owners
can make a decision based on the cost, security, and performance tradeoffs of
going to all-encrypted traffic. So while I agree it's related, it seems like
we have to consider these things independently.

 

Increased Object Processing Latency:

* Multiplexing requires that objects are encoded serially -- encode
(Object1), encode (Object2), encode (Object3) -- and then decoded in that
same order.

 

Object1, Object2, and Object3 need not be entire HTTP messages, though. In
SPDY, unlike pipelined HTTP/1.1, a server can interleave little chunks of
different responses.  That's what I consider SPDY's key design concept: not
just multiplexing, but interleaving.
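
A rough sketch of what interleaving means, in Python. This is an illustration only: the (stream_id, chunk) tuples stand in for frames, the round-robin order is an assumption (a real server would pick the next chunk by priority), and none of the names come from the SPDY draft.

```python
from itertools import zip_longest


def chunked(body, size):
    """Split a response body into chunks of at most `size` bytes."""
    return [body[i:i + size] for i in range(0, len(body), size)]


def interleave(responses, chunk_size):
    """Yield (stream_id, chunk) frames round-robin across all responses."""
    per_stream = [
        [(stream_id, c) for c in chunked(body, chunk_size)]
        for stream_id, body in responses.items()
    ]
    for round_ in zip_longest(*per_stream):
        for frame in round_:
            if frame is not None:   # this stream has already finished
                yield frame


def reassemble(frames):
    """Receiver side: rebuild each body from its tagged chunks."""
    bodies = {}
    for stream_id, chunk in frames:
        bodies[stream_id] = bodies.get(stream_id, b"") + chunk
    return bodies
```

The point is that a short response (stream 3 in a mix of 1, 3, and 5, say) completes after its first chunk instead of waiting for a longer response ahead of it to finish, which is what pipelined HTTP/1.1 would force.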

 

Multiplexing doesn't have any effect on encoding/decoding unless you're
using something that requires serialization (such as the gzip compressor in
SPDY). In the case of SPDY, if you can jettison the header-stream
compression or, better, find some compression method that doesn't have as
stringent a serialization requirement, you can avoid this issue.

 

Note anyway that the serialization requirement that gzip imposes in SPDY
only affects the headers, and not the data for the request or response.

 

 

 On a multi-core server, the three objects arrive truly concurrently, but
due to multiplexing Object2 and Object3 will need to wait while Object1 is
encoded. For SPDY, that encode step involves running an LZ-type coding
function including searching the recent bytes for matches so even on an
unloaded server this can add ~milliseconds of latency. 

 

The last time I looked at gzip perf, the cost was on the order of 50 clock
cycles/byte on x86_64. (Anybody who's studied LZ perf more deeply, please
jump in with more precise numbers.) Given 1KB of response headers, that
works out to ~25 microseconds of latency at 2GHz, not milliseconds.
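
Spelled out, the estimate above is just bytes times cycles per byte divided by clock rate. The helper below is a hypothetical convenience for redoing the arithmetic with other numbers; the 50 cycles/byte figure is the rough one quoted above, not a measurement.

```python
def compression_latency_us(header_bytes, cycles_per_byte, clock_hz):
    """Estimated time to compress `header_bytes` of headers, in microseconds."""
    cycles = header_bytes * cycles_per_byte
    return cycles / clock_hz * 1e6


# 1KB of headers at 50 cycles/byte on a 2GHz core:
print(round(compression_latency_us(1024, 50, 2e9), 1))  # 25.6
```

Even at ten times the per-byte cost this stays in the hundreds of microseconds, so the estimate would have to be off by well over an order of magnitude to reach milliseconds.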

 

Having worked at a load balancer company in the past, I do agree that 25us
is a material CPU cost, but it's nowhere near milliseconds.

 

* Multiplexing creates the need for session state. Access to this
state needs to be synchronized; thread synchronization reduces parallelism
and so impacts server scalability and per-object latency.

There are a lot of ways to skin a cat. One need not always resort to
critical-section based synchronization. You could use epoll() instead, for
instance, and process the client's connection within one thread.
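
A minimal sketch of that single-threaded approach, using Python's selectors module (which picks epoll on Linux where available). The per-connection state dict and the function name are hypothetical, and a real server would parse frames rather than just accumulate bytes.

```python
import selectors
import socket


def serve_once(sel, state):
    """Run one pass of the event loop, updating per-connection state.

    Only the thread running this loop ever touches `state`, so no
    critical sections or mutexes are needed around it.
    """
    for key, _events in sel.select(timeout=1.0):
        conn = key.fileobj
        data = conn.recv(4096)
        if data:
            fd = conn.fileno()
            state[fd] = state.get(fd, b"") + data
        else:
            # Peer closed the connection: stop watching it.
            sel.unregister(conn)
            conn.close()


if __name__ == "__main__":
    client, server_side = socket.socketpair()
    sel = selectors.DefaultSelector()
    sel.register(server_side, selectors.EVENT_READ)
    client.sendall(b"hello")
    state = {}
    serve_once(sel, state)
    print(state[server_side.fileno()])
```

To use more cores with this design you would run one such loop per thread and pin each connection to one loop, so the multiplexing state for a given client still never crosses threads.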

* CPU gains are increasingly achieved by adding cores rather than by making
existing cores go faster. So processes that can run concurrently (such as
those serving more concurrent TCP connections) are friendly to these
advances, and multiplexing goes in the opposite direction -- requiring thread
synchronization and so increasing serialization and context switching.

 

With separate connections, though, you still have a serialization bottleneck
at the NIC. The locking needed to serialize writes to the network doesn't go
away if you forgo multiplexing in favor of lots of connections; it just
moves to the other side of the kernel/userspace boundary.

 

Most of the improvements you would see here for the multi-CPU case should
come in the form of multiqueue NICs which hash based on the TCP tuple (or
similar) and stably select a queue which feeds a non-overlapping subset of
CPUs to do processing of interrupts, etc.
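
The queue selection described above can be sketched as a stable hash of the TCP 4-tuple. This is only an illustration: CRC32 is a stand-in chosen for the example (real NICs doing receive-side scaling typically use a Toeplitz hash with a configured key), and the function name is hypothetical.

```python
import ipaddress
import struct
import zlib


def select_rx_queue(src_ip, src_port, dst_ip, dst_port, num_queues):
    """Map a TCP 4-tuple to a receive queue; same tuple, same queue."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    # CRC32 stands in for the NIC's hash function (an assumption).
    return zlib.crc32(key) % num_queues
```

Every packet of one connection lands on the same queue and therefore on the same subset of CPUs, while six connections from one browser may well be spread over six different queues.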

 

Again, multiplexing as it has been done with SPDY, since it is on one
connection, requires less synchronization than it does for HTTP (the kernel
has to do synchronization of various interactions as you increase the FD
count) to handle the increased parallelism. Even in the case where you
decide to use more than a constant number of threads per core (anything else
will suffer in throughput compared to that design on current kernels and
hardware, in my experience), you will still have less contention because
you can manage it yourself with domain knowledge about the connection, user,
problem, method, etc. that the kernel isn't privy to and should probably
never be privy to.

 

-=R

 

 

-Brian
Received on Friday, 30 March 2012 20:37:20 UTC