RE: multiplexing -- don't do it

I agree that increasing concurrent connections will increase the burden on
web servers, and that is a serious issue for sure. But since so many sites are
already working around the 6-per-domain limit via sharding, most site owners
are willing to accept higher numbers of TCP connections if it results in
faster page loads. The prevalence of domain sharding is a kind of vote in
favor of increasing the per-domain limit.

What I've found empirically is that most sites suffer from request
serialization--i.e., insufficient parallelism--despite all the investment in
domain sharding and image spriting. My article in last December's PerfPlanet
calendar presents the data.

 

Thanks for pointing me to your article. It's cool the way you attempt to
filter out content interdependencies by looking at images only, though I'm
sure that's not always accurate b/c page load logic can be built to wait for
images to load before issuing subsequent requests. It might be interesting to
include all content types but count outstanding transactions per host, and
filter your data down to just the serialized sequences that occur while 6
requests are outstanding simultaneously. In any case, your result confirms what SPDY's test
results suggest -- browsers are still bumping up against the per domain
limits.
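
Roughly what I have in mind, as a Python sketch (the request-tuple format,
timings, and threshold here are just placeholders, not tied to any particular
tool's output):

    from collections import defaultdict

    def hosts_at_connection_limit(requests, limit=6):
        # requests: iterable of (host, start_ms, end_ms) tuples, e.g. pulled
        # from a HAR file. Returns the hosts where `limit` requests were
        # outstanding at the same time, i.e. where the browser was actually
        # pinned against the per-domain connection limit.
        events = defaultdict(list)
        for host, start, end in requests:
            events[host].append((start, +1))   # request issued
            events[host].append((end, -1))     # response finished
        limited = set()
        for host, evs in events.items():
            outstanding = peak = 0
            for _, delta in sorted(evs):
                outstanding += delta
                peak = max(peak, outstanding)
            if peak >= limit:
                limited.add(host)
        return limited

    # Example: three overlapping fetches from one host, one isolated fetch from another.
    sample = [("cdn.example.com", 0, 120), ("cdn.example.com", 10, 130),
              ("cdn.example.com", 20, 140), ("www.example.com", 0, 50)]
    print(hosts_at_connection_limit(sample, limit=3))   # {'cdn.example.com'}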

 

----

On a multi-core server, the three objects arrive truly concurrently, but due
to multiplexing Object2 and Object3 will need to wait while Object1 is
encoded. For SPDY, that encode step involves running an LZ-type coding
function, including searching the recent bytes for matches, so even on an
unloaded server this can add ~milliseconds of latency.

The last time I looked at gzip perf, the cost was on the order of 50 clock
cycles/byte on x86_64. (Anybody who's studied LZ perf more deeply, please
jump in with more precise numbers.) Given 1KB of response headers, that
works out to ~25 microseconds of latency at 2GHz, not milliseconds.
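
Spelled out, that back-of-envelope estimate looks like this (the 50
cycles/byte figure is an estimate, not a measurement):

    # Rough arithmetic behind the ~25 microsecond figure above.
    cycles_per_byte = 50          # estimated gzip cost on x86_64
    header_bytes = 1024           # ~1 KB of response headers
    clock_hz = 2e9                # 2 GHz core
    latency_us = cycles_per_byte * header_bytes / clock_hz * 1e6
    print(latency_us)             # ~25.6 microseconds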

 

Having worked at a load balancer company in the past, I do agree that 25us
is a material CPU cost, but it's nowhere near milliseconds.

 

I ran gzip at its default speed setting of 6 on 100 MB of highly compressible
text (similar to HTTP) and it finished in about 9 seconds on a 2.4 GHz CPU, so
you are right that it only takes about 90us for a 1 KB response header. And
this would be better at faster gzip settings. I was thinking of the time
required to gzip the objects themselves and not just the headers, and I agree
that the cost of gzipping the headers alone is small.
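
For anyone who wants to reproduce that measurement, a rough Python equivalent
(zlib level 6 corresponds to gzip's default; the exact numbers will vary with
CPU and input, so treat them as ballpark):

    import time, zlib

    sample = (b"GET /images/logo.png HTTP/1.1\r\nHost: www.example.com\r\n"
              b"Accept: text/html,application/xhtml+xml\r\n"
              b"User-Agent: Mozilla/5.0\r\nCookie: session=abc123\r\n\r\n")
    data = sample * (100 * 1024 * 1024 // len(sample))   # ~100 MB of header-like text

    start = time.perf_counter()
    zlib.compress(data, 6)                               # level 6 = gzip's default
    elapsed = time.perf_counter() - start

    per_kb_us = elapsed / (len(data) / 1024) * 1e6
    print("%.1f s for %d MB, ~%.0f us per KB" % (elapsed, len(data) // 2**20, per_kb_us))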

 

----

.         Multiplexing creates the need for session state. Access to this
state needs to be synchronized; thread synchronization reduces parallelism
and so impacts server scalability and per-object latency.

.         CPU gains are increasingly achieved by adding cores, not by making
existing cores go faster. So processes that can run concurrently (such as
increasing concurrent TCP connections) are friendly to these advances, while
multiplexing goes in the opposite direction -- requiring thread
synchronization and so increasing serialization and context switching.

With separate connections, though, you still have a serialization bottleneck
at the NIC. The locking needed to serialize writes to the network doesn't
go away if you forego multiplexing in favor of lots of connections; it just
moves to the other side of the kernel/userspace boundary.

 

I hadn't thought of it this way. At the NIC level, does it matter that
packets are associated with different TCP connections? I'd think the
serialization to the network would be insensitive to L4 headers but don't
know enough about how this works.

 

Thanks,

 

Peter

 

 

From: Brian Pane [mailto:brianp@brianp.net] 
Sent: Friday, March 30, 2012 12:17 PM
To: Peter L
Cc: ietf-http-wg@w3.org
Subject: Re: multiplexing -- don't do it

 

On Friday, March 30, 2012, Peter L wrote:

Responding to Ross and Brian's posts mainly here...

 

I agree that increasing concurrent connections will increase the burden on
web servers, and that is a serious issue for sure. But since so many sites are
already working around the 6-per-domain limit via sharding, most site owners
are willing to accept higher numbers of TCP connections if it results in
faster page loads. The prevalence of domain sharding is a kind of vote in
favor of increasing the per-domain limit.

 

What I've found empirically is that most sites suffer from request
serialization--i.e., insufficient parallelism--despite all the investment in
domain sharding and image spriting. My article in last December's PerfPlanet
calendar presents the data.

 

Transparency:

.         SPDY compresses HTTP headers using an LZ history based algorithm,
which means that previous bytes are used to compress subsequent bytes. So
any packet capture that does not include all the traffic sent over that
connection will be completely opaque -- no mathematical way to decode the
HTTP. Even with all the traffic, a stream decoder will be a tricky thing to
build b/c packets depend on each other.
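
A minimal illustration of that history dependence, using plain zlib as a
stand-in for SPDY's header compressor (SPDY additionally seeds the stream with
a predefined dictionary, which is omitted here):

    import zlib

    comp = zlib.compressobj()
    req1 = b"GET / HTTP/1.1\r\nHost: example.com\r\nAccept: */*\r\n\r\n"
    req2 = b"GET /logo.png HTTP/1.1\r\nHost: example.com\r\nAccept: */*\r\n\r\n"
    block1 = comp.compress(req1) + comp.flush(zlib.Z_SYNC_FLUSH)
    block2 = comp.compress(req2) + comp.flush(zlib.Z_SYNC_FLUSH)   # compressed against block1's history

    # A decoder that saw the whole connection recovers both requests...
    decoder = zlib.decompressobj()
    print(decoder.decompress(block1 + block2))

    # ...but a capture that starts at block2 cannot be decoded.
    try:
        print(zlib.decompressobj().decompress(block2))
    except zlib.error as err:
        print("undecodable without prior history:", err)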

 

I know there's a SPDY decoder plugin for Wireshark, but I'll defer to people
more knowledgeable about packet analysis tools to cover that area.

 

.         Loss of transparency impacts intermediary devices (reverse
proxies, caches, layer 7 switches, load balancers) as much as it does packet
capture analysis. For load balancing, multiplexing requires maintaining
state from one request to the next, so individual object requests from a
given user will need to be handled by the same de-multiplexing server.

 

For load balancing, you just have to ensure that all packets from the same
TCP connection go to the same place for L6-7 decoding. But that's already
required for HTTP/1.x. An L7 proxy or load balancer that terminates either
HTTP or SPDY is then free to dispatch successive requests from the same
client to different backend servers.
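
As a toy sketch of that distinction (backend names and the hashing scheme
below are made up for illustration): an L4 balancer can only pin the whole TCP
connection to one backend, while a terminating L7 proxy can dispatch each
decoded request independently.

    BACKENDS = ["app1.internal", "app2.internal", "app3.internal"]

    def l4_pick(src_ip, src_port, dst_ip, dst_port):
        # An L4 balancer can only hash the connection 4-tuple, so every request
        # on the connection lands on the same backend.
        return BACKENDS[hash((src_ip, src_port, dst_ip, dst_port)) % len(BACKENDS)]

    class L7Proxy:
        # A proxy that terminates HTTP or SPDY sees individual requests and can
        # dispatch each one independently (round-robin here, for simplicity).
        def __init__(self):
            self.counter = 0
        def pick(self, request_path):
            backend = BACKENDS[self.counter % len(BACKENDS)]
            self.counter += 1
            return backend

    conn = ("198.51.100.7", 52311, "203.0.113.9", 443)
    print([l4_pick(*conn) for _ in range(3)])                      # same backend every time
    proxy = L7Proxy()
    print([proxy.pick(p) for p in ("/", "/logo.png", "/app.js")])  # spread across backends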

 

 In general, increasing session orientation reduces the scalability of the
overall service. Also, failover is less graceful as a load balancer will
want to be more sure that the previously used server is in fact unavailable
before routing to a new server.

.         SSL kills transparency at the network level completely, but I also
think that SSL should be considered orthogonal to performance, so that site
owners can make a decision based on the cost, security, and performance
tradeoffs of going to all-encrypted traffic. So while I agree it's related,
it seems like we have to consider these things independently.

 

Increased Object Processing Latency:

.         Multiplexing requires that objects are encoded serially -- encode
(Object1), encode (Object2), encode (Object3) -- and then decoded in that
same order.

 

Object1, Object2, and Object3 need not be entire HTTP messages, though. In
SPDY, unlike pipelined HTTP/1.1, a server can interleave little chunks of
different responses.  That's what I consider SPDY's key design concept: not
just multiplexing, but interleaving.
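
A tiny sketch of that interleaving idea (the frame size and round-robin
scheduling are illustrative only, not SPDY's actual framing rules):

    from collections import deque

    def interleave(responses, chunk_size=8):
        # responses: dict of stream_id -> response body. Yields (stream_id, chunk)
        # frames round-robin, so no single response monopolizes the connection.
        queue = deque(responses.items())
        while queue:
            stream_id, body = queue.popleft()
            yield stream_id, body[:chunk_size]
            if body[chunk_size:]:
                queue.append((stream_id, body[chunk_size:]))

    frames = list(interleave({1: b"<html>...much HTML...</html>", 3: b"PNG image bytes", 5: b"body{}"}))
    print(frames)   # frames from streams 1, 3, and 5 alternate rather than running in series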

 

 On a multi-core server, the three objects arrive truly concurrently, but
due to multiplexing Object2 and Object3 will need to wait while Object1 is
encoded. For SPDY, that encode step involves running an LZ-type coding
function, including searching the recent bytes for matches, so even on an
unloaded server this can add ~milliseconds of latency.

 

The last time I looked at gzip perf, the cost was on the order of 50 clock
cycles/byte on x86_64. (Anybody who's studied LZ perf more deeply, please
jump in with more precise numbers.) Given 1KB of response headers, that
works out to ~25 microseconds of latency at 2GHz, not milliseconds.

 

Having worked at a load balancer company in the past, I do agree that 25us
is a material CPU cost, but it's nowhere near milliseconds.

 

.         Multiplexing creates the need for session state. Access to this
state needs to be synchronized; thread synchronization reduces parallelism
and so impacts server scalability and per-object latency.

.         CPU gains are increasingly achieved by adding cores, not by making
existing cores go faster. So processes that can run concurrently (such as
increasing concurrent TCP connections) are friendly to these advances, while
multiplexing goes in the opposite direction -- requiring thread
synchronization and so increasing serialization and context switching.

 

With separate connections, though, you still have a serialization bottleneck
at the NIC. The locking needed to serialize writes to the network doesn't go
away if you forego multiplexing in favor of lots of connections; it just
moves to the other side of the kernel/userspace boundary.

 

-Brian

Received on Friday, 30 March 2012 19:50:19 UTC