Re: Speed concern related to bit order in Huffman (was: I-D Action: draft-ietf-httpbis-header-compression-10.txt)

Here are some figures. I took an existing HPACK Huffman implementation from:
https://github.com/http2/http2-spec/wiki/Implementations
It looks like nghttp2 and H2O share some code at this level, especially the decoding table:
https://github.com/tatsuhiro-t/nghttp2/blob/master/lib/nghttp2_hd_huffman_data.c

I took H2O code and tweaked it to fit into FSE benchmark code.
I also wrote a bits-on-top encoder loosely based on the zlib code, along with its decoder counterpart (damn slow for now, since it does not use tables). And since I found that a full-blown Huffman encoder/decoder was a bit overkill in this case, I made some trials with a dumber variable-length encoder called Fastlane.

When it comes to measuring speed using small data samples, the problem is that the en/decoding code has hardly reached cruise speed before it already stops… Since headers are not supposed to hold megabytes of data, the sample used here is relatively small (183 bytes); larger samples would produce higher figures (but the overall results stay in line, just scaled).

hpackori is the H2O code, hpacktop is my code (bits-on-top). Since H2O uses precomputed tables for decoding it blows away my code for now, but when it comes to encoding speed it’s a bit different: noticeably faster on Intel, slightly slower on ARM. Bits-on-top is supposed to offer more flexibility on the decoding side, and from what I have seen the H2O code does basic bit I/O when encoding, therefore I would not draw conclusions at this point.
But what struck me most was the Huffman codes themselves: a small subset of ASCII (about 96 code points) enjoys relatively short codes, whereas the others quickly get very long ones. In fact, if you encounter too many of those long codes you are not going to compress your data but rather expand it, and you are probably better off recording that field uncompressed. Which leads to the question: how were the underlying statistics gathered? I mean, if a particular message is shorter when stored uncompressed, it should not take part in the overall stats, fostering a more precise picture for the codes that will effectively be stored in compressed form (a two-stage mechanism).
I’m even wondering why bother with a universal encoder that is tuned only for a small subset of ASCII, and this is where the idea for Fastlane came from. Fastlane is a dumb but fast variable-length encoder: it only knows 3 code lengths (5 bits, 7 bits and 8 bits) and can encode only 97 code points (17 of length 5, 40 of length 7 and again 40 of length 8). Since the remaining code points are not handled, a message holding one of them would have to be recorded uncompressed (or they could be recorded using 16 bits, by reducing the base code points to 96 and introducing an escape symbol).

Since it is much simpler than a full-blown Huffman encoder it also works a bit faster (no overly long codes to handle, it is quite happy with 16-bit logic), and decoding is way easier: the decoding speed reported here is without a lookup table (after reading the first 5 bits, the full length of the code is instantly known). A lookup table may still bring some speed gains, but it would have a larger memory footprint and waste some L1 cache. The caveat is that it will not always compress as well as a full Huffman encoder. At this point I’m curious to see the impact of having Huffman on or off in HTTP/2 on server CPU load (bit I/O is slow by nature, and a lot of short bit I/O bursts does not bode well in my opinion).

Intel C2D 1.6GHz (64-bits)

hpackori -z bromix
bromix           :       183 ->       140 (76.50%),   76.9 MB/s ,   61.8 MB/s

hpacktop -z bromix
bromix           :       183 ->       140 (76.50%),   92.5 MB/s ,   34.5 MB/s

fastlane -z bromix
bromix           :       183 ->       140 (76.50%),  102.8 MB/s ,   78.3 MB/s


ARM Cortex A8 800MHz

hpackori -z bromix
bromix           :       183 ->       140 (76.50%),   24.7 MB/s ,   22.8 MB/s
 
hpacktop -z bromix
bromix           :       183 ->       140 (76.50%),   23.2 MB/s ,   12.8 MB/s

fastlane -z bromix
bromix           :       183 ->       140 (76.50%),   40.2 MB/s ,   28.4 MB/s

Regards
Frédéric Kayser

On Dec 2, 2014, at 03:21, Roberto Peon <grmocg@gmail.com> wrote:

> The best way to influence is to show the difference :)
> I remember bemoaning that little-endian won, which requires such things for speed for bitwise operations.
> 
> -=R

Received on Friday, 12 December 2014 02:34:15 UTC