Regarding initial dictionary for HTTP http://tools.ietf.org/html/draft-ietf-httpbis-http2-00#ref-UDELCOMPRESSION

Hello,
I've just started to follow all those HTTP2.0 SPDY talks.
I have a good knowledge of how Deflate/Zlib compression works and when I look at the chapter  2.6.10.1.  Compression of SPDY, I'm some what surprised by the content of SPDY_dictionary_txt[].

First of all, I suppose that HTTP 2.0 will not identify itself as "HTTP/1.1".
     0x00, 0x08, 0x48, 0x54, 0x54, 0x50, 0x2f, 0x31,  \\ - - H T T P - 1
     0x2e, 0x31, 0x00, 0x00, 0x00, 0x03, 0x75, 0x72,  \\ - 1 - - - - u r

And it favors old image formats:
     0x68, 0x74, 0x6d, 0x6c, 0x2c, 0x69, 0x6d, 0x61,  \\ h t m l - i m a
     0x67, 0x65, 0x2f, 0x70, 0x6e, 0x67, 0x2c, 0x69,  \\ g e - p n g - i
     0x6d, 0x61, 0x67, 0x65, 0x2f, 0x6a, 0x70, 0x67,  \\ m a g e - j p g
     0x2c, 0x69, 0x6d, 0x61, 0x67, 0x65, 0x2f, 0x67,  \\ - i m a g e - g
     0x69, 0x66, 0x2c, 0x61, 0x70, 0x70, 0x6c, 0x69,  \\ i f - a p p l i

There are no provision made for SVG or WebP MIME types, since the dictionary itself is based on data gathered a couple of years ago it does not look forward to catch new trends (HTML5 video types...)


There are a few interesting statements in this document:
http://www.eecis.udel.edu/~amer/PEL/poc/pdf/SPDY-Fan.pdf

"3.3 Frequency to Update the Initial Dictionary
An update of zlib’s initial dictionary for SPDY requires modifying both peer end points (client browser and server), so changes are problematic. If the initial dictionary is not permitted to change or can only change infrequently (e.g., every few years), we want today's initial dictionary to be appropriate in the future."

"4 Methodology to Derive the Initial Dictionary
...
4 Build an initial dictionary for HTTP replies based on the weight of each keyword in HTTP reply headers. Build an initial dictionary for HTTP requests based on the count of each keyword in HTTP request headers. Since the SPDY designers preferred to have just a single dictionary for both headers and replies, we then concatenate these two dictionaries. For HTTP header field names and some fixed length values (e.g., HTTP/1.1, 200 OK), we include the 32- bit length prefix before the word. (Note: a separate initial dictionary optimized for either HTTP requests or replies would likely provide better compression, but would in turn make SPDY’s implementation more complex.)"

I dont' see why it would be more complex to use two dictionaries (one is for what's going in and the other for what's going out, just cross them client/server based), if it leads to better compression this is the way to go, it should even be faster since smaller dictionaries lead to faster search.

And why wouldn't it be possible to include an update system? Servers would send new dictionaries every year or every month to catch new trends.
A transport protocol that walls itself in based on past statistics really lacks visionary qualities, I think this should not be overlooked.

In the zlib header there's a way to select different dictionaries:
RFC 1950       ZLIB Compressed Data Format Specification

      FDICT (Preset dictionary)
         If FDICT is set, a DICT dictionary identifier is present
         immediately after the FLG byte. The dictionary is a sequence of
         bytes which are initially fed to the compressor without
         producing any compressed output. DICT is the Adler-32 checksum
         of this sequence of bytes (see the definition of ADLER32
         below).  The decompressor can use this identifier to determine
         which dictionary has been used by the compressor.


At first glance the methodology used to produce the initial dictionary seems limited.

Apparently the dictionary is not sorted based on the frequency of the LZ77 matches, high frequency matches should be moved to the end of the dictionary to reduce the distance part of the LZ77 (length, distance) pair.
There are a lot of redundancies inside the dictionary itself, the goal is probably to catch an entire keyword name in a single LZ77 match, in the other hand this increases the dictionary size and requires longer distance and length references. Another approach would be to remove redundancies from the dictionary this could even produce overlapping sequences like "webpngif" that could be used to match "webp", "png" and "gif", it would replace a single match by two or three shorter ones (this will not necessarily produce a shorter compressed stream, but as apparently not been tried).

Regards
-- 
Frédéric Kayser

Received on Friday, 30 November 2012 11:33:24 UTC