Re: Dictionary Compression for HTTP (at Facebook)

On 09/01/2018 09:05 PM, Benjamin Kaduk wrote:
> One topic that came up during IESG review of draft-kucherawy-dispatch-zstd was
> whether/when third-party or standard dictionaries would become available and how
> dictionary IDs would be assigned for those cases (since at present, IIUC, the
> dictionary IDs would need to be pre-negotiated between the two parties).  No
> IANA registry was created at that time, but with a 4-byte dictionary identifier space
> to work with, it seems like there might be space to create a registry for dictionary
> IDs (including private use space, of course), and just publishing well-known
> dictionaries.

Yes, we continue to think about whether and how to produce a standard 
set of dictionaries for public consumption. Zstandard reserves 
dictionary IDs 1-32767 for that purpose.
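
For a registry of well-known dictionaries to work, each published
dictionary would carry one of those reserved IDs, and every frame
compressed against it would advertise that ID in its header, so the
receiver can look up the matching dictionary. As a sketch with the
existing zstd command-line tool (the ID and file names below are made
up, not assigned values):

    # train a dictionary, pinning a chosen dictionary ID into it
    zstd --train samples/* -o std-dict --dictID=100

    # frames produced this way carry ID 100 in their headers
    zstd -D std-dict payload.html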

Dictionaries become more effective when they are trained on a
narrower, more targeted set of content. A solution that lets site
operators build and use their own dictionaries would therefore enable
sufficiently motivated parties to achieve the best possible
compression, and Zstandard already provides tooling that makes it
easy to train and deploy such dictionaries.
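
Concretely, the basic workflow with the zstd command-line tool looks
something like this (the file names are hypothetical):

    # train a dictionary from a directory of representative samples
    zstd --train samples/*.html -o my-dict

    # compress and decompress with it; both sides need the same
    # dictionary
    zstd -D my-dict page.html
    zstd -D my-dict -d page.html.zst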

OTOH, distributing and storing dictionaries is not without cost, so a
large number of highly targeted dictionaries introduces its own
inefficiencies. Even in a world with custom dictionaries, then, we
think a standard set of dictionaries probably has utility. Site
operators who don't expect enough repeat traffic to amortize the cost
of distributing a custom dictionary, or who don't want to expend the
effort of building one, could simply use the standard set. And a
standard set would certainly enable shipping "batteries-included"
plugins for HTTP servers, lowering the barrier to adoption.

Building a standard set of dictionaries is not trivial, though. We
recently performed experiments training a set of dictionaries on a
dataset from the HTTP Archive[1]. We found that dictionary
performance degrades significantly over time: a dictionary trained on
2016 traffic and applied to 2018 traffic performs worse than a
dictionary trained on 2018 traffic does, with a compression ratio
loss of anywhere from one to five percent per year.
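
For anyone who wants to run a similar comparison, here is a rough
sketch using the command-line tool (directory names are hypothetical,
and this is not our exact methodology):

    # train one dictionary per year of samples
    zstd --train corpus-2016/* -o dict-2016
    zstd --train corpus-2018/* -o dict-2018

    # total compressed size of the 2018 corpus under each dictionary;
    # files are compressed individually, where dictionaries help most
    for d in dict-2016 dict-2018; do
      total=0
      for f in corpus-2018/*; do
        n=$(zstd -q -D "$d" -c "$f" | wc -c)
        total=$((total + n))
      done
      echo "$d: $total bytes"
    done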

So ideally, even in the context of a standard set of dictionaries, we
would want a way to update dictionaries, or introduce new ones, as
time goes on.

- Felix

[1] https://httparchive.org/