Re: Dictionary Compression for HTTP (at Facebook) from Felix Handte on 2018-08-24 (ietf-http-wg@w3.org from July to September 2018)

From: Felix Handte <felixh@fb.com>
Date: Fri, 24 Aug 2018 10:24:23 +0000
To: Mark Nottingham <mnot@mnot.net>, Jyrki Alakuijala <jyrki@google.com>
Cc: Charles McCathie-Neville <chaals@yandex-team.ru>, Evgenii Kliuchnikov <eustas@google.com>, Vlad Krasnov <vlad@cloudflare.com>, Nick Terrell <terrelln@fb.com>, Yann Collet <cyan@fb.com>, HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <652edc11-2d19-aef9-e3fd-ecb77ab47c1a@fb.com>
Jyrki,

Glad to hear it! We're excited too!

We look forward to playing with Brotli's new capabilities when they're available.

Yes, Zstd can accept unformatted buffers as dictionaries, which it uses as you describe. Additionally though, Zstd describes a format for structured dictionaries[1], which includes metadata in addition to an LZ77 prefix.

Mark,

I agree that hard thought is warranted about the security implications. Absolutely, if you have recommendations for who would be best to talk to, I'd welcome being pointed their way.

That was my mistake, I subscribed to the list from my personal email, but sent the message from my work email. since this was my first email to the list, it got captured into a review queue from which it hasn't yet escaped.

I wanted to briefly capture my understanding of where we're at. Broadly, it seems to me that there are fundamentally four questions that are relevant to this discussion:

(1) What are the eligible sources for content to be used as a dictionary by a particular HTTP request/response interaction?
(2) How should the client and server negotiate to use dictionaries at all, and how should they identify the particular content to use?
(3) How should the compressor make use of the dictionary content?
(4) Do these mechanisms introduce new security issues, and if so, how can they be mitigated?

Jyrki, would you mind briefly sharing where you guys stand with respect to these questions? Where do you contemplate proposing dictionaries be allowed to be drawn from? How do you intend to negotiate the use of a dictionary?

In particular, a derivative topic from questions (1) and (2) is whether and how the dictionary selection and negotiation process should be coupled to the negotiation/selection of the compression algorithm. Given that several of the content coding compressors (shared brotli, zstd, and gzip/deflate) support custom dictionaries, it seems desirable to find a mechanism that is agnostic to the algorithm in use. In fact, I can only imagine it making sense to tie a spec in this space to a specific compressor if there were a very strong reason to do so.

My reading of the shared brotli draft[2] is that, to the extent to which specific mechanisms for handling dictionaries are being contemplated by the Brotli team, they are strongly coupled to the use of Brotli and/or the Shared Brotli format. Is that your intent? If so, would you mind explaining why you feel that's advantageous over a generic solution?

For our own part, we find ourselves drawn towards a solution that makes a lot of the same choices as SDCH. That is, one that treats dictionaries as explicit resources that can be dynamically advertised by an origin, fetched and cached by a client, and then negotiated to be used in requests/responses between the two. The ability to treat a previous, cached response as a base on which to apply a "diff" (negotiated by ETag?) is also attractive to us.

Thanks,
Felix

[1] https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format <https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format>
[2] https://tools.ietf.org/html/draft-vandevenne-shared-brotli-format-01#section-9 <https://tools.ietf.org/html/draft-vandevenne-shared-brotli-format-01#section-9>

On 08/22/2018 11:13 PM, Mark Nottingham wrote:
> Hello Felix and Jyrki,
> Shared dictionary compression in various forms has been discussed in the Working Group for a fair amount of time. What's blocking progress is an agreed-to description of its security properties, issues therein, and acceptable mitigations for them.
> Some people have expressed interest in attempting to document that in an Internet-Draft, but to date we haven't seen any progress publicly. If you'd like, I can try to put you in touch with them to see if they need help, etc.
> Regards,
> P.S. Felix, your e-mail didn't make it to me, or into the archives. Are you subscribed to the list?
>> On 22 Aug 2018, at 6:23 pm, Jyrki Alakuijala <jyrki@google.com> wrote:
>> 
>> Fully agree! Sharing dictionaries is an amazing opportunity in making the internet faster and cheaper. SDCH never exploited that opportunity fully and it is great that we all are giving another go on this.
>> 
>> I presume Zstd dictionaries are simple:
>>  • fill lz77 buffer with bytes
>> ... and I know that Shared brotli dictionaries are relatively complex:
>>  • fill lz77 buffer with bytes, or,
>>  • add special meaning for unique (distance, length) pairs (2 % more density than filling lz77 buffer with bytes), or,
>>  • perform a binary diff on patch data (makes bsdiff obsolete by compressing 5–10 % more than bsdiff+brotli, can by 95+ % more dense than traditional lz77 dictionary for patching).
>>  • when distance overflows for unique (distance, length) pairs, a customized word transform is applied (gives 2 % more density)
>>  • context modeling: dictionary ordering of interpretation of (distance, length) pairs may depend on the last two bytes (unknown gains, I anticipate 1 %)
>> For data like the Google search result pages we can see a reduction of ~50 % in data when we go from "br" Brotli to Shared Brotli, and naturally very significant latency wins. Having binary diffing within shared dictionary infrastructure can allow patches for web packaging, Android apps, fonts, or other complex structured data to be efficiently compressed with shared dictionary by just using the previous version of that data as a dictionary.
>> 
>> 
>> 
>> On Wed, Aug 22, 2018 at 2:30 AM, Felix Handte <felixh@fb.com> wrote:
>> Hello all,
>> 
>> Quick introduction: I'm an engineer on the Data Compression team at Facebook. While we partner with other teams here to apply compression internally at Facebook, we primarily maintain the open source Zstandard[1][2] and LZ4[3] libraries.
>> 
>> We've seen enormous success leveraging dictionary-based compression with Zstd internally, and I'm starting to look at how we can apply the same toolkit/approach to compressing our public web traffic. As we're thinking about how to do this, both as a significant origin of HTTP traffic and as maintainers of open source compression tools, we want very much to pursue a course of action that is constructive for the broader community.
>> 
>> There are, by my count, three competing proposals for how this sort of thing might work (SDCH[4], Compression Dictionaries for HTTP/2[5], and Shared Brotli Dictionaries[6]+[7]). With no public consensus around how to do this well, it's tempting for us to simply build on the tooling we've built internally, and apply it to our traffic between our webservers and our mobile apps (where we control both ends of the connection and can do anything we want). However, it would be pretty tragic for Facebook to gin up its own spec and implementation in this space, roll it out, and end up with something mutually incompatible with anyone else's efforts, further fragmenting the community and driving consensus further off.
>> 
>> So I wanted to first resurface this topic with you all. In short, is there anyone still interested in pursuing a standard covering these topics? If so, I would like to work with you and help build something in this space that can actually see adoption.
>> 
>> Thanks,
>> Felix
>> 
>> [1] https://github.com/facebook/zstd
>> [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dkucherawy-2Ddispatch-2Dzstd-2D03&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=0WbAaGBnEaii-Se8oX3hykAULz2ELjd0BpsiWnUtJn8&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dkucherawy-2Ddispatch-2Dzstd-2D03&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=0WbAaGBnEaii-Se8oX3hykAULz2ELjd0BpsiWnUtJn8&e=>
>> [3] https://github.com/lz4/lz4 <https://github.com/lz4/lz4>
>> [4] https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dlee-2Dsdch-2Dspec-2D00&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=xSsC6Y5cdBQH26kF2EsHBo14ro6umSHXxj5isrqUROs&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dlee-2Dsdch-2Dspec-2D00&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=xSsC6Y5cdBQH26kF2EsHBo14ro6umSHXxj5isrqUROs&e=>
>> [5] https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dvkrasnov-2Dh2-2Dcompression-2Ddictionaries-2D03&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=tpEFC6Asr_xaTeEcJr4UWozbPOwBs1LQUVJ-s9eNlYU&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dvkrasnov-2Dh2-2Dcompression-2Ddictionaries-2D03&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=tpEFC6Asr_xaTeEcJr4UWozbPOwBs1LQUVJ-s9eNlYU&e=>
>> [6] https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dvandevenne-2Dshared-2Dbrotli-2Dformat-2D01&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=Y30ZAb-fBmy4gAt6S37X1t7DuD0qZPGX44u9beu8s24&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dvandevenne-2Dshared-2Dbrotli-2Dformat-2D01&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=Y30ZAb-fBmy4gAt6S37X1t7DuD0qZPGX44u9beu8s24&e=>
>> [7] https://github.com/google/brotli/wiki/Fetch-Specification <https://github.com/google/brotli/wiki/Fetch-Specification>
>> 
>> 
> --
> Mark Nottingham   https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mnot.net_&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=CVOwcJi5LnOBvYeEnIyqEsxyEavnNj9CB0O7W547ToE&e= <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mnot.net_&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=oTwXA79Pkx6Pzji3vhGR4A&m=PMTxmlcVv81wLtYeJ-1OSrLBQshZ8qFJXD0_hE2Dm7o&s=CVOwcJi5LnOBvYeEnIyqEsxyEavnNj9CB0O7W547ToE&e=>
Received on Friday, 31 August 2018 05:58:03 UTC