Re: Broader discussion - limit dictionary encoding to one compression algorithm?

It's probably worth noting that the draft does not specify "Brotli" and
"Zstandard" but, rather, "dcb" and "dcz", which fix specific parameters for
each (window size in particular) that lead to the restrictions I mentioned.
They are effectively the dictionary-equivalents of "zstd" and "br", both of
which use the same 8 MB and 16 MB windows, respectively, that "dcz" and
"dcb" define.

Dictionary compression for delta updates is more likely to benefit from
large-window variants in cases where HTTP is used to deliver delta updates
of large files, since the window size and other parameters for each
encoding directly impact the effectiveness of the delta encoding and the
size of the resources it can be applied to. I would not be surprised to see
large/huge variants of these content encodings defined and used outside of
the browser case; they can still leverage the same dictionary mechanism and
would only need to register an appropriate content encoding.
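As a rough illustration of why the dictionary and window matter for delta
updates: the previously shipped version of a resource serves as the
dictionary when compressing the new version, so any bytes that match the
old version nearly vanish from the output. The sketch below uses zlib's
preset-dictionary support from the Python standard library purely as a
stand-in (neither Brotli nor Zstandard ships in the stdlib; the real "dcb"
and "dcz" encodings use those algorithms with the windows described above):

```python
# Illustrative only: zlib's preset dictionary ("zdict") stands in for the
# dictionary-aware Brotli/Zstandard encodings discussed in the draft.
import os
import zlib

# Pretend "old_version" is the previously shipped (incompressible) binary
# and "new_version" is a small patch on top of it.
old_version = os.urandom(20000)  # zlib dictionaries top out at 32 KB
new_version = old_version[:10000] + b"-- small patch --" + old_version[10000:]

# Ordinary compression: random input barely shrinks.
plain = zlib.compress(new_version)

# Dictionary compression: the old version is the preset dictionary, so the
# encoder emits back-references into it instead of literal bytes.
comp = zlib.compressobj(zdict=old_version)
delta = comp.compress(new_version) + comp.flush()

# The decoder needs the same dictionary to reconstruct the new version.
decomp = zlib.decompressobj(zdict=old_version)
restored = decomp.decompress(delta) + decomp.flush()

assert restored == new_version
assert len(delta) < len(plain)  # the dictionary-based delta is far smaller
```

The same dynamic is why window size caps the resource sizes a given
encoding can usefully delta-encode.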

There are other compression algorithms, specific to particular resource
types, that can do MUCH better delta encoding than Zstandard and Brotli
provide in the general case. Courgette, for example:
https://www.chromium.org/developers/design-documents/software-updates-courgette/

I wouldn't be surprised if a better diff encoding were developed for ML
models, one that beats generic pattern matching by knowing the format of
the file (a giant collection of weights), particularly given the size of
the Gen AI models, where even the smallest are multiple gigabytes.

I don't expect dictionary updates over HTTP (using the compression
dictionary transport mechanism) will be limited to 1-2 content encodings
for very long, so the main question is whether we define both "dcb" and
"dcz" now or only one of them and let other content encodings follow for
different use cases in future RFCs.

I think it makes sense to spec the dictionary-aware versions of both "zstd"
and "br" since we already have both of them, they are both in broad use,
and their parameters map directly to what is currently defined for "dcz"
and "dcb". This is effectively defining how the existing encodings should
behave when using dictionaries.
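For reference, here's roughly what that negotiation looks like on the wire.
This is an illustrative sketch based on the compression dictionary
transport draft (the "Use-As-Dictionary" and "Available-Dictionary" header
fields); the match pattern and hash are made-up placeholders, not values
from the draft:

```http
# First fetch: the response nominates itself as a dictionary for
# future requests matching the given pattern.
GET /app.v1.js HTTP/1.1
Accept-Encoding: gzip, br, zstd

HTTP/1.1 200 OK
Content-Encoding: br
Use-As-Dictionary: match="/app.*.js"

# Later fetch: the client advertises the stored dictionary (by hash)
# and the dictionary-aware encodings alongside the regular ones.
GET /app.v2.js HTTP/1.1
Accept-Encoding: gzip, br, zstd, dcb, dcz
Available-Dictionary: :<base64 SHA-256 of the stored dictionary>:

# The server picks a dictionary-aware encoding via the normal mechanism.
HTTP/1.1 200 OK
Content-Encoding: dcb
```

Nothing here is new machinery; it's the standard Accept-Encoding /
Content-Encoding dance plus the dictionary advertisement.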

On Tue, May 21, 2024 at 1:02 PM Patrick Meenan <patmeenan@gmail.com> wrote:

>
>
> On Tue, May 21, 2024 at 12:41 PM Poul-Henning Kamp <phk@phk.freebsd.dk>
> wrote:
>
>> Patrick Meenan writes:
>>
>> > ** The case for a single content-encoding:
>> > […]
>> > ** The case for both Brotli and Zstandard:
>>
>> First, those are not really the two choices before us.
>>
>> Option one is:  Pick one single algorithm
>>
>> Option two is:  Add a negotiation mechanism and seed a new IANA registry
>> with those two algorithms
>>
>> As far as I can tell, there is no credible data which shows any
>> performance difference between the two, and no reason to think that any
>> future compression algorithm will do significantly better.
>>
>
> We already have a negotiation mechanism.  It uses "Accept-Encoding" and
> "Content-Encoding" and the existing registry. Nothing about the negotiation
> changes if we use one, two, or more. The question is whether we specify and
> register the "dcb" content encoding as well as the "dcz" content encoding
> as part of this draft, or if we only register one (or if we also add a
> restriction that no other content encodings can use the dictionary
> negotiation).
>
> As for future encodings, we don't know whether any algorithms will do
> better, but there is the potential for content-aware delta encodings to do
> better (with things like reallocated addresses in WASM, etc.). More likely,
> there will come a time when someone wants to delta-encode multi-gigabyte
> resources where the 50/128MB limitations laid out for "dcb" and "dcz"
> won't work, and a "large window" variant may need to be specified (as a
> new content encoding).
>

Received on Wednesday, 22 May 2024 14:03:21 UTC