Re: More on allowed field characters

> On Aug 26, 2021, at 11:49 PM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> 
> --------
> Roy T. Fielding writes:
> 
>> I am fine with HPACK also being used to convey UTF-8 named fields and/or
>> carrying binary field values, but only when that is clearly indicated
>> via the protocol and processed as such.
> 
> I looked into this as part of Structured Headers, and I can say
> rather conclusively that nobody competent would do that.

That's funny. I just spent three years updating a 30-year-old protocol
and am quite sure that (aside from Referer and TE) field names are
rarely chosen for efficiency. If-Unmodified-Since is my worst.

> The average symbol length in HPACK's huffman table is 18.2 bits, so 
> high entropy binary data, be it due to compression or encryption,
> encodes to more than twice the original size, +128% to be precise.
> 
> The HPACK huffman table could have been designed to minimize UTF-8's
> penalty, at no cost to the lower 128 ASCII characters, but it almost
> looks like the opposite was attempted.
> 
> It is impossible to put a representative number on the UTF-8
> pessimization, but given the magnitude of it, I think the original
> "frownie", U+2639, is a proper example:
> 
> In UTF-8 it becomes (0xe2, 0x98, 0xb9) which HPACK expands to 65 bits.
> 
> In comparison "\u2639" only takes 48 bits.
> 
> According to my experiments, base64 is the optimal HPACK encoding
> for high entropy binary data, obviously reflecting its popularity
> in the random sample of HTTP headers that went into the HPACK table
> design.
> 
> The base-64 characters average 6.46 bits per symbol making the
> overhead just:
> 
> 	4 * 6.46 / 3 = 8.62 bits/byte = 7.8% 

That's nice to know. Maybe we should add a "check your HPACK length"
resource somewhere, or a new flag on curl for evaluating extension names.
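
As a rough illustration of the arithmetic such a check would report, here
is a minimal sketch that uses only the figures quoted above (18.2 and 6.46
bits per symbol, 65 and 48 bits for U+2639). The function name and the
result labels are made up for this example. It does not consult the real
HPACK Huffman table, so treat the output as a restatement of those
averages rather than a measurement.

```python
def hpack_overhead(encoded_bits: float, raw_bytes: int) -> float:
    """Fractional size increase of an encoding relative to the raw bytes.

    encoded_bits: bits emitted by the (assumed) Huffman coding
    raw_bytes:    number of original octets being represented
    """
    return encoded_bits / (8 * raw_bytes) - 1.0


# Figures below are taken from the quoted message, not recomputed
# from the HPACK table.
results = {
    # One high-entropy octet at the quoted 18.2-bit average code length.
    "raw binary": hpack_overhead(18.2, 1),
    # base64: 4 symbols at the quoted 6.46 bits each carry 3 octets.
    "base64": hpack_overhead(4 * 6.46, 3),
    # U+2639 as three UTF-8 octets, quoted as 65 bits after HPACK.
    "U+2639 as UTF-8": hpack_overhead(65, 3),
    # The six-character ASCII escape \u2639, quoted as 48 bits.
    "U+2639 as \\u2639": hpack_overhead(48, 3),
}

for label, overhead in results.items():
    print(f"{label:>18}: {overhead:+.1%}")
```

Small differences from the percentages quoted above come from where the
intermediate values are rounded.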

....Roy

Received on Friday, 27 August 2021 16:13:23 UTC