Re: More on allowed field characters

--------
Roy T. Fielding writes:

> I am fine with HPACK also being used to convey UTF-8 named fields and/or
> carrying binary field values, but only when that is clearly indicated
> via the protocol and processed as such.

I looked into this as part of Structured Headers, and I can say
rather conclusively that nobody competent would do that.

The average symbol length in HPACK's Huffman table is 18.2 bits, so
high-entropy binary data, whether compressed or encrypted, encodes
to more than twice its original size: +128% to be precise.
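
A quick sanity check of that number, as a sketch only: it takes the
18.2 bit average as given rather than recomputing it from the table,
and assumes every byte value is equally likely. The function name is
made up for illustration.

```python
# Expansion of uniformly random bytes under HPACK's Huffman code,
# assuming the 18.2-bit average code length quoted above.
AVG_CODE_BITS = 18.2   # quoted average over all 256 byte values
RAW_BITS = 8           # size of one unencoded byte


def expansion_percent(avg_code_bits: float, raw_bits: int = RAW_BITS) -> float:
    """Size increase in percent when each raw byte costs avg_code_bits."""
    return (avg_code_bits / raw_bits - 1.0) * 100.0


print(f"+{expansion_percent(AVG_CODE_BITS):.1f}%")
```

That lands at roughly +127.5%, which rounds to the +128% above.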

The HPACK Huffman table could have been designed to minimize UTF-8's
penalty, at no cost to the lower 128 ASCII characters, but it almost
looks like the opposite was attempted.

It is impossible to put a representative number on the UTF-8
pessimization, but given the magnitude of it, I think the original
"frownie", U+2639, is a proper example:

In UTF-8 it becomes three bytes (0xe2, 0x98, 0xb9), 24 bits, which
HPACK expands to 65 bits.

In comparison, the six-character ASCII escape "\u2639" takes only
48 bits.
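
The 48 bit figure can be reproduced with a short sketch. The code
lengths below are the ones I recall from the Huffman table in RFC 7541
Appendix B for these six ASCII characters; treat them as an assumption
and check them against the RFC. The 65 bit figure is simply carried
over, not recomputed.

```python
# Huffman code lengths (bits) for the six ASCII characters of the escape,
# as recalled from RFC 7541 Appendix B (assumption: verify against the RFC).
CODE_BITS = {"\\": 19, "u": 6, "2": 5, "6": 6, "3": 6, "9": 6}

escape = "\\u2639"                        # backslash, 'u', four hex digits
escaped_bits = sum(CODE_BITS[c] for c in escape)

raw_utf8 = "\u2639".encode("utf-8")       # the three bytes 0xe2 0x98 0xb9
raw_bits = 8 * len(raw_utf8)              # size without Huffman coding
huffman_utf8_bits = 65                    # quoted above, not recomputed here

print(f"escape:          {escaped_bits} bits")
print(f"raw UTF-8:       {raw_bits} bits")
print(f"Huffman'd UTF-8: {huffman_utf8_bits} bits")
```

Nearly all of the escape's cost is the backslash; the other five
characters are cheap because they are common in real header values.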

According to my experiments, base64 is the optimal HPACK encoding
for high-entropy binary data, obviously reflecting its popularity
in the random sample of HTTP headers that went into the HPACK table
design.

The base64 characters average 6.46 bits per symbol, and base64 emits
four symbols for every three input bytes, making the overhead just:

	4 * 6.46 / 3 ≈ 8.62 bits per input byte, i.e. about +7.8%
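
For anyone who wants to redo the arithmetic, here is a sketch that
derives the average from per-symbol code lengths. The lengths are as
I recall them from RFC 7541 Appendix B for the standard base64
alphabet (with "+" and "/"); that grouping is an assumption of this
sketch and should be verified against the RFC. It also assumes all 64
symbols are equally likely, which holds for high-entropy input.

```python
import string

# Huffman code lengths (bits) for the 64 base64 symbols, as recalled from
# RFC 7541 Appendix B (assumption: verify against the RFC before relying on it).
CODE_BITS = {}
CODE_BITS.update(dict.fromkeys("012aceiost", 5))
CODE_BITS.update(dict.fromkeys("3456789/Abdfghlmnpru", 6))
CODE_BITS.update(dict.fromkeys("BCDEFGHIJKLMNOPQRSTUVWYjkqvwxyz", 7))
CODE_BITS.update(dict.fromkeys("XZ", 8))
CODE_BITS["+"] = 11

BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)

# Average cost of one base64 symbol, all 64 symbols equally likely.
avg_symbol_bits = sum(CODE_BITS[c] for c in BASE64_ALPHABET) / len(BASE64_ALPHABET)

# base64 turns 3 input bytes into 4 symbols.
bits_per_input_byte = 4 * avg_symbol_bits / 3
overhead_percent = (bits_per_input_byte / 8 - 1) * 100

print(f"{avg_symbol_bits:.3f} bits/symbol")
print(f"{bits_per_input_byte:.3f} bits per input byte")
print(f"+{overhead_percent:.1f}% overhead")
```

With these lengths the average is a little under 6.47 bits per symbol,
which gives the 8.62 bits per byte and 7.8% above. Padding characters
and the cost of the header name are ignored.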

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

Received on Friday, 27 August 2021 06:49:56 UTC