Improving on Huffman

The Huffman code for header compression seems to leave a lot on the table for the sake of statelessness. Some quick measurements on URLs from one .har file suggest a likely optimization that shouldn't add any vulnerability.
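
To be concrete, here is a sketch of the kind of measurement I mean (not the exact script; "capture.har" is a placeholder file name, and splitting characters into classes by their high 3 bits is just a crude stand-in, not anything from the spec). It pulls the request URLs out of a .har file, which is just JSON, and counts how often adjacent path characters fall in the same class:

import json
from urllib.parse import urlparse

with open("capture.har") as f:                      # placeholder file name
    entries = json.load(f)["log"]["entries"]

same = total = 0
for e in entries:
    u = urlparse(e["request"]["url"])
    # Approximate the :path pseudo-header: path plus query string.
    path = (u.path + ("?" + u.query if u.query else "")).encode("ascii", "ignore")
    for prev, cur in zip(path, path[1:]):           # adjacent character pairs
        total += 1
        same += (prev >> 5) == (cur >> 5)           # high 3 bits = crude class

if total:
    print(f"{100 * same / total:.1f}% of adjacent :path characters share a class")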

60%+ of characters in :path strings follow a character of the same kind, e.g. numeral after numeral or lowercase after lowercase. Those that don't are usually followed by one that does. (E.g., in “key=value” the “y” predicts the character class of the “v” on the far side of the “=”.) This is sufficient to predict the high 3 bits of most characters, and also to map uppercase and lowercase characters to a common encoding. The current encoding effectively takes ~2 extra bits to uppercase a given character.
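
As a toy illustration of how little state this needs (not a proposal for the actual code, and cruder than the above in that it conditions only on the immediately preceding character and spends whole bits rather than Huffman-weighted ones):

# Toy predictive coder: the high 3 bits of the previous byte are the only
# state carried forward. If the next byte is in the same class, send a 1-bit
# "same class" flag plus its low 5 bits (6 bits total); otherwise send a
# 0 flag plus all 8 bits (9 bits). A real coder would Huffman these, of course.

def encode(data: bytes) -> list[str]:
    out = []
    prev_class = None
    for b in data:
        cls = b >> 5                                   # high 3 bits = class
        if cls == prev_class:
            out.append("1" + format(b & 0x1F, "05b"))  # predicted: 6 bits
        else:
            out.append("0" + format(b, "08b"))         # escape: 9 bits
        prev_class = cls
    return out

path = b"/search?key=value&page=42"                    # made-up example path
bits = encode(path)
print(sum(len(x) for x in bits), "bits vs", 8 * len(path), "raw")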

It seems hard to imagine an exploit that could leverage three bits of state (one of which merely tells us the data isn't binary), carried only from each character to its successor. On the other hand, the current scheme expands binary data to ~300% of its original size, and it encodes uppercase characters and punctuation noticeably longer than anything else, so any content that has been encoded into text sticks out obviously (not to mention that it doesn't get compressed!). All of this would seem to make it easier to pinpoint the size of certain data items. Meanwhile, if a high-water mark determines how much padding is used to hide the size of the information, the worst case is what determines the ultimate efficacy of the compression. So good performance over a broader range of input would help security-minded users, and if we don't address their concerns, what's the purpose of going stateless in the first place?
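
To spell out the high-water-mark point (purely illustrative; the sizes below are made up):

# If each encoded field is padded up to the largest size seen so far, the
# observed length of each field is just the running high-water mark -- but one
# badly-compressed value (say, binary-in-text data that expands under the
# current table) raises the mark for everything that follows.

def padded_lengths(encoded_sizes):
    mark = 0
    sent = []
    for n in encoded_sizes:
        mark = max(mark, n)
        sent.append(mark)           # bytes actually put on the wire
    return sent

sizes = [40, 38, 42, 120, 39, 41]   # one pathological value in the middle
print(padded_lengths(sizes))        # [40, 40, 42, 120, 120, 120]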

Should I pursue a minimally stateful scheme for comparison? There should be a double-digit percentage improvement over the current coding.

To be clear: I don’t claim to be a security or data compression expert.
