RE: comments about draft-ietf-httpbis-header-compression from Mike Bishop on 2015-01-05 (ietf-http-wg@w3.org from January to March 2015)

From: Mike Bishop <Michael.Bishop@microsoft.com>
Date: Mon, 5 Jan 2015 23:46:24 +0000
To: Matthew Kerwin <matthew@kerwin.net.au>, Frédéric Kayser <f.kayser@free.fr>
CC: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <BL2PR03MB1328717068FEF861E173EEE87580@BL2PR03MB132.namprd03.prod.outlook.com>

The other point to remember here is that the “compression” attempts to avoid sending literals where possible.  *When a literal must be sent*, it can optionally be Huffman-encoded.  You’re correct that literals which are primarily non-ASCII probably won’t benefit from the Huffman-encoding, but they still benefit from avoiding the literal in the first place.

And Matthew is correct that the code point distribution in the Huffman table is drawn from real-world header usage, which demonstrates a bias toward ASCII.
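
To illustrate the "optionally Huffman-encoded" point, here is a minimal sketch of how an encoder might pick between the two forms of a string literal by comparing sizes. The names `huffman_size`, `choose_string_encoding`, and `ILLUSTRATIVE_BITS` are hypothetical, and the code lengths are made-up placeholders (6 bits for printable ASCII, 24 bits for every other octet), not the table from the draft; only the shape of the decision is meant to carry over.

```python
from typing import Mapping, Tuple


def huffman_size(data: bytes, code_bits: Mapping[int, int]) -> int:
    """Octets needed if each input octet b costs code_bits[b] bits.

    The bit total is rounded up because the encoded string is padded
    to a whole number of octets.
    """
    total_bits = sum(code_bits[b] for b in data)
    return (total_bits + 7) // 8


def choose_string_encoding(data: bytes, code_bits: Mapping[int, int]) -> Tuple[bool, int]:
    """Return (use_huffman, payload_octets), picking the smaller form."""
    encoded = huffman_size(data, code_bits)
    if encoded < len(data):
        return True, encoded
    return False, len(data)


# ASSUMPTION: placeholder code lengths for illustration only, not the
# real HPACK table. Printable ASCII is cheap, everything else is costly.
ILLUSTRATIVE_BITS = {b: (6 if 0x20 <= b < 0x7F else 24) for b in range(256)}

ascii_value = b"text/html"            # 9 octets of plain ASCII
utf8_value = "日本語".encode("utf-8")  # 9 octets, all non-ASCII

ascii_choice = choose_string_encoding(ascii_value, ILLUSTRATIVE_BITS)
utf8_choice = choose_string_encoding(utf8_value, ILLUSTRATIVE_BITS)
```

Under these assumed lengths the ASCII value shrinks from 9 to 7 octets, while the UTF-8 value would grow to 27 octets, so the sketch falls back to sending it raw at 9 octets. Either way, a header that hits the table and is sent as an index avoids the literal altogether.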

From: phluid61@gmail.com [mailto:phluid61@gmail.com] On Behalf Of Matthew Kerwin
Sent: Monday, January 5, 2015 3:20 PM
To: Frédéric Kayser
Cc: ietf-http-wg@w3.org
Subject: Re: comments about draft-ietf-httpbis-header-compression

On 6 January 2015 at 08:52, Frédéric Kayser <f.kayser@free.fr> wrote:
Hello,
to me the major drawback of HPACK is not really technical. I don't really care whether it could be 20% faster or shrink data even further using top-notch ANS/FSE compression instead of a compression scheme even older than my parents. It's a bit more philosophical: the lengths of the Huffman codes* are biased towards a subset of ASCII to the point that it becomes pointless to try to compress anything that is not plain English, and this is a major spit in the face of international users. Just put in two or three code points that are outside of ASCII and blamo! You take such a huge penalty that you can forget about using compression and go straight to the uncompressed way. This is the 21st century; Unicode has taken over the Web. Nowadays IRIs can be written in Japanese, Arabic, Russian or Greek, but deep inside HTTP/2, ASCII and percent-encoded strings still rule (revising RFC 3987 would be welcome at some point). Refrain from using your mother lingo in HTTP/2 headers: it's not open to the wide world of tomorrow, since it's based on stats from yesterday.

*30-bit codes are ridiculous and make code slower on 32-bit CPUs; capping them at 16 or 15 bits would have no impact on overall compression (since a string that hits such long codes would still be pointless to compress). I still don't get why the Huffman part tries to be a universal encoder, since in practice it can only really compress a small subset of ASCII, and anything else, especially UTF-8, quickly expands. I'd rather see some kind of VLE clearly geared toward this subset (which would be more effective) and not trying to be universal at all. This way, if the string is made only of code points from the subset, it will compress pretty well; otherwise, record it uncompressed (don't even try to encode it).

My two cents

I'm pretty sure HPACK was targeted towards a corpus of real world HTTP data. Since we're not rewriting HTTP semantics (just the way bits are put on the wire), and since most HTTP headers use (/require) plain ASCII values, what do we gain by targeting values that aren't in use (/allowed)?

Case in point: whether or not IRIs are a thing, the :path pseudo-header is explicitly drawn from the *URI*, which is percent-encoded ASCII.
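
As a small illustration of that point, the sketch below percent-encodes a non-ASCII path with Python's standard `urllib.parse.quote`. The path itself is a made-up example; the takeaway is only that whatever script the IRI is written in, the string that ends up in :path consists of ASCII characters.

```python
from urllib.parse import quote

# Hypothetical IRI path containing Japanese text.
iri_path = "/wiki/日本語"

# quote() encodes the string as UTF-8 and percent-encodes every octet
# outside the unreserved set; "/" is left alone by default.
uri_path = quote(iri_path)

print(uri_path)            # /wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E
print(uri_path.isascii())  # True
```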

That said, if one day we do rewrite the semantics, and allow UTF-8 everywhere, then by all means compression (among many other things) will have to be readdressed.

--
  Matthew Kerwin
  http://matthew.kerwin.net.au/

Received on Monday, 5 January 2015 23:46:54 UTC