- From: Matthew Kerwin <matthew@kerwin.net.au>
- Date: Tue, 6 Jan 2015 09:20:06 +1000
- To: Frédéric Kayser <f.kayser@free.fr>
- Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
- Message-ID: <CACweHNCZPUbWczcRDDNYuL6-+NwU44KqrN8rDynZz54ToYwT0g@mail.gmail.com>
On 6 January 2015 at 08:52, Frédéric Kayser <f.kayser@free.fr> wrote:

> Hello,
> to me the major drawback of HPACK is not really technical. I don't really care whether it could be 20% faster or shrink data even further using top-notch ANS/FSE compression instead of a compression scheme even older than my parents; it's a bit more philosophical: the lengths of the Huffman codes* are biased towards a subset of ASCII, to the point that it becomes pointless to try to compress anything that is not plain English, and this is a major spit in the face of international users. Just put in two or three code points outside of ASCII and blamo! you take such a huge penalty that you can forget about using compression and go straight to the uncompressed form.
>
> This is the 21st century: Unicode has taken over the Web, and nowadays IRIs can be written in Japanese, Arabic, Russian or Greek, but deep inside HTTP/2, ASCII and percent-encoded strings still rule (revising RFC 3987 would be welcome at some point). Refrain from using your mother tongue in HTTP/2 headers: it's not open to the wide world of tomorrow, since it's based on stats from yesterday.
>
> *30-bit codes are ridiculous and make decoding slower on 32-bit CPUs; capping them at 15 or 16 bits would have no impact on overall compression (since hitting such large codes would make compression pointless anyway). I still don't get why the Huffman part tries to be a universal encoder, since in practice it can only really compress a small subset of ASCII, and anything else, especially UTF-8, quickly expands. I'd rather see some kind of VLE (variable-length encoding) clearly geared toward this subset (it would be more effective), not trying to be universal at all. That way, if a string is made only of code points from the subset, it will compress pretty well; otherwise, record it uncompressed (don't even try to encode it).
>
> My two cents

I'm pretty sure HPACK was targeted at a corpus of real-world HTTP data. Since we're not rewriting HTTP semantics (just the way bits are put on the wire), and since most HTTP headers use (/require) plain ASCII values, what do we gain by targeting values that aren't in use (/allowed)? Case in point: whether or not IRIs are a thing, the :path pseudo-header is explicitly drawn from the *URI*, which is percent-encoded ASCII.

That said, if one day we do rewrite the semantics and allow UTF-8 everywhere, then by all means compression (among many other things) will have to be readdressed.

--
Matthew Kerwin
http://matthew.kerwin.net.au/
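For concreteness, a rough sketch of the asymmetry Kayser describes. The per-byte code lengths below are assumptions approximating RFC 7541 Appendix B (common URL characters get roughly 5-7 bit codes; bytes above 0x7F get roughly 20-30 bit codes), not the actual HPACK table:

```python
# Illustrative sketch only: the lengths here are assumed values in
# the spirit of RFC 7541 Appendix B, not the real HPACK Huffman table.

def approx_code_length(byte: int) -> int:
    """Assumed HPACK-like Huffman code length for one byte, in bits."""
    if 0x61 <= byte <= 0x7A or byte in b"0123456789-./":
        return 6    # common URL characters: roughly 5-7 bits in HPACK
    if byte < 0x80:
        return 8    # rarer ASCII: roughly 7-10 bits
    return 24       # bytes above 0x7F: roughly 20-30 bits

def approx_huffman_size(s: bytes) -> int:
    bits = sum(approx_code_length(b) for b in s)
    return (bits + 7) // 8  # HPACK pads the final octet with EOS bits

ascii_path = "/search?q=compression".encode()
utf8_path = "/検索?q=圧縮".encode()  # each kanji is 3 UTF-8 bytes

print(len(ascii_path), approx_huffman_size(ascii_path))  # 21 -> ~17: shrinks
print(len(utf8_path), approx_huffman_size(utf8_path))    # 16 -> ~40: expands
```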
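Note that the fallback Kayser asks for ("otherwise record it uncompressed") is already part of HPACK: every string literal carries an H flag (RFC 7541 §5.2), so an encoder can compare both forms and send the raw octets whenever Huffman coding would expand them. A minimal sketch of that decision, assuming string lengths under 127 so the length fits the 7-bit integer prefix (longer values need HPACK's full varint encoding):

```python
def encode_string_literal(raw: bytes, huffman: bytes) -> bytes:
    """Emit an RFC 7541 §5.2 string literal, picking the shorter form.

    Assumes len(...) < 127 so the length fits the 7-bit prefix.
    """
    if len(huffman) < len(raw) and len(huffman) < 127:
        return bytes([0x80 | len(huffman)]) + huffman  # H=1: Huffman-coded
    assert len(raw) < 127
    return bytes([len(raw)]) + raw                     # H=0: raw octets
```

So in practice the penalty for non-English values is losing compression, not paying the 30-bit-code expansion on the wire.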
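On the :path point: an IRI path is mapped to its URI form (percent-encoded UTF-8, per RFC 3987 §3.1) before it ever reaches the :path pseudo-header, so HPACK only sees ASCII there. A quick illustration with Python's standard library:

```python
from urllib.parse import quote

# The IRI path "/検索" becomes percent-encoded ASCII in the URI, and
# that ASCII form is what appears in the :path pseudo-header.
print(quote("/検索"))  # -> /%E6%A4%9C%E7%B4%A2
```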
Received on Monday, 5 January 2015 23:20:34 UTC