- From: Matthew Kerwin <matthew@kerwin.net.au>
- Date: Tue, 6 Jan 2015 09:20:06 +1000
- To: Frédéric Kayser <f.kayser@free.fr>
- Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
- Message-ID: <CACweHNCZPUbWczcRDDNYuL6-+NwU44KqrN8rDynZz54ToYwT0g@mail.gmail.com>
On 6 January 2015 at 08:52, Frédéric Kayser <f.kayser@free.fr> wrote:

> Hello,
> to me the major drawback of HPACK is not really technical. I don't really care whether it could be 20% faster or shrink data even further using top-notch ANS/FSE compression instead of a compression scheme even older than my parents; it's a bit more philosophical: the lengths of the Huffman codes* are biased towards a subset of ASCII, to the point that it becomes pointless to try to compress anything that is not plain English, and this is a major spit in the face of international users. Just put in two or three code points outside of ASCII and blamo! you take such a huge penalty that you can forget about using compression and go straight to the uncompressed form.
>
> This is the 21st century: Unicode has taken over the Web, and nowadays IRIs can be written in Japanese, Arabic, Russian or Greek, but deep inside HTTP/2, ASCII and percent-encoded strings still rule (revising RFC 3987 would be welcome at some point). Refrain from using your mother tongue in HTTP/2 headers: it's not open to the wide world of tomorrow, since it's based on stats from yesterday.
>
> *30-bit codes are ridiculous and make decoding slower on 32-bit CPUs; capping them at 15 or 16 bits would have no impact on overall compression (since hitting such large codes would make compression pointless anyway). I still don't get why the Huffman part tries to be a universal encoder, since in practice it can only really compress a small subset of ASCII, and anything else, especially UTF-8, quickly expands. I'd rather see some kind of VLE (variable-length encoding) clearly geared toward this subset (it would be more effective), not trying to be universal at all. That way, if a string is made only of code points from the subset, it will compress pretty well; otherwise, record it uncompressed (don't even try to encode it).
>
> My two cents

I'm pretty sure HPACK was targeted at a corpus of real-world HTTP data. Since we're not rewriting HTTP semantics (just the way bits are put on the wire), and since most HTTP headers use (/require) plain ASCII values, what do we gain by targeting values that aren't in use (/allowed)? Case in point: whether or not IRIs are a thing, the :path pseudo-header is explicitly drawn from the *URI*, which is percent-encoded ASCII.

That said, if one day we do rewrite the semantics and allow UTF-8 everywhere, then by all means compression (among many other things) will have to be readdressed.

--
Matthew Kerwin
http://matthew.kerwin.net.au/
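For concreteness, a rough sketch of the asymmetry Kayser describes. The per-byte code lengths below are assumptions approximating RFC 7541 Appendix B (common URL characters get roughly 5-7 bit codes; bytes above 0x7F get roughly 20-30 bit codes), not the actual HPACK table:

```python
# Illustrative sketch only: the lengths here are assumed values in
# the spirit of RFC 7541 Appendix B, not the real HPACK Huffman table.

def approx_code_length(byte: int) -> int:
    """Assumed HPACK-like Huffman code length for one byte, in bits."""
    if 0x61 <= byte <= 0x7A or byte in b"0123456789-./":
        return 6    # common URL characters: roughly 5-7 bits in HPACK
    if byte < 0x80:
        return 8    # rarer ASCII: roughly 7-10 bits
    return 24       # bytes above 0x7F: roughly 20-30 bits

def approx_huffman_size(s: bytes) -> int:
    bits = sum(approx_code_length(b) for b in s)
    return (bits + 7) // 8  # HPACK pads the final octet with EOS bits

ascii_path = "/search?q=compression".encode()
utf8_path = "/検索?q=圧縮".encode()  # each kanji is 3 UTF-8 bytes

print(len(ascii_path), approx_huffman_size(ascii_path))  # 21 -> ~17: shrinks
print(len(utf8_path), approx_huffman_size(utf8_path))    # 16 -> ~40: expands
```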
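Note that the fallback Kayser asks for ("otherwise record it uncompressed") is already part of HPACK: every string literal carries an H flag (RFC 7541 §5.2), so an encoder can compare both forms and send the raw octets whenever Huffman coding would expand them. A minimal sketch of that decision, assuming string lengths under 127 so the length fits the 7-bit integer prefix (longer values need HPACK's full varint encoding):

```python
def encode_string_literal(raw: bytes, huffman: bytes) -> bytes:
    """Emit an RFC 7541 §5.2 string literal, picking the shorter form.

    Assumes len(...) < 127 so the length fits the 7-bit prefix.
    """
    if len(huffman) < len(raw) and len(huffman) < 127:
        return bytes([0x80 | len(huffman)]) + huffman  # H=1: Huffman-coded
    assert len(raw) < 127
    return bytes([len(raw)]) + raw                     # H=0: raw octets
```

So in practice the penalty for non-English values is losing compression, not paying the 30-bit-code expansion on the wire.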
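On the :path point: an IRI path is mapped to its URI form (percent-encoded UTF-8, per RFC 3987 §3.1) before it ever reaches the :path pseudo-header, so HPACK only sees ASCII there. A quick illustration with Python's standard library:

```python
from urllib.parse import quote

# The IRI path "/検索" becomes percent-encoded ASCII in the URI, and
# that ASCII form is what appears in the :path pseudo-header.
print(quote("/検索"))  # -> /%E6%A4%9C%E7%B4%A2
```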
Received on Monday, 5 January 2015 23:20:34 UTC