
Re: Delta Compression and UTF-8 Header Values

From: Willy Tarreau <w@1wt.eu>
Date: Sat, 9 Feb 2013 14:33:41 +0100
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Mark Nottingham <mnot@mnot.net>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <20130209133341.GA8712@1wt.eu>
On Sat, Feb 09, 2013 at 09:36:57PM +0900, "Martin J. Dürst" wrote:
> On 2013/02/09 8:53, Mark Nottingham wrote:
> >My .02 -
> >
> >RFC2616 implies that the range of characters available in headers is 
> >ISO-8859-1
> 
> That's a leftover from the *very* early 1990s, when ISO-8859-1 was 
> actually a step forward from the various 'national' ISO-646 7-bit 
> encodings. At that time, it was not a bad idea of TimBL's to make the 
> Web work throughout Western Europe. UTF-8 hadn't even been invented yet.
> (see http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt)
> 
> The IETF understood the advantages of UTF-8 in the late 1990s, see 
> http://tools.ietf.org/html/rfc2277#section-3.1
> 
> These days, UTF-8 isn't a step forward, it's just plain obvious. The 
> browser folks at WHATWG would prefer not to have any Web pages in 
> anything other than UTF-8 anymore. That will take quite some time yet, 
> but the trend is very clear. See e.g. 
> http://googleblog.blogspot.jp/2010/01/unicode-nearing-50-of-web.html and
> http://w3techs.com/technologies/details/en-utf8/all/all. WebSocket was 
> designed with UTF-8 and binary built in from the start. For all kinds of 
> other protocols, UTF-8 is a no-brainer, too.
> 
> It would be a good idea to try hard to make the new protocol 
> forward-looking (or actually just acknowledge the present, rather than 
> stay frozen in the past) for the next 20 years or so in terms of 
> character encoding, too, and not only in terms of CPU/network performance.

Well, don't confuse UTF-8 and Unicode. UTF-8 is just a space-efficient way
of transporting Unicode characters for Western languages. For other scripts,
the encoding can inflate data by up to 50% (three bytes per character where
legacy encodings use two) and may make compression less efficient. Also,
processing it is particularly inefficient, since you have to parse each and
every byte to find character boundaries, making operations such as
case-insensitive string comparisons quite slow.

I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I hate using
it), I'm saying that it's not *THE* solution to every problem. It's just *A*
solution to *A* problem: "how to extend character sets in existing documents
without having to re-encode them all". I don't think this specific problem
falls within the scope of the HTTP/2 work, so at first glance, I'd say that
UTF-8 doesn't solve a known problem here.
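To illustrate the one problem UTF-8 does solve: ASCII bytes are already
valid UTF-8 unchanged, and even full ISO-8859-1 (the RFC2616 default
discussed above) converts mechanically. A minimal sketch (mine; "out" is
assumed to hold up to 2*n bytes):

#include <stddef.h>

/* ISO-8859-1 -> UTF-8: bytes below 0x80 pass through unchanged (which
 * is why pure-ASCII data needs no re-encoding at all); other bytes
 * expand to a two-byte sequence, e.g. 0xE9 "é" becomes 0xC3 0xA9. */
static size_t latin1_to_utf8(const unsigned char *in, size_t n,
                             unsigned char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];                 /* ASCII: unchanged  */
        } else {
            out[o++] = 0xC0 | (in[i] >> 6);   /* lead: 110000xx    */
            out[o++] = 0x80 | (in[i] & 0x3F); /* continuation byte */
        }
    }
    return o; /* number of bytes written */
}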

Regards,
Willy
Received on Saturday, 9 February 2013 13:34:15 GMT
