- From: Willy Tarreau <w@1wt.eu>
- Date: Sat, 9 Feb 2013 14:33:41 +0100
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: Mark Nottingham <mnot@mnot.net>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
On Sat, Feb 09, 2013 at 09:36:57PM +0900, "Martin J. Dürst" wrote:
> On 2013/02/09 8:53, Mark Nottingham wrote:
> > My .02 -
> >
> > RFC2616 implies that the range of characters available in headers is
> > ISO-8859-1
>
> That's a leftover from the *very* early 1990s, when ISO-8859-1 was
> actually a step forward from the various 'national' ISO-646 7-bit
> encodings. It was not a bad idea at that time by TimBL to make the Web
> work throughout Western Europe. UTF-8 wasn't even invented then.
> (see http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt)
>
> The IETF understood the advantages of UTF-8 in the late 1990s, see
> http://tools.ietf.org/html/rfc2277#section-3.1
>
> These days, UTF-8 isn't a step forward, it's just plain obvious. The
> browser folks at WHATWG would prefer not to have any Web pages in
> anything else than UTF-8 anymore. That will take quite some time yet,
> but the trend is very clear. See e.g.
> http://googleblog.blogspot.jp/2010/01/unicode-nearing-50-of-web.html and
> http://w3techs.com/technologies/details/en-utf8/all/all. Websockets was
> designed with UTF-8 and binary built in from the start. For all kinds of
> other protocols, UTF-8 is a no-brainer, too.
>
> It would be a good idea to try hard to make the new protocol
> forward-looking (or actually just acknowledge the present, rather than
> stay frozen in the past) for the next 20 years or so in terms of
> character encoding, too, and not only in terms of CPU/network
> performance.

Well, don't confuse UTF-8 and Unicode. UTF-8 is just a way of transporting
Unicode characters that is space-efficient for Western languages. For other
character sets it can inflate the data by up to 50%, and it may also make
compression less efficient. Processing it is particularly inefficient as
well, since you have to parse each and every byte to find a sequence
length, which makes string comparisons quite slow.

I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I hate
using it), I'm saying that it's not *THE* solution to every problem. It's
just *A* solution to *A* problem: "how to extend character sets in existing
documents without having to re-encode them all". I don't think this
specific problem is within the scope of the HTTP/2 work, so at first glance
I'd say that UTF-8 doesn't seem to solve a known problem here.

Regards,
Willy
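To make the per-byte parsing point concrete, here is a minimal sketch in C (the helper names `utf8_seq_len` and `utf8_strlen` are hypothetical and not taken from any particular implementation): in UTF-8 the length of each character has to be derived from its lead byte, whereas in ISO-8859-1 every character is exactly one byte.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence introduced
 * by lead byte <b>, or 0 if <b> is not a valid lead byte (e.g. a
 * continuation byte of the form 10xxxxxx).
 */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx : US-ASCII          */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx : U+0080..U+07FF    */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx : U+0800..U+FFFF    */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx : U+10000..U+10FFFF */
    return 0;                          /* continuation or invalid byte */
}

/* Count code points in a NUL-terminated UTF-8 string: every byte has to
 * be inspected, while in ISO-8859-1 the count is simply the byte count.
 */
static size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    while (*s) {
        size_t len = utf8_seq_len((unsigned char)*s);

        s += len ? len : 1;   /* skip invalid bytes one at a time */
        n++;
    }
    return n;
}

int main(void)
{
    /* "héllo": 6 bytes in UTF-8, but only 5 bytes in ISO-8859-1 */
    const char *s = "h\xc3\xa9llo";

    printf("%zu code points\n", utf8_strlen(s));  /* prints "5 code points" */
    return 0;
}
```

The same mechanism is behind the inflation figure quoted above: a character that takes two bytes in a legacy East-Asian encoding typically takes three bytes in UTF-8, i.e. roughly 50% more.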
Received on Saturday, 9 February 2013 13:34:15 UTC