Re: Delta Compression and UTF-8 Header Values from Martin J. Dürst on 2013-02-10 (ietf-http-wg@w3.org from January to March 2013)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Sun, 10 Feb 2013 14:02:46 +0900
To: Willy Tarreau <w@1wt.eu>
CC: Mark Nottingham <mnot@mnot.net>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <511729F6.6000201@it.aoyama.ac.jp>

Hello Willy,

On 2013/02/09 22:33, Willy Tarreau wrote:
> On Sat, Feb 09, 2013 at 09:36:57PM +0900, "Martin J. Dürst" wrote:

>> It would be a good idea to try hard to make the new protocol forward
>> looking (or actually just acknowledge the present, rather than stay
>> frozen in the past) for the next 20 years or so in terms of character
>> encoding, too, and not only in terms of CPU/network performance.
>
> Well, don't confuse UTF-8 and Unicode.

As the main author of http://www.w3.org/TR/charmod/, I sure won't.

> UTF-8 is just a space-efficient way
> of transporting Unicode characters for western countries.

And for transporting ASCII-based commands/headers/markup together with 
non-ASCII data. That's the main reason the IETF adopted it.

> The encoding can
> become inefficient to transport for other charsets by inflating data by up
> to 50%

Well, that's actually an urban myth. The 50% is for CJK 
(Chinese/Japanese/Korean). For the languages/scripts of India, South 
East Asia, and a few more places, it can be 200%. (For texts purely in 
an alphabet in the Supplemental planes such as Old Italic, Shavian, 
Osmanya,..., it can be 300%, but I guess we can ignore these.) But these 
numbers only apply to cases that don't contain any ASCII at all.
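
The percentages above can be checked with a short sketch. It is only an illustration under a few assumptions. Python's standard `euc_jp` and `tis_620` codecs stand in for a legacy two-byte CJK encoding and a legacy one-byte Thai encoding. The sample strings and the helper name `inflation_percent` are made up for this example. Since there is no legacy encoding for the Supplemental-plane scripts, that case is compared against a hypothetical one-byte-per-character encoding.

```python
def inflation_percent(text, legacy_size):
    """Return by how many percent the UTF-8 form exceeds legacy_size bytes."""
    utf8_size = len(text.encode("utf-8"))
    return round(100 * (utf8_size - legacy_size) / legacy_size)


# CJK: 2 bytes per character in EUC-JP, 3 bytes per character in UTF-8.
japanese = "日本語"
cjk = inflation_percent(japanese, len(japanese.encode("euc_jp")))

# Thai: 1 byte per character in TIS-620, 3 bytes per character in UTF-8.
thai = "ภาษาไทย"
south_east_asia = inflation_percent(thai, len(thai.encode("tis_620")))

# Old Italic (Supplemental plane): 4 bytes per character in UTF-8.
# Assumption: compared with a hypothetical 1-byte-per-character encoding.
old_italic = "\U00010300\U00010301"
supplemental = inflation_percent(old_italic, len(old_italic))

# Mixed ASCII and Japanese, as in a typical header value: the ASCII part
# costs the same in both encodings, so the overall inflation is much lower.
mixed_text = "filename=日本語.txt"
mixed = inflation_percent(mixed_text, len(mixed_text.encode("euc_jp")))

print(cjk, south_east_asia, supplemental, mixed)
```

The last value shows why the worst-case numbers apply only to text without any ASCII: the ASCII portion is byte-for-byte identical in both encodings.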

> and may make compression less efficient.

That depends very much on the method of compression that's used.

> Also, processing it is
> particularly inefficient as you have to parse each and every byte to find
> a length, making string comparisons quite slow.

[See the follow-up mails in this thread.]
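
Independently of what the follow-up mails say, two well-known properties of UTF-8 bear on this point, sketched below under stated assumptions. First, the lead byte alone gives the length of a sequence, so counting characters does not require decoding every byte. Second, comparing the raw bytes of two well-formed UTF-8 strings gives the same order as comparing their code points. The function names and sample strings are hypothetical and chosen only for illustration.

```python
def utf8_seq_len(lead_byte):
    """Length in bytes of a UTF-8 sequence, determined from its lead byte."""
    if lead_byte < 0x80:
        return 1
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def count_code_points(data):
    """Count code points by hopping from lead byte to lead byte.

    Assumes data is well-formed UTF-8; continuation bytes are skipped.
    """
    index = count = 0
    while index < len(data):
        index += utf8_seq_len(data[index])
        count += 1
    return count


words = ["zebra", "Dürst", "日本語", "apple", "\U00010300"]

# Python compares str objects by code point ...
by_code_point = sorted(words)
# ... and bytes objects byte by byte (memcmp-style).
by_bytes = sorted(words, key=lambda word: word.encode("utf-8"))

print(by_code_point == by_bytes)
print(count_code_points("Dürst".encode("utf-8")))
```

Equality tests likewise need no decoding: two well-formed UTF-8 strings are equal exactly when their byte sequences are equal.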

> I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I hate using
> it), I'm saying that it's not *THE* solution to every problem. It's just *A*
> solution to *A* problem : "how to extend character sets in existing documents
> without having to re-encode them all". I don't think this specific problem is
> related to the scope of the HTTP/2 work, so at first glance, I'd say that
> UTF-8 doesn't seem to solve a known problem here.

The fact that I mentioned Websockets may have led to a
misunderstanding. I'm not proposing to use UTF-8 in bodies, only in
headers (though I wouldn't object to that, either). My understanding was
that James was talking about headers, and I was doing so, too.

Regards,   Martin.

Received on Sunday, 10 February 2013 05:03:21 UTC