Re: Delta Compression and UTF-8 Header Values

Hello Martin,

On Sun, Feb 10, 2013 at 02:02:46PM +0900, "Martin J. Dürst" wrote:
> >The encoding can
> >become inefficient to transport for other charsets by inflating data by up
> >to 50%
> 
> Well, that's actually an urban myth. The 50% is for CJK 
> (Chinese/Japanese/Korean).

With China's rapid development, it is perfectly imaginable that in 10 years
a significant portion of web traffic will be made of requests with Chinese
URLs, so we must not ignore that.

> For the languages/scripts of India, South 
> East Asia, and a few more places, it can be 200%. (For texts purely in 
> an alphabet in the Supplemental planes such as Old Italic, Shavian, 
> Osmanya,..., it can be 300%, but I guess we can ignore these.) But these 
> numbers only apply to cases that don't contain any ASCII at all.

I don't see how this is possible: you get 6 bits of data per continuation
byte plus a few more in the first byte, so you need 3 bytes to transport 16
bits, which is 50% for me :-)
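
Just to put numbers on it, here is a rough check (Python, with the legacy
codecs and sample characters picked arbitrarily) of what the inflation looks
like depending on which original charset you compare against:

    # Rough check of UTF-8 size inflation relative to legacy charsets.
    # The codecs and sample characters below are only examples.
    samples = [
        ("CJK",  "\u4e2d", "gb2312"),   # 2 bytes in GB2312
        ("Thai", "\u0e01", "tis-620"),  # 1 byte in TIS-620
    ]
    for name, ch, legacy in samples:
        u8 = len(ch.encode("utf-8"))
        lg = len(ch.encode(legacy))
        print("%s: %d byte(s) legacy -> %d bytes UTF-8 (+%d%%)"
              % (name, lg, u8, 100 * (u8 - lg) // lg))
    # CJK:  2 byte(s) legacy -> 3 bytes UTF-8 (+50%)
    # Thai: 1 byte(s) legacy -> 3 bytes UTF-8 (+200%)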

> >and may make compression less efficient.
> 
> That depends very much on the method of compression that's used.

I agree, but adding unused bits or entropy in general will make compression
algorithms less efficient.
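
Not a benchmark of course, but this is easy to measure on whatever header
captures people have, e.g. with zlib as a stand-in for whatever compressor
HTTP/2 ends up using (the sample values below are invented):

    # Compare raw vs compressed sizes of the same logical headers in two
    # encodings; zlib is only a stand-in, the header values are made up.
    import zlib

    ascii_hdrs = b"host: www.example.com\r\nuser-agent: test/1.0\r\n" * 20
    utf8_hdrs = "host: www.ex\u00e4mple.com\r\nuser-agent: test/1.0\r\n".encode("utf-8") * 20

    for label, blob in (("ascii", ascii_hdrs), ("utf-8", utf8_hdrs)):
        print(label, len(blob), "raw ->", len(zlib.compress(blob, 9)), "compressed")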

> >I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I hate using
> >it), I'm saying that it's not *THE* solution to every problem. It's just 
> >*A*
> >solution to *A* problem : "how to extend character sets in existing 
> >documents
> >without having to re-encode them all". I don't think this specific problem 
> >is
> >related to the scope of the HTTP/2 work, so at first glance, I'd say that
> >UTF-8 doesn't seem to solve a known problem here.
> 
> The fact that I mentioned Websockets may have led to a 
> misunderstanding. I'm not proposing to use UTF-8 only in bodies, just in 
> headers (I wouldn't object, though). My understanding was that James was 
> talking about headers, and I was doing so, too.

I was talking about header values too. As a developer of intermediaries,
I'm not interested in the body at all. I'm seeing people do ugly things
all the time, like regex-matching hosts with ".*\.example\.com" without
being aware of how slow it is to do that on each and every Host header field.
Typically, doing that with a UTF-8-aware library is even slower.
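
To make it concrete, compare a plain byte suffix check with the kind of
regex I keep seeing (hypothetical host value, CPython's re module standing
in for whatever library the intermediary happens to use):

    # A plain suffix check on the raw bytes does the job without any regex
    # engine or charset awareness; the host value and suffix are examples.
    import re
    import timeit

    host = b"www.static.example.com"
    pattern = re.compile(rb".*\.example\.com$")

    print(timeit.timeit(lambda: pattern.match(host), number=100_000))
    print(timeit.timeit(lambda: host.endswith(b".example.com"), number=100_000))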

That's why I'm having some concerns.

Ideally, everything we transport should be in its original form. If hosts
come from DNS, they should appear encoded exactly as the DNS server returned
them (even in the ugly IDN format). If paths are supposed to be UTF-8, let
them be sent in their raw original UTF-8 form without changing the format.
But then we don't want to mix Host and path, and we want to make it a first
rule that only the shortest forms are allowed. If most header fields are
pure ASCII (e.g. encodings), declare them as such. If some header fields
are enums, use enums and not text. Etc...
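
A sketch of what I mean by declaring the value type per field (all the
names below are made up, this is not a proposal for actual syntax):

    # Hypothetical per-field value types, so an intermediary knows what it
    # can assume without inspecting or transcoding anything.
    from enum import Enum

    class ValueType(Enum):
        RAW_BYTES = 1  # opaque octets, forwarded as received (e.g. Host from DNS)
        ASCII = 2      # guaranteed pure ASCII (e.g. Content-Encoding)
        UTF8 = 3       # raw UTF-8, shortest form only (e.g. the path)
        TOKEN = 4      # closed set of values, ideally sent as a code

    FIELD_TYPES = {
        b"host": ValueType.RAW_BYTES,
        b"content-encoding": ValueType.ASCII,
        b":path": ValueType.UTF8,
        b"expect": ValueType.TOKEN,   # effectively only "100-continue"
    }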

Regards,
Willy

Received on Sunday, 10 February 2013 07:27:17 UTC