Re: Delta Compression and UTF-8 Header Values from Zhong Yu on 2013-02-10 (ietf-http-wg@w3.org from January to March 2013)

From: Zhong Yu <zhong.j.yu@gmail.com>
Date: Sun, 10 Feb 2013 17:25:36 -0600
To: Willy Tarreau <w@1wt.eu>
Cc: Martin J. Dürst <duerst@it.aoyama.ac.jp>, Mark Nottingham <mnot@mnot.net>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CACuKZqEsSaPLvFtTpLDSD8y3d2X2wdtQAqFciNESxLNk7ipTHw@mail.gmail.com>

On Sun, Feb 10, 2013 at 4:58 PM, Zhong Yu <zhong.j.yu@gmail.com> wrote:
> On Sun, Feb 10, 2013 at 1:26 AM, Willy Tarreau <w@1wt.eu> wrote:
>> Hello Martin,
>>
>> On Sun, Feb 10, 2013 at 02:02:46PM +0900, "Martin J. Dürst" wrote:
>>> >The encoding can
>>> >become inefficient to transport for other charsets by inflating data by up
>>> >to 50%
>>>
>>> Well, that's actually an urban myth. The 50% is for CJK
>>> (Chinese/Japanese/Korean).
>>
>> With the fast development of China, it is perfectly imaginable that
>> in 10 years, a significant portion of the web traffic is made with
>> Chinese URLs, so we must not ignore that.
>
> The problem with Chinese characters in URLs is %-encoding:
>
>     %##%##%##
>
> 9 bytes for a single Chinese character, where ideally 2 bytes should suffice.
>
> However, this is a URI issue, not an HTTP issue. Is HTTP going to
> unilaterally "upgrade" URI format? That is possible, but it seems a
> big step, and it'll only decrease interop for some years to come.

... and I did not know about IRI...

Is HTTP2 going to adopt IRI?
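
To make the %-encoding arithmetic quoted above concrete, here is a small
sketch. The sample character and the helper name encoded_sizes are my own
picks for illustration, and reading "2 bytes" as UTF-16 (or a legacy
double-byte charset) is an assumption on my part.

```python
from urllib.parse import quote

def encoded_sizes(ch: str) -> dict:
    """Return the size in bytes of one character in three representations."""
    return {
        "utf8": len(ch.encode("utf-8")),
        "utf16": len(ch.encode("utf-16-be")),  # big-endian, no BOM
        "percent": len(quote(ch, safe="")),    # quote() percent-encodes the UTF-8 bytes
    }

sample = "中"  # U+4E2D, an arbitrary sample character
print(quote(sample, safe=""))   # %E4%B8%AD
print(encoded_sizes(sample))    # {'utf8': 3, 'utf16': 2, 'percent': 9}
```

Three UTF-8 bytes, each written as "%" plus two hex digits, gives the 9
bytes on the wire.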

>
> From my perspective, URLs are not a priority to optimize; they are
> usually not that big; servers can unilaterally use a more efficient
> encoding method for special chars. Maybe we should refrain from
> trying to change URI syntax.
>
> Zhong Yu
>
>>
>>> For the languages/scripts of India, South
>>> East Asia, and a few more places, it can be 200%. (For texts purely in
>>> an alphabet in the Supplemental planes such as Old Italic, Shavian,
>>> Osmanya,..., it can be 300%, but I guess we can ignore these.) But these
>>> numbers only apply to cases that don't contain any ASCII at all.
>>
>> I don't see how this is possible since you have 6 bits of data per byte
>> plus a few bits on the first byte, and you need 3 bytes to transport 16
>> bits, which is 50% for me :-)
>>
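
The two sets of figures above may just be using different baselines. The
sketch below is one possible reading, not something either poster stated:
it assumes Martin compares UTF-8 against a legacy charset (1 byte per
character for Indic scripts, 2 for CJK) while Willy compares against 16-bit
code units. The function name and the sample characters are hypothetical.

```python
def inflation_percent(text: str, baseline_bytes_per_char: int) -> float:
    """Percentage by which UTF-8 is larger than a fixed-width baseline."""
    utf8_len = len(text.encode("utf-8"))
    baseline_len = len(text) * baseline_bytes_per_char  # len() counts code points
    return 100 * (utf8_len - baseline_len) / baseline_len

print(inflation_percent("中", 2))           # CJK vs 2-byte baseline: 50.0
print(inflation_percent("क", 1))           # Devanagari vs 1-byte baseline: 200.0
print(inflation_percent("\U00010300", 1))  # Old Italic vs 1-byte baseline: 300.0
print(inflation_percent("क", 2))           # Devanagari vs 16-bit baseline: 50.0
```

Under these assumptions both the 200%/300% numbers and the 50% number come
out, depending only on which baseline is chosen.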
>>> >and may make compression less efficient.
>>>
>>> That depends very much on the method of compression that's used.
>>
>> I agree, but adding unused bits or entropy in general will make compression
>> algorithms less efficient.
>>
>>> >I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I hate
>>> >using it), I'm saying that it's not *THE* solution to every problem.
>>> >It's just *A* solution to *A* problem: "how to extend character sets in
>>> >existing documents without having to re-encode them all". I don't think
>>> >this specific problem is related to the scope of the HTTP/2 work, so at
>>> >first glance, I'd say that UTF-8 doesn't seem to solve a known problem
>>> >here.
>>>
>>> The fact that I mentioned Websockets may have led to a
>>> misunderstanding. I'm not proposing to use UTF-8 only in bodies, just in
>>> headers (I wouldn't object, though). My understanding was that James was
>>> talking about headers, and I was doing so, too.
>>
>> I was talking about header values too. As a developer of intermediaries,
>> I'm not interested in the body at all. I'm seeing people do ugly things
>> all the time, like regex-matching hosts with ".*\.example\.com" without
>> being aware how slow it is to do that on each and every Host header field.
>> Typically doing that with a UTF-8-aware library is even slower.
>>
>> That's why I'm having some concerns.
>>
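
For the Host matching case described above, a plain suffix comparison can
express the same check as the quoted pattern without a regex engine. This
is only a sketch: the function names are hypothetical, and I am assuming
the pattern is meant to match the whole header value, case-insensitively.

```python
import re

# The pattern from the message, anchored to the whole value via fullmatch().
HOST_RE = re.compile(r".*\.example\.com", re.IGNORECASE)

def matches_regex(host: str) -> bool:
    """Match the Host value with the regular expression."""
    return HOST_RE.fullmatch(host) is not None

def matches_suffix(host: str) -> bool:
    """Same check as a suffix comparison on the lowercased value."""
    return host.lower().endswith(".example.com")

for h in ("www.example.com", "WWW.Example.COM", "example.com",
          "www.example.com.evil.org"):
    print(h, matches_regex(h), matches_suffix(h))
```

Both functions should agree on these inputs. Neither one handles a port
suffix or a trailing dot in the Host value.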
>> Ideally, everything we transport should be in its original form. If hosts
>> come from DNS, they should appear encoded as they were returned by the DNS
>> server (even with the ugly IDN format). If paths are supposed to be UTF-8,
>> let them be sent in their raw original UTF-8 form without changing the
>> format. But then we don't want to mix Host and path, and we want to put as
>> a first rule that only the shortest forms are allowed. If most header fields
>> are pure ASCII (eg: encodings), declare them as such. If some header fields
>> are enums, use enums and not text. Etc...
>>
>> Regards,
>> Willy
>>
>>

Received on Sunday, 10 February 2013 23:26:04 UTC