Re: Delta Compression and UTF-8 Header Values from Martin J. Dürst on 2013-02-09 (ietf-http-wg@w3.org from January to March 2013)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Sat, 09 Feb 2013 21:36:57 +0900
To: Mark Nottingham <mnot@mnot.net>
CC: James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <511642E9.9010607@it.aoyama.ac.jp>
On 2013/02/09 8:53, Mark Nottingham wrote:
> My .02 -
>
> RFC2616 implies that the range of characters available in headers is ISO-8859-1

That's a leftover from the *very* early 1990s, when ISO-8859-1 was
actually a step forward from the various 'national' ISO-646 7-bit
encodings. At that time, it was not a bad idea of TimBL's, because it
made the Web work throughout Western Europe. UTF-8 hadn't even been
invented then.
(see http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt)

The IETF understood the advantages of UTF-8 in the late 1990s, see 
http://tools.ietf.org/html/rfc2277#section-3.1

These days, UTF-8 isn't a step forward, it's just plain obvious. The
browser folks at WHATWG would prefer not to have any Web pages in
anything other than UTF-8 anymore. That will take quite some time yet,
but the trend is very clear. See e.g.
http://googleblog.blogspot.jp/2010/01/unicode-nearing-50-of-web.html and
http://w3techs.com/technologies/details/en-utf8/all/all. WebSockets was
designed with UTF-8 and binary built in from the start. For all kinds of
other protocols, UTF-8 is a no-brainer, too.

It would be a good idea to try hard to make the new protocol
forward-looking (or actually just acknowledge the present, rather than
stay frozen in the past) for the next 20 years or so in terms of
character encoding, too, and not only in terms of CPU/network performance.

And James is right, it would allow us to throw out all kinds of encoding
cruft. That doesn't affect performance through the pipes, but it clearly
makes things a lot easier at the ends.
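
To make the "encoding cruft" concrete, here is a minimal Python sketch of
the mechanisms James listed, applied to one sample value. The host name
and file name are made-up examples, and the sketch uses only the Python
standard library; it illustrates the amount of special-case handling
needed at the ends and is not a proposal for how HTTP/2 should encode
anything.

```python
from email.header import Header, decode_header
from urllib.parse import quote

# Hypothetical sample values, chosen only for illustration.
host = "bücher.example"
filename = "Dürst résumé.txt"

# Punycode (IDNA): used for internationalized host names.
idna_host = host.encode("idna").decode("ascii")

# Percent-encoding of the UTF-8 bytes: used in URIs.
pct = quote(filename, safe="")

# RFC 5987 ext-value: charset, empty language tag, percent-encoded value.
rfc5987 = "UTF-8''" + quote(filename, safe="")

# RFC 2047 encoded-word (the Q and B codecs); the library picks one.
rfc2047 = Header(filename, charset="utf-8").encode()

# Round-trip the encoded-word to show that a receiver needs a decoder too.
decoded_bytes, decoded_charset = decode_header(rfc2047)[0]
roundtrip = decoded_bytes.decode(decoded_charset)

# Plain UTF-8: one encoding, no per-field special cases.
raw = filename.encode("utf-8")

print(idna_host)
print(pct)
print(rfc5987)
print(rfc2047)
print(roundtrip == filename)
print(len(raw), "bytes as raw UTF-8")
```

Each of the first four forms needs its own encoder and decoder at both
ends; with UTF-8 allowed directly in header values, only the last form
would be needed.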

Regards,   Martin.

> (while tilting the table heavily towards ASCII), and we've clarified that in bis to recommend ASCII, while telling implementations to handle anything else as opaque bytes.

> However, on the wire in HTTP/1, some bits are sent as UTF-8 (in particular, the request-URI, from one or two browsers).
>
> I think our choices are roughly:
>
> 1) everything is opaque bytes
> 2) default to ASCII, flag headers using non-ASCII bytes to preserve them
> 3) everything is ASCII, require implementations that receive non-ASCII HTTP/1.1 to translate to ASCII (e.g., convert IRIs to URIs)
>
> #1 is safest, but you don't get the benefit of re-encoding. The plan for the first implementation draft is to not try to take advantage of encoding, so it's the way we're likely to go -- for now.
>
> #2 starts to walk down the encoding path. There are many variants; we could default to blobs, default to UTF-8, etc. We could just flag "ASCII or blob" or we could define many, many possible encodings, as discussed.
>
> #3 seems risky to me.
>
> Cheers,
>
>
> On 09/02/2013, at 6:28 AM, James M Snell <jasnell@gmail.com> wrote:
>
>> Just going through more implementation details of the proposed delta
>> encoding... one of the items that had come up previously in early
>> http/2 discussions was the possibility of allowing for UTF-8 header
>> values. Doing so would allow us to move away from things like
>> punycode, pct-encoding, Q and B-Codecs, RFC 5987 mechanisms, etc., but
>> it would bring along a range of other issues we would need to deal with.
>>
>> One key challenge with allowing UTF-8 values, however, is that it
>> conflicts with the use of the static Huffman encoding in the proposed
>> Delta Encoding for header compression. If we allow for non-ASCII
>> characters, the static Huffman coding simply becomes too inefficient
>> and unmanageable to be useful. There are a few ways around it but none
>> of the strategies are all that attractive.
>>
>> So the question is: do we want to allow UTF-8 header values? Is it
>> worth the trade-off in less-efficient header compression? Or put
>> another way, is increased compression efficiency worth ruling out
>> UTF-8 header values?
>>
>> (Obviously there are other issues with UTF-8 values we'd need to
>> consider, such as http/1 interop)
>>
>> - James
>>
>
> --
> Mark Nottingham   http://www.mnot.net/
Received on Saturday, 9 February 2013 12:37:35 UTC