
Re: Delta Compression and UTF-8 Header Values

From: James M Snell <jasnell@gmail.com>
Date: Fri, 8 Feb 2013 17:10:02 -0800
Message-ID: <CABP7RbcRrjV7EhwoGbkWbYJEXeWOwH4gQuaCG7N0siQqeMtcag@mail.gmail.com>
To: Mark Nottingham <mnot@mnot.net>
Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
On Fri, Feb 8, 2013 at 3:53 PM, Mark Nottingham <mnot@mnot.net> wrote:
> My .02 -
>
> RFC2616 implies that the range of characters available in headers is ISO-8859-1 (while tilting the table heavily towards ASCII), and we've clarified that in bis to recommend ASCII, while telling implementations to handle anything else as opaque bytes.
>
> However, on the wire in HTTP/1, some bits are sent as UTF-8 (in particular, the request-URI, from one or two browsers).
>
> I think our choices are roughly:
>
> 1) everything is opaque bytes
> 2) default to ASCII, flag headers using non-ASCII bytes to preserve them
> 3) everything is ASCII, require implementations that receive non-ASCII HTTP/1.1 to translate to ASCII (e.g., convert IRIs to URIs)
>
> #1 is safest, but you don't get the benefit of re-encoding. The plan for the first implementation draft is to not try to take advantage of encoding, so it's the way we're likely to go -- for now.
>
> #2 starts to walk down the encoding path. There are many variants; we could default to blobs, default to UTF-8, etc. We could just flag "ASCII or blob" or we could define many, many possible encodings, as discussed.
>
> #3 seems risky to me.
>

I have the distinct feeling we're going to end up somewhere between #1
and #2, which means bad things for the static Huffman coding. If we
end up with #2, we'll be able to Huffman-code anything that is flagged
as ASCII, and won't be able to touch the rest.
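To make that trade-off concrete, here's a toy Python sketch (my own illustration, not the actual draft's static table): a static Huffman code built from a model where printable ASCII header bytes are common gives those bytes short codes and everything else long ones, so a value flagged as ASCII compresses, while pushing raw UTF-8 bytes through the same table actually expands them -- which is why unflagged values would have to go through as opaque bytes.

```python
# Toy model: printable ASCII bytes are common, all other byte values
# are rare.  (Frequencies are made up for illustration.)
import heapq
from itertools import count

def build_code_lengths(freqs):
    """Return {byte: Huffman code length in bits} for the model."""
    tie = count()  # tie-breaker so dicts are never compared by heapq
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, next(tie), merged))
    return heap[0][2]

freqs = {b: 1000 for b in range(0x20, 0x7F)}                # printable ASCII
freqs.update({b: 1 for b in range(256) if b not in freqs})  # everything else
lengths = build_code_lengths(freqs)

def encoded_bits(value, ascii_flagged):
    if ascii_flagged:                  # flagged ASCII: Huffman-code it
        return sum(lengths[b] for b in value)
    return 8 * len(value)              # opaque blob: raw bytes on the wire

ascii_val = b"text/html; charset=utf-8"
utf8_val = "na\u00efve caf\u00e9".encode("utf-8")

# ASCII value: Huffman beats raw.  UTF-8 value forced through the
# ASCII-tuned table: rare bytes get ~14-15 bit codes, so it expands.
print(encoded_bits(ascii_val, True), "bits vs", 8 * len(ascii_val), "raw")
print(sum(lengths[b] for b in utf8_val), "bits vs", 8 * len(utf8_val), "raw")
```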

- James

> Cheers,
>
>
> On 09/02/2013, at 6:28 AM, James M Snell <jasnell@gmail.com> wrote:
>
>> Just going through more implementation details of the proposed delta
>> encoding... one of the items that had come up previously in early
>> http/2 discussions was the possibility of allowing for UTF-8 header
>> values. Doing so would allow us to move away from things like
>> punycode, pct-encoding, Q and B-Codecs, RFC 5987 mechanisms, etc.,
>> but it would bring along a range of other issues we would need to
>> deal with.
>>
>> One key challenge with allowing UTF-8 values, however, is that it
>> conflicts with the use of the static huffman encoding in the proposed
>> Delta Encoding for header compression. If we allow for non-ASCII
>> characters, the static Huffman coding simply becomes too inefficient
>> and unmanageable to be useful. There are a few ways around it but none
>> of the strategies are all that attractive.
>>
>> So the question is: do we want to allow UTF-8 header values? Is it
>> worth the trade-off in less-efficient header compression? Or put
>> another way, is increased compression efficiency worth ruling out
>> UTF-8 header values?
>>
>> (Obviously there are other issues with UTF-8 values we'd need to
>> consider, such as http/1 interop)
>>
>> - James
>>
>
> --
> Mark Nottingham   http://www.mnot.net/
>
>
>
Received on Saturday, 9 February 2013 01:10:56 GMT
