RE: character encoding in header fields, was: SPDY Header Frames from Robert Brewer on 2012-07-17 (ietf-http-wg@w3.org from July to September 2012)

From: Robert Brewer <fumanchu@aminus.org>
Date: Tue, 17 Jul 2012 08:57:55 -0700
To: "Julian Reschke" <julian.reschke@gmx.de>, "James M Snell" <jasnell@gmail.com>
Cc: "Amos Jeffries" <squid3@treenet.co.nz>, <ietf-http-wg@w3.org>
Message-ID: <F1962646D3B64642B7C9A06068EE1E6415F17159@ex10.hostedexchange.local>

Julian Reschke wrote:
> On 2012-07-17 16:48, James M Snell wrote:
> > Tunneling 1.1 traffic via 2.0 would likely be the easy part; it's the
> 
> Not even that. Given an HTTP/1.1 message containing non-ASCII octets in
> a header field value, you simply don't know what Unicode characters to
> map them to.
> 
> This is not theoretical; some UAs process UTF-8 in Content-Disposition,
> some use the installation's locale character set.
> 
> Yes, this is a mess, but it's not clear to me how to break out of it
> without breaking *some* setups that currently "work".
> 
> > ...
> > The one thing we need to determine is: how critical is the ability to
> > support seamless down-level conversion from 2.0 to 1.1 within a request?
> > Is it acceptable for us to say that while 2.0 can be used to transport
> > 1.1 messages, the reverse is not possible.
> > ...
> 
> So how do you transport a 1.1 message inside 2.0 if it contains
> non-ASCII? Treat the header field value as binary?

Just to share a field note: the Python web community dealt with this exact problem recently with the advent of Python 3, which elevated Unicode quite a bit and exposed the problem more clearly to many. The chosen solution was to take the bytes of unknown encoding and decode them as ISO-8859-1 (which at least won't error on any byte sequence), and leave that mess for a higher layer (which presumably would have more context) to re-encode and decode again if it liked. Not a perfect solution, but better than nothing.
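
For anyone who wants to see the idea concretely, here is a minimal sketch of that approach in Python 3. The function names (decode_header_value, reinterpret) are made up for illustration and are not taken from any particular framework; the sketch only assumes that the lower layer hands over the raw header octets as bytes. ISO-8859-1 maps every byte 0x00-0xFF to exactly one code point, so the first step cannot fail and is fully reversible.

```python
def decode_header_value(raw: bytes) -> str:
    """Lower layer: turn octets of unknown encoding into a str.

    ISO-8859-1 assigns a code point to every possible byte value, so this
    never raises and loses no information.
    """
    return raw.decode("iso-8859-1")


def reinterpret(value: str, encoding: str = "utf-8") -> str:
    """Higher layer (hypothetical helper): recover the original octets and
    try a better-informed encoding, falling back to the ISO-8859-1 reading.
    """
    try:
        return value.encode("iso-8859-1").decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the str did not come from decode_header_value(), or the
        # octets are not valid in the guessed encoding: keep it unchanged.
        return value


# A UA that sends UTF-8 in Content-Disposition:
raw = 'attachment; filename="r\u00e9sum\u00e9.txt"'.encode("utf-8")

native = decode_header_value(raw)            # mojibake, but lossless
assert native.encode("iso-8859-1") == raw    # original octets recoverable

fixed = reinterpret(native)                  # application guesses UTF-8
```

The point of the design is that the guess about the real encoding is deferred, not made: the transport layer stays byte-transparent, and only code that knows (or is willing to guess) what the sender meant ever calls something like reinterpret().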

Robert Brewer
fumanchu@aminus.org

Received on Tuesday, 17 July 2012 15:58:41 UTC