Re: Unicode escape sequence | Re: draft-ietf-httpbis-header-structure-00, unicode range from Mark Nottingham on 2016-12-23 (ietf-http-wg@w3.org from October to December 2016)

From: Mark Nottingham <mnot@mnot.net>
Date: Fri, 23 Dec 2016 09:37:26 -0500
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: Martin Thomson <martin.thomson@gmail.com>, "Julian F. Reschke" <julian.reschke@gmx.de>, Alexey Melnikov <alexey.melnikov@isode.com>, Matthew Kerwin <matthew@kerwin.net.au>, Kari Hurtta <hurtta-ietf@elmme-mailer.org>, Ilari Liusvaara <ilariliusvaara@welho.com>, HTTP working group mailing list <ietf-http-wg@w3.org>, Poul-Henning Kamp <phk@varnish-cache.org>
Message-Id: <C8F21FA8-8B03-4E9C-B0E8-CD3C9CF028CE@mnot.net>

> On 14 Dec. 2016, at 7:54 am, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> 
> --------
> In message <CABkgnnWzOhkznH2HzweNegYo4dDHE+DT0PM=eCSvVr+-Wkup1A@mail.gmail.com>
> , Martin Thomson writes:
> 
>> I can't remember, is there actually a good reason why we can't just
>> start shoving UTF-8 in header fields?  I mean, h2 is probably OK with
>> this.
> 
> You mean "h2 end to end" ?  Yes, probably.
> 
> But what about H2->H1 and H1->H2 proxies/load-balancers/etc ?

Furthermore (and it bears repeating), the ends are never just "HTTP." 

On the server side, they're a mash of CGI, FastCGI and various other interfaces to languages like Perl, Python, PHP and Ruby, each with their own galaxy of library modules, frameworks and such.

On the client side, things get easier because of the relative alignment between browsers*, but you still have to consider non-browser clients, including spiders, robots, scrapers -- and the various libraries and infrastructure they use.

Both sides are mediated by intermediaries, whether those are "forward" proxies, "reverse" proxies, CDNs, load balancers, firewalls, or on-machine virus scanners (ew). If you're really lucky, the messages might pass through an ICAP hop or two.

Off-path, you need to consider logging and monitoring software, as well as configuration interfaces that allow headers to be manipulated (e.g. through a Web form -- that should be fun).

Potentially, all of these interfaces and pieces of software touch HTTP headers, and each might assume that header values are ASCII, ISO-8859-1, UTF-8, or an opaque bytestring.
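
As a rough illustration of that mismatch (a sketch, not taken from any particular implementation), the Python below decodes one header value under each assumption. The value "café" and the helper name decode_as are made up for the example.

```python
# Hypothetical header value: the UTF-8 bytes for "café" as they might
# appear on the wire. The value is an assumption chosen for illustration.
raw = "café".encode("utf-8")  # b'caf\xc3\xa9'


def decode_as(value: bytes, encoding: str) -> str:
    """Decode a header value the way a component assuming `encoding` might."""
    try:
        return value.decode(encoding)
    except UnicodeDecodeError:
        # A strict ASCII consumer may reject the value outright.
        return "<decode error>"


# Each entry models a different piece of software on or off the path.
views = {
    "ascii": decode_as(raw, "ascii"),
    "iso-8859-1": decode_as(raw, "iso-8859-1"),
    "utf-8": decode_as(raw, "utf-8"),
    "bytes": repr(raw),
}

for assumption, seen in views.items():
    print(f"{assumption:>10}: {seen}")
```

Only the UTF-8 consumer sees the intended text. The ISO-8859-1 consumer silently produces mojibake, the strict ASCII consumer fails, and the bytestring consumer passes the octets along without interpreting them.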

That's not to say that we can't use more than the least common denominator (ASCII), but we don't know how much trouble doing so will cause. And, as discussed previously, there aren't a lot of use cases for non-ASCII header values in standards (because few have a payload that's exposed to end users), so the reward for taking that risk is questionable.

Cheers,

* If you believe in Fetch.

--
Mark Nottingham   https://www.mnot.net/

Received on Friday, 23 December 2016 14:37:58 UTC