Re: UTF-8 fields (was Re: More on allowed field characters)

On Mon, 30 Aug 2021 at 04:55, Willy Tarreau <w@1wt.eu> wrote:
>
> On Mon, Aug 30, 2021 at 01:23:15PM +1000, Martin Thomson wrote:
> > On Fri, Aug 27, 2021, at 16:49, Poul-Henning Kamp wrote:
> > > In UTF-8 it becomes (0xe2, 0x98, 0xb9) which HPACK expands to 65 bits.
> > >
> > > In comparison "\u2639" only takes 48 bits.
> >
> > Huffman coding is optional, so it can stay at 48.
> >
> > The good news here is that there might be a point in our future where
> > interpreting fields as UTF-8 is interoperable.  The charset debate has ended
> > for sure, we just have to wait for the remnants of the other charsets to
> > clear themselves out.  Maybe there will be enough progress by 2028 that we'll
> > be able to do something else.
>
> I personally hope this will never happen for field names. UNICODE was
> made for humans and we're discussing protocols to let computers interact.
> Placing emojis there is useless. However we know that there is a very high
> risk of aliasing between different values, that *will* cause a lot of
> security trouble and interoperability issues.

This is a strong point worth emphasising. Many language-specific
frameworks decode HTTP headers into the language's native "string" type
for ease of use. Some languages may then normalise that input, which
opens an exciting new confused deputy vector: two components in the
same chain can disagree about which field a given name refers to.
While I agree that most modern implementations will happily
_tolerate_ a UTF-8 field name, we should steer well clear of ever
defining field names that use non-ASCII characters for exactly this
reason.
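
To make the aliasing concrete, here is a minimal sketch in Python. The
field name "X-Internal-Auth" and both functions are made up for
illustration; I am assuming a front end that compares raw bytes and a
back end that applies NFKC normalisation before lookup, not describing
any particular proxy or framework.

```python
import unicodedata


def proxy_allows(name: str) -> bool:
    """Hypothetical front-end filter: rejects the field only on a
    byte-for-byte match, ignoring ASCII case."""
    # bytes.lower() only folds ASCII letters; other bytes are untouched.
    return name.encode("utf-8").lower() != b"x-internal-auth"


def backend_key(name: str) -> str:
    """Hypothetical back-end lookup key: normalise to NFKC, then
    lowercase, as a framework with a native Unicode string type might."""
    return unicodedata.normalize("NFKC", name).lower()


# Swap the ASCII "A" for U+FF21 FULLWIDTH LATIN CAPITAL LETTER A.
spoofed = "X-Internal-Auth".replace("A", "\uff21")

print(proxy_allows(spoofed))   # the filter sees a different field name
print(backend_key(spoofed))    # the back end sees the protected one
```

With these assumptions the filter lets the spoofed name through,
because its bytes differ from the protected name, while the back end
folds it to the same key as the real field. Keeping field names to
ASCII removes the disagreement.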

Received on Monday, 6 September 2021 08:31:15 UTC