- From: Willy Tarreau <w@1wt.eu>
- Date: Sun, 28 May 2023 07:05:44 +0200
- To: Julian Reschke <julian.reschke@gmx.de>
- Cc: ietf-http-wg@w3.org
On Sun, May 28, 2023 at 05:51:49AM +0200, Julian Reschke wrote:
> On 27.05.2023 22:40, Willy Tarreau wrote:
> > Hi Julian,
> >
> > On Sat, May 27, 2023 at 11:55:59AM +0200, Julian Reschke wrote:
> > > On 27.05.2023 10:37, Willy Tarreau wrote:
> > > > ...
> > >
> > > Without having read all details:
> > >
> > > +1 to consider (!) just using raw octets
> > >
> > > +1 not to use sf-binary
> > >
> > > +1 to exclude ASCII controls (but not entirely sure about CR LF HTAB)
> > >
> > > but
> > >
> > > -1 to use anything but UTF-8 (I fail to see any reason for that) - and
> > > no, use of UTF-8 does not require revising things when Unicode code
> > > points are added
> >
> > Unless I'm totally mistaken, the maximum sequence length has increased
> > over time to support new code points. I remember having implemented
> > decoding functions myself a long time ago in a security component where
> > we were required to fail past 4 or maybe 5 bytes, and I later learned
> > that they had to extend it by one or two bytes to support new code
> > points. I don't remember the exact details, but my point is that we must
> > not impose this absurdly insecure decoding on infrastructure components,
> > or they will regularly be accused of blocking valid content :-/ As long
> > as they can pass it as-is and it's the recipient's job to figure out
> > whether it decodes successfully or not, that's fine by me.
>
> AFAIU, the UTF-8 encoding/decoding function (sequence of code points to
> octets and vice versa) never has changed (see
> https://datatracker.ietf.org/doc/html/rfc3629#section-3). Am I missing
> something here?

No, you're indeed right. But I have clear memories of this "common" approach
of iterating over a string as long as (c & 0xc0) == 0x80 (which was the main
concern), as well as the possibility of longer code sequences they didn't
want to support (that was in early 2000/2001). I'm still seeing traces of
this in the FSS-UTF proposal:

  https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

        Bits  Hex Min   Hex Max   Byte Sequence in Binary
   1      7   00000000  0000007f  0vvvvvvv
   2     11   00000080  000007FF  110vvvvv 10vvvvvv
   3     16   00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
   4     21   00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
   5     26   00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
   6     31   04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

So maybe back then I only had to implement the 16-bit one and they later
wanted to support the 21-bit one as well; I don't remember the exact details.
But there's less risk if the standardized codes have a fixed maximum length,
I agree. I just don't want to have to validate them when forwarding header
fields ;-)

Regards,
Willy
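As a rough sketch of the difference being discussed, and not code taken from any of the implementations mentioned in the thread (the function names here are made up for illustration): the naive continuation-byte walk accepts arbitrarily long sequences, while a check written against RFC 3629 fixes the sequence length at 1 to 4 bytes from the leading byte alone.

```c
/* Illustrative sketch only: contrast the naive "(c & 0xc0) == 0x80" walk
 * with a length check bounded by RFC 3629. Overlong-form and code point
 * range checks are left out to keep the example short. */
#include <stddef.h>
#include <stdio.h>

/* Naive approach: consume every continuation byte that follows, with no
 * upper bound, so 5- and 6-byte FSS-UTF-style sequences are accepted. */
static size_t naive_seq_len(const unsigned char *s, size_t n)
{
    size_t i = 1;

    while (i < n && (s[i] & 0xc0) == 0x80)
        i++;
    return i;
}

/* RFC 3629 approach: the leading byte fixes the length (1..4), and each
 * following byte must be a continuation byte. Returns 0 on invalid input. */
static size_t rfc3629_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (s[0] < 0x80)
        len = 1;
    else if (s[0] < 0xc2)
        return 0;                 /* continuation byte or overlong lead */
    else if (s[0] < 0xe0)
        len = 2;
    else if (s[0] < 0xf0)
        len = 3;
    else if (s[0] < 0xf5)
        len = 4;
    else
        return 0;                 /* old 5/6-byte leads are invalid today */

    if (len > n)
        return 0;
    for (i = 1; i < len; i++)
        if ((s[i] & 0xc0) != 0x80)
            return 0;
    return len;
}

int main(void)
{
    /* A 6-byte FSS-UTF-style sequence: swallowed by the naive walk,
     * rejected under RFC 3629. */
    const unsigned char seq[] = { 0xfd, 0xbf, 0xbf, 0xbf, 0xbf, 0xbf };

    printf("naive: %zu bytes, rfc3629: %zu bytes\n",
           naive_seq_len(seq, sizeof(seq)),
           rfc3629_seq_len(seq, sizeof(seq)));
    return 0;
}
```

Running the sketch prints "naive: 6 bytes, rfc3629: 0 bytes", which is the gap between the two behaviours that the message is describing.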
Received on Sunday, 28 May 2023 05:05:51 UTC