Re: New Version Notification for draft-nottingham-structured-headers-00.txt

On Sat, Nov 04, 2017 at 03:59:00PM +0800, Andy Green wrote:
> > The ESP in my alarm clock disagrees with you here :-)
> 
> Well, the chip doesn't care :-) it's Willy that disagrees with me.

Sure :-)

> I don't
> claim to know the 64-bitness of BASIC or whatever you are testing with and
> you don't describe it.

Sorry, it's nodemcu, the Lua interpreter.

> So I think it is not relevant there is some code
> that runs on ESP platforms that just has 32-bit types.  That can be true on
> any platform.

Except that it's very popular, and just one example among others. I'm not
trying to find statistics on how many are OK and how many are broken, I'm
just saying that there are limitations in many places and that we have to
deal with them. I mentioned in another e-mail (I'm just realizing now that
it was off-list) that a cleanly and carefully written HTTP server used as a
reference for nodemcu on this platform uses "tonumber()" to parse the
content-length, and unfortunately, as you can see below, tonumber() has its
limits as well:

  > print(tonumber("3333333333"))
  2147483647
  > print(tonumber("0x123"))
  291
  > print(tonumber("000123"))
  123

Just an indication that developers are not always aware of the limits
they have to deal with, nor of the validity domain of the functions they're
using. The purpose of Structured Headers is to have safer and more
portable parsers, so we have to take the above into consideration. Ideally
I'd like to see a set of safe HTTP number parsers progressively appear as
a replacement for tonumber(), atoi() and consorts, which web applications
should use over the long term.
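
To give an idea of what I have in mind, here's a rough sketch of such a
strict parser in C (the name http_parse_int is only illustrative, not a
proposal for an API):

  #include <stddef.h>
  #include <stdint.h>

  /* Strict decimal parser: accepts only ASCII digits, rejects empty
   * input, signs, "0x" prefixes and anything that does not fit into
   * an int64_t. Returns 0 on success, -1 on error.
   */
  static int http_parse_int(const char *s, size_t len, int64_t *out)
  {
      int64_t v = 0;
      size_t i;

      if (!len)
          return -1;

      for (i = 0; i < len; i++) {
          if (s[i] < '0' || s[i] > '9')
              return -1;                   /* not a plain decimal digit */
          if (v > (INT64_MAX - (s[i] - '0')) / 10)
              return -1;                   /* would overflow int64_t */
          v = v * 10 + (s[i] - '0');
      }
      *out = v;
      return 0;
  }

The point is simply that rejecting anything which isn't a plain sequence
of digits, and refusing to overflow silently, is cheap to do.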

> It remains a fact that both ESP8266 and ESP32 recommended stock gcc
> toolchain from Espressif, that you would write http stuff with, supports
> proper 64-bit long long, making deploying it and all the operators using it
> trivial.

Probably, but not everyone uses C on such a platform when there are
easy-to-use alternatives like Lua, micropython and probably others. That's
part of today's web landscape, unfortunately.

> >    > print(32768*32768)
> >    1073741824
> >    > print(32768*65536)
> >    -2147483648
> >    > print(65536*65536)
> >    0
> >    > print(131072*65536)
> >    0
> >    > print(2147483648)
> >    2147483647
> >    > print(3333333333)
> >    2147483647
> 
> Here are the same trials done in C using gcc on ESP32 (because that is what
> I have to hand; but it's basically the same gcc toolchain + 32-bit Tensilica
> core as ESP32).
> 
>         lwsl_notice("32768 * 32768 = %lld\n", (long long)32768 * (long
> long)32768);
>         lwsl_notice("32768 * 65536 = %lld\n", (long long)32768 * (long
> long)65536);
>         lwsl_notice("65536 * 65536 = %lld\n", (long long)65536 * (long
> long)65536);
>         lwsl_notice("131072 * 65536 = %lld\n", (long long)32768 * (long
> long)32768);
>         lwsl_notice("2147483648 = %lld\n", (long long)2147483648);
>         lwsl_notice("3333333333 = %lld\n", (long long)3333333333);
> 
> 4: 32768 * 32768 = 1073741824
> 4: 32768 * 65536 = 2147483648
> 4: 65536 * 65536 = 4294967296
> 4: 131072 * 65536 = 1073741824
> 4: 2147483648 = 2147483648
> 4: 3333333333 = 3333333333

That's perfect and I'm not surprised.

> > But in C, code using atoi() to parse integers is very common, and when the
> > developers are told that atoi() is too short and unreliable and that they
> > must use strtol(), they end up using it naively causing such hard to detect
> > problems when they're not aware of the impacts :
> > 
> >    #include <stdio.h>
> >    #include <stdlib.h>
> > 
> >    int main(int argc, char **argv)
> >    {
> >          printf("atoi:   0x%16lx (%ld)\n", (long)atoi(argv[1]), (long)atoi(argv[1]));
> >          printf("strtol: 0x%16lx (%ld)\n", (long)strtol(argv[1], NULL, 0), (long)strtol(argv[1], NULL, 0));
> >          return 0;
> >    }
> > 
> >    $ ./a.out 2147483647
> >    atoi:   0x        7fffffff (2147483647)
> >    strtol: 0x        7fffffff (2147483647)
> > 
> >    $  ./a.out 2147483648
> >    atoi:   0xffffffff80000000 (-2147483648)
> >    strtol: 0x        80000000 (2147483648)
> > 
> >    $ ./a.out 4294967296
> >    atoi:   0x               0 (0)
> >    strtol: 0x       100000000 (4294967296)
> >    $ ./a.out 00003333
> >    atoi:   0x             d05 (3333)
> >    strtol: 0x             6db (1755)
> 
> Ehhhhhhhh that's **long** you are using there.  I am talking about long
> long.  You underestimate C programmers if you think they don't know the
> difference.

Sorry, I forgot to mention that this was done on my 64-bit PC where long
and long long are 64-bit. Look carefully and you'll see that strtol() is
safe against overflow, but that with the base set to zero, as is commonly
found, it parses hex and octal prefixes as explained in the manual. I'm not
making up this example, I've seen such things many times. In fact people
start with atoi() until they're hit by a parsing issue, then switch to
strtol() and don't bother specifying the base, or even worse, purposely
support this behaviour because the same parser is used for content-length
and for configuration.
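
To be clear about what I'd consider a correct use of strtol() for this
purpose, something along these lines (just a sketch, the exact error
policy is up to the application):

  #include <errno.h>
  #include <stdlib.h>

  /* Explicit base 10, end-pointer check and ERANGE detection, so that
   * "0x123", trailing garbage and out-of-range values are rejected
   * instead of being silently misread.
   */
  static int parse_decimal(const char *s, long *out)
  {
      char *end;
      long v;

      errno = 0;
      v = strtol(s, &end, 10);             /* never base 0 here */
      if (end == s || *end != '\0')
          return -1;                       /* empty or trailing garbage */
      if (errno == ERANGE)
          return -1;                       /* overflow or underflow */
      *out = v;
      return 0;
  }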

> > That's why I think that we must really take care of 32-bit. Signed 32-bits
> > (hence unsigned 31 bits) are the only really portable integers. Most code
> > works pretty well above this, but naive implementations easily get caught
> > like this.
> 
> This is why the C guys invented int64_t and friends (which are just typedefs
> into long long or whatever).  That is **thoroughly** portable, not just
> ESP8266 + ESP32 gcc but even windows has <stdint.h> with them.

I agree. The web is just not only C (unfortunately, as that's by far my
preferred language). It even used to be shell scripts for CGI in an era
when most shells were limited to 32-bit evaluation.

> I don't have an opinion on whether the thing being discussed should deal
> with BIGINT, it just seemed to be missing from the discussion.  If it did,
> it would cover every other smaller limit case, but it would force people to
> deal with their length + MSB data format.  For anything other than MPINT /
> BIGINT, a 64-bit limit will cover anything related to dataset size for the
> foreseeable future.

I really think we should *encourage* 64-bit processing, *suggest* that
anything larger is possible provided it is handled with extreme care,
*remind* that maximum interoperability is achieved below 2^31, and
*enforce* strict parsing and detection of overflows in any case. If we
design with this in mind, we should be able to make the best design
choices for most use cases and ensure that incompatibilities are safely
covered.
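
As a sketch of this layered approach (the names here are purely
illustrative):

  #include <stdint.h>

  /* Purely illustrative policy layer on top of a strict 64-bit parser:
   * syntax errors and 64-bit overflows are rejected outright, values
   * above 2^31 - 1 are accepted but flagged so the application knows
   * it is leaving the maximum-interoperability range.
   */
  enum http_int_class { HTTP_INT_SAFE, HTTP_INT_LARGE, HTTP_INT_INVALID };

  static enum http_int_class classify_http_int(int parse_err, int64_t v)
  {
      if (parse_err)
          return HTTP_INT_INVALID;         /* bad syntax or > 2^63 - 1 */
      if (v > (int64_t)INT32_MAX)
          return HTTP_INT_LARGE;           /* valid but above 2^31 - 1 */
      return HTTP_INT_SAFE;
  }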

Willy

Received on Saturday, 4 November 2017 09:30:20 UTC