- From: Andy Green <andy@warmcat.com>
- Date: Sat, 4 Nov 2017 18:00:10 +0800
- To: Willy Tarreau <w@1wt.eu>
- Cc: Matthew Kerwin <matthew@kerwin.net.au>, Mark Nottingham <mnot@mnot.net>, Kazuho Oku <kazuhooku@gmail.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, HTTP Working Group <ietf-http-wg@w3.org>
On 11/04/2017 05:29 PM, Willy Tarreau wrote:
> On Sat, Nov 04, 2017 at 03:59:00PM +0800, Andy Green wrote:
>>> The ESP in my alarm clock disagrees with you here :-)
>>
>> Well, the chip doesn't care :-) it's Willy that disagrees with me.
>
> Sure :-)
>
>> I don't claim to know the 64-bitness of BASIC or whatever you are
>> testing with and you don't describe it.
>
> Sorry, it's nodemcu, the Lua interpreter.
>
>> So I think it is not relevant that there is some code that runs on ESP
>> platforms that just has 32-bit types. That can be true on any platform.
>
> Except that it's very popular, and just one example among others. I'm not

Mmm, Lua is not that popular, and it's clearly the wrong tool to write an
http server with if it dies at 32-bit, because people have routinely been
serving files larger than 2GB for a long time now. Bash is a much more
popular example that historically only had to support 32-bit, although
modern implementations do support 64-bit. Again, 32-bit bash isn't the
right tool for serving such files either... no news there. I don't think
the larger web standards should be designed around these transitory,
implementation-specific limitations.

> trying to find statistics on how many are OK and how many are broken, I'm
> just saying that there are limitations at many places and that we have to
> deal with these. I mentioned in another e-mail (I'm just realizing it was
> off-list now) that a cleanly and carefully written HTTP server used as a
> reference for nodemcu on this platform uses "tonumber()" to parse the
> content-length, and unfortunately as you can see below, tonumber() has
> its limits as well:
>
> > print(tonumber("3333333333"))
> 2147483647
> > print(tonumber("0x123"))
> 291
> > print(tonumber("000123"))
> 123

Okay... so that specific implementation is broken and no good for dealing
with the >2GB reality we have already been living in for many years...
there are many things that are no good for that task... these things can
be forced into view by interoperability test suites...

> Just an indication that developers are not always aware of the limits
> they have to deal with nor the validity domain of the functions they're
> using. The purpose of Structured Headers is to have safer and more
> portable parsers, so we have to consider this above. Ideally I'd like
> to see a set of HTTP number parsers safe for use progressively appear as
> a replacement for tonumber(), atoi() and consorts that web applications
> should use over the long term.
>
>> It remains a fact that the recommended stock gcc toolchain from
>> Espressif for both ESP8266 and ESP32, the one you would write http stuff
>> with, supports proper 64-bit long long, making deploying it and all the
>> operators using it trivial.
>
> Probably, but not everyone uses C on such a platform when you have
> easy-to-use alternatives like Lua, micropython and probably others.
> That's part of today's web landscape unfortunately.

Micropython... seems written in C

https://github.com/micropython/micropython

Lua... written in C

https://github.com/lua/lua

These broken, 32-bit-limited implementations are *already inadequate* for
the >2GB world we have been living in for a long time now. Their problem
is just their specific implementation decision to stick at 32-bit
internally... that's just their problem, not everybody's problem. I don't
think trying to restrict standards to meet their arbitrary and unnecessary
(since the compiler underneath supports long long / int64_t just fine)
limitations is a reasonable way forward.
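
To make that concrete, the kind of "safe HTTP number parser" being asked
for really is only a few lines once you accept a 64-bit type. This is just
a minimal sketch of my own (the function name, the reject-on-junk policy
and the test values are mine, not from any draft or reference server); it
builds the same with the Espressif gcc toolchains as on a 64-bit PC:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* strict base-10 digits only, no sign, no whitespace, no hex / octal;
 * returns 0 and stores the value, or -1 on junk or 64-bit overflow */
static int
parse_http_number(const char *s, uint64_t *out)
{
        uint64_t v = 0;

        if (!*s)
                return -1;              /* empty string is not a number */

        for (; *s; s++) {
                if (*s < '0' || *s > '9')
                        return -1;      /* reject "0x...", signs, spaces */
                if (v > (UINT64_MAX - (uint64_t)(*s - '0')) / 10)
                        return -1;      /* would overflow 64 bits */
                v = v * 10 + (uint64_t)(*s - '0');
        }

        *out = v;

        return 0;
}

int main(void)
{
        const char *tests[] = { "2147483648", "4294967296", "3333333333",
                                "0x123", "000123", "99999999999999999999" };
        uint64_t v;
        unsigned i;

        for (i = 0; i < sizeof(tests) / sizeof(tests[0]); i++) {
                if (parse_http_number(tests[i], &v))
                        printf("%s: rejected\n", tests[i]);
                else
                        printf("%s: %" PRIu64 "\n", tests[i], v);
        }

        return 0;
}

The values that wrapped or saturated above come back intact, and the hex
and overflow cases are refused rather than silently mangled.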
>>> > print(32768*32768)
>>> 1073741824
>>> > print(32768*65536)
>>> -2147483648
>>> > print(65536*65536)
>>> 0
>>> > print(131072*65536)
>>> 0
>>> > print(2147483648)
>>> 2147483647
>>> > print(3333333333)
>>> 2147483647
>>
>> Here are the same trials done in C using gcc on ESP32 (because that is
>> what I have to hand; but it's basically the same gcc toolchain + 32-bit
>> Tensilica core as the ESP8266).
>>
>>   lwsl_notice("32768 * 32768 = %lld\n", (long long)32768 * (long long)32768);
>>   lwsl_notice("32768 * 65536 = %lld\n", (long long)32768 * (long long)65536);
>>   lwsl_notice("65536 * 65536 = %lld\n", (long long)65536 * (long long)65536);
>>   lwsl_notice("131072 * 65536 = %lld\n", (long long)131072 * (long long)65536);
>>   lwsl_notice("2147483648 = %lld\n", (long long)2147483648);
>>   lwsl_notice("3333333333 = %lld\n", (long long)3333333333);
>>
>> 4: 32768 * 32768 = 1073741824
>> 4: 32768 * 65536 = 2147483648
>> 4: 65536 * 65536 = 4294967296
>> 4: 131072 * 65536 = 8589934592
>> 4: 2147483648 = 2147483648
>> 4: 3333333333 = 3333333333
>
> That's perfect and I'm not surprised.

Okie, so we agree there is no problem with ESP devices using 64-bit types.

>>> But in C, code using atoi() to parse integers is very common, and when
>>> the developers are told that atoi() is too short and unreliable and
>>> that they must use strtol(), they end up using it naively, causing such
>>> hard-to-detect problems when they're not aware of the impacts:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     printf("atoi: 0x%16lx (%ld)\n", (long)atoi(argv[1]), (long)atoi(argv[1]));
>>>     printf("strtol: 0x%16lx (%ld)\n", (long)strtol(argv[1], NULL, 0), (long)strtol(argv[1], NULL, 0));
>>>     return 0;
>>> }
>>>
>>> $ ./a.out 2147483647
>>> atoi: 0x        7fffffff (2147483647)
>>> strtol: 0x        7fffffff (2147483647)
>>>
>>> $ ./a.out 2147483648
>>> atoi: 0xffffffff80000000 (-2147483648)
>>> strtol: 0x        80000000 (2147483648)
>>>
>>> $ ./a.out 4294967296
>>> atoi: 0x               0 (0)
>>> strtol: 0x       100000000 (4294967296)
>>>
>>> $ ./a.out 00003333
>>> atoi: 0x             d05 (3333)
>>> strtol: 0x             6db (1755)
>>
>> Ehhhhhhhh that's **long** you are using there. I am talking about long
>> long. You underestimate C programmers if you think they don't know the
>> difference.
>
> Sorry, I forgot to mention that this was done on my 64-bit PC where long
> and long long are 64-bit. Look carefully and you'll see that strtol() is
> safe against overflow, but that with base set to zero, as commonly found,
> it parses hex and octal as explained in the manual. I'm not making up
> this example, I've seen such things used a lot of times. In fact people
> start with atoi() until they're hit by a parsing issue, then switch to
> strtol() and don't care about specifying the base, or even worse,
> purposely support this because the same parser is used for content-length
> and for configuration.

Well, that usage of strtol() is broken. There are other broken C lib APIs
like gets(), but people still regard string processing with C as possible.
Lately there is a trend for tools like h2spec or Autobahn to confirm
interoperation, and they will discover and flag this kind of issue just
fine.
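
For what it's worth, even staying with the libc parsers, the pitfalls shown
above go away when they are used deliberately. Again only a minimal sketch
of my own, not taken from any real server: force base 10, check errno for
ERANGE, and reject trailing junk via the end pointer:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

int main(int argc, char **argv)
{
        char *end;
        long long v;

        if (argc < 2)
                return 1;

        errno = 0;
        v = strtoll(argv[1], &end, 10); /* base 10: no "0x" hex, no octal */

        if (errno == ERANGE)                    /* out of long long range */
                printf("%s: overflow\n", argv[1]);
        else if (end == argv[1] || *end)        /* nothing parsed, or junk */
                printf("%s: not a number\n", argv[1]);
        else if (v < 0)                         /* a length can't be negative */
                printf("%s: negative\n", argv[1]);
        else
                printf("%s: %lld\n", argv[1], v);

        return 0;
}

which should behave along these lines:

$ ./a.out 2147483648
2147483648: 2147483648
$ ./a.out 0x123
0x123: not a number
$ ./a.out 99999999999999999999
99999999999999999999: overflow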
>>> That's why I think that we must really take care of 32-bit. Signed
>>> 32-bit integers (hence unsigned 31-bit) are the only really portable
>>> integers. Most code works pretty well above this, but naive
>>> implementations easily get caught like this.
>>
>> This is why the C guys invented int64_t and friends (which are just
>> typedefs into long long or whatever). That is **thoroughly** portable,
>> not just ESP8266 + ESP32 gcc but even windows has <stdint.h> with them.
>
> I agree. The web is just not only C (unfortunately, as that's by far my
> preferred language). It even used to be shell scripts for CGI in an era
> where most shells were limited to 32-bit evaluation.

As we see though, underneath the Lua and micropython examples, it is C.
Yes, those restricted implementations exist, but it's equally a clear
reality that the web has not stopped at 2GB or 4GB for many years now, and
neither have the C compilers. So the problem is only that these specific
language implementations must either grow ways to represent larger ints or
become irrelevant.

>> I don't have an opinion on whether the thing being discussed should deal
>> with BIGINT, it just seemed to be missing from the discussion. If it
>> did, it would cover every other smaller limit case, but it would force
>> people to deal with their length + MSB data format. For anything other
>> than MPINT / BIGINT, a 64-bit limit will cover anything related to
>> dataset size for the foreseeable future.
>
> I really think we should *encourage* 64-bit processing, *suggest* that
> anything larger is possible provided it is handled with extreme care,
> *remind* that maximum interoperability is achieved below 2^31, and
> *enforce* strict parsing and detection of overflows in any case. If we
> design with this, we should be able to make the best design choices for
> most use cases and ensure that incompatibilities are safely covered.

I am afraid a 32-bit-only limit has been basically useless for general web
use for many years already. Languages that don't even have a way to deal
with >32-bit quantities are fundamentally broken and useless for a web
with >2GB and >4GB objects. That can't be the guide for standards;
otherwise, by the same logic, we would have inherited a web that coddles
WIN16 / 8088 limitations.
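
Just to underline the portability point, a self-contained version of the
earlier trial needs nothing beyond <stdint.h> / <inttypes.h>; the same few
lines should print the same values with gcc for ESP8266 / ESP32, on 32-bit
ARM, on a 64-bit PC, or on windows (assuming a C99-era toolchain). Only a
sketch, but there is nothing Espressif-specific in it:

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
        /* the same values that wrapped or saturated in the 32-bit Lua build */
        int64_t a = (int64_t)65536 * 65536;     /* 4294967296 */
        int64_t b = INT64_C(2147483648);        /* 2^31, no truncation */
        int64_t c = INT64_C(3333333333);

        printf("%" PRId64 " %" PRId64 " %" PRId64 "\n", a, b, c);

        return 0;
}

which should print 4294967296 2147483648 3333333333 everywhere.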
-Andy

> Willy
>

Received on Saturday, 4 November 2017 10:01:05 UTC