Re: New Version Notification for draft-nottingham-structured-headers-00.txt

Hi Willy,

2017-11-01 15:41 GMT+09:00 Willy Tarreau <w@1wt.eu>:
> Hi Kazuho,
>
> On Wed, Nov 01, 2017 at 10:52:53AM +0900, Kazuho Oku wrote:
>> How long is the expected lifetime of Structured Headers? Assuming that
>> it would be used for 20 years (HTTP has been used for 20+ years, TCP
is used for 40+ years), there is a fair chance that the 49-bit limit
>> is too small. Note that even if we switch to transferring headers in
>> binary-encoded forms, we might continue using Structured Headers for
>> textual representation.
>>
>> Do we want to risk making _all_ our future implementations complex in
>> exchange for being friendly to _some_ programming languages without
>> 64-bit integers?
>
> That's an interesting question that cannot be solved just by a Yes or a
> No. Making a language totally unable to implement a protocol (especially
> HTTP) is a no-go, and may even ignite the proposal of alternatives for
> some parts. So we must at least ensure that it is reasonably possible
> to implement the protocol even if that requires a little bit of effort
> and if performance sucks, because people choosing such languages despite
> such limitations will do it only for convenience, and the languages will
> evolve to make their lives easier in the future. What must really be
> avoided is everything requiring a full-range 64-bit internal
> representation all the time. But if 64-bit are needed only for large
> files, most developers will consider that their implementation is
> sufficient for *their* use cases (even if only 31 or 32 bits).

I agree.

What I am arguing is that we should allow applications to send all
integers (especially all of those that fit into 64 bits) as a series
of digits, rather than requiring the use of strings, labels, or base64 for
storing them.

In my view, an application would not be required to handle every
number representable in 64 bits, even if Structured Headers defines
handling of 64-bit numbers as a minimal requirement. For example, you
cannot download a 1 EiB file unless your filesystem supports storing
such large files.
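To illustrate, a decoder with a smaller internal limit can parse the
series of digits and simply reject values above its own capacity while
remaining interoperable with peers that accept the full 64-bit range.
A minimal sketch in C (the function name and the `limit` parameter are
illustrative, not taken from the draft):

```c
#include <stddef.h>
#include <stdint.h>

/* Parse a series of decimal digits into *out. Returns -1 on a
 * non-digit character, or when the value exceeds `limit`, the
 * implementation's own capacity (e.g. UINT32_MAX for a 32-bit
 * decoder). Illustrative sketch, not production code. */
static int parse_bounded_uint(const char *s, size_t len,
                              uint64_t limit, uint64_t *out)
{
    uint64_t v = 0;

    if (len == 0)
        return -1;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < '0' || s[i] > '9')
            return -1;
        unsigned d = (unsigned)(s[i] - '0');
        /* would v * 10 + d exceed the local limit? */
        if (v > limit / 10 || (v == limit / 10 && d > limit % 10))
            return -1;
        v = v * 10 + d;
    }
    *out = v;
    return 0;
}
```

A 16-bit alarm-clock implementation and a full 64-bit server can both
use this shape; only `limit` differs.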

> This is what the text-based integer representation has brought us over
> the last two decades: good interoperability between implementations with
> very different limits. The ESP8266 in my alarm clock with 50kB of RAM
> might very well be using 16-bit integers for content-length and despite
> this it's compatible with the rest of the world. Similarly haproxy's
> chunk size parser used to be limited to 28 bits for a while and was
> only recently raised to 32 after hitting this limit once.
>
>> The other thing I would like to point out is that mandating support
>> for 64-bit integer fields does not necessarily mean that you cannot
>> easily represent such kind of fields when using the programming
>> languages without 64-bit integers.
>
> It only depends if all bits of the fields are always needed or not in
> general. If it's just a size, anyone can decide that limiting their
> implementation to 32-bit can be OK for their purpose.
>
>> This is because there is no need to store an integer field using
>> integers. Decoders of Structured Headers can retain the representation
>> as a string (i.e. series of digits), and applications can convert them
>> to numbers when they want to use the value for calculation.
>
> It can indeed be an option as well. A punishment I would say.

Actually I see it as an optimization.

Assuming that we would not actually be using all the numbers sent
using Structured Headers, it makes sense to delay converting them to
an internal numeric representation (i.e. `int64_t`) until it becomes
necessary. In fact, many of us already apply this kind of optimization.
For example, many HTTP clients keep the Last-Modified header as the
string received, since it is seldom necessary to perform calculations
using the value. It is wise to keep them as strings (and send them
back as part of the If-Modified-Since header). What I am suggesting is
that the fields of Structured Headers can be handled the same way.

Note that such an optimization might not make sense for superscalar
CPUs, since they can validate the numeric representation (i.e. check
that the characters are digits) at the same time as converting it to
an integral type. But it could still be a good optimization for the
embedded devices with less sophisticated CPUs that we are trying to
take care of in this thread.
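A minimal sketch of such lazy conversion in C (the `sh_integer` struct
and function names are hypothetical, not from any implementation): the
decoder validates the digits once at parse time, keeps a pointer into
the received bytes, and converts to `int64_t` only on first use,
caching the result.

```c
#include <stddef.h>
#include <stdint.h>

/* A decoded integer field, kept in its textual form. The digits are
 * assumed to have been validated when the header was parsed. */
struct sh_integer {
    const char *digits;  /* points into the received header block */
    size_t      len;
    int         cached;  /* 1 once `value` has been computed */
    int64_t     value;
};

/* Convert on demand; subsequent calls reuse the cached result. */
static int64_t sh_integer_get(struct sh_integer *f)
{
    if (!f->cached) {
        int64_t v = 0;
        for (size_t i = 0; i < f->len; i++)
            v = v * 10 + (f->digits[i] - '0');  /* digits pre-validated */
        f->value = v;
        f->cached = 1;
    }
    return f->value;
}
```

Fields that are never used numerically (like the Last-Modified example
above) never pay the conversion cost at all.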

>> Since the size of the files transmitted today do not exceed 1PB, such
>> approach will not have any issues today. As they start handling files
>> larger than 1PB, they will figure out how to support 64-bit integers
>> anyways. Otherwise they cannot access the file! Considering that, I
>> would argue that we are unlikely to see issues in the future as well,
>> with programming languages that do not support 64-bit integers _now_.
>
> I totally agree with this. I like to optimize for valid use cases, and
> in general use cases vary with implementations. Similarly I want my code
> to be fast on fast machines (because people buy fast machines for
> performance) and small on resource constrained machines. People adapt
> their hardware and software to their needs, the design must scale, not
> necessarily the implementations.
>
>> To summarize, the 49-bit limit is scary considering the expected
>> lifetime of a standard, and we can expect programming languages that
>> do not support 64-bit integers to start supporting them as we start
>> using files of petabyte size.
>
> I think we can solve such issues by specifying some protocol limits
> depending on implementations. Not doing this is what has caused some
> issues in the past. Content-lengths larger than 2^32 causing some
> implementations to wrap for example have been used to cause request
> smuggling attacks. But by insisting on boundary checking for the
> critical parts of the protocol depending on the storage type (for
> well known types), we can at least help implementers remain safe.
>
>> > If your 64 bit number is an identifier, the only valid operation
>> > on it is "check for identity", and taking the detour over a decimal
>> > representation is not only uncalled for, but also very inefficient
>> > in terms of CPU cycles.
>> >
>> > The natural and most efficient format for such an identifier would
>> > be base64 binary, but if for some reason it has to be decimal, say
>> > convenience for human debuggers, one could prefix it with a "i" and
>> > send it as a label.
>>
>> Requiring the use of base64 goes against the merit of using a textual
>> representation. The reason we use textual representation is because it
>> is easy for us to read and use. On most systems, 64-bit IDs are
>> represented as numbers. So people would want to transmit them in the
>> same representation over HTTP as well. So to me the question is
>> whether we want 64-bit integers to be sent as numbers or as strings (or
>> labels). That is the reason why I only compared the two options in my
>> previous mail.
>
> There is also the option of considering such identifiers as arrays of
> 32-bit for implementers, since they don't need to perform operations
> on them. This is something we can explain in a spec (ie: how to parse
> identifiers in general).
>
>> In this respect, another issue we should consider is that we can more
>> effectively compress the data if we know that it is a number
>> (compared to compressing it as text or a label).
>
> Yep, I like the principle of variable length integers. It's not very
> efficient in CPU cycles but is still much more than any text-based
> representation when all code points are valid as it doesn't require
> syntax validation. But the benefits are huge for all small elements.
> We did this in haproxy's peers protocol (used to synchronize internal
> tables between multiple nodes) because the most common types exchanged
> were server identifiers (typically values lower than 10 for 95% of
> deployments), and incremental counters (up to 64-bit byte counts) and
> we didn't want to use multiple types. By proceeding like this we can
> ensure that implementations are not very difficult and can accept
> limitations depending on their targeted use cases. And the bandwidth
> remains as small as possible.
>
> By the way it is important to keep in mind that a data type is not
> necessarily related to the programming language's internal representation.
> IP addresses are not numbers, identifiers are not numbers, even though
> they can often be represented as such for convenience. Numbers have a
> fairly different distribution with far more small values than large ones.
> Identifiers (and addresses) on the opposite are more or less uniformly
> distributed and do not benefit from variable length compression.
>
> Cheers,
> Willy



-- 
Kazuho Oku

Received on Thursday, 2 November 2017 01:05:55 UTC