Re: New Version Notification for draft-nottingham-structured-headers-00.txt

Hi Kazuho,

On Wed, Nov 01, 2017 at 10:52:53AM +0900, Kazuho Oku wrote:
> How long is the expected lifetime of Structured Headers? Assuming that
> it would be used for 20 years (HTTP has been used for 20+ years, TCP
> is used for 40+ years), there is fair chance that the 49¾ bits limit
> is too small. Note that even if we switch to transferring headers in
> binary-encoded forms, we might continue using Structured Headers for
> textual representation.
> 
> Do we want to risk making _all_ our future implementations complex in
> exchange of being friendly to _some_ programming languages without
> 64-bit integers?

That's an interesting question that cannot be answered with just a yes
or a no. Making a language totally unable to implement a protocol
(especially HTTP) is a no-go, and may even ignite proposals for
alternatives to some parts. So we must at least ensure that it remains
reasonably possible to implement the protocol, even if that requires a
little bit of effort and even if performance sucks, because people
choosing such languages despite their limitations do it for
convenience, and the languages will evolve to make their lives easier
in the future. What must really be avoided is anything requiring a
full-range 64-bit internal representation all the time. But if 64 bits
are needed only for large files, most developers will consider their
implementation sufficient for *their* use cases (even if it only
supports 31 or 32 bits).

This is what the text-based integer representation has brought us over
the last two decades: good interoperability between implementations
with very different limits. The ESP8266 in my alarm clock with 50kB of
RAM might very well be using 16-bit integers for the content-length,
and despite this it's compatible with the rest of the world. Similarly,
haproxy's chunk-size parser was limited to 28 bits for a while and was
only recently raised to 32 after hitting that limit once.

> The other thing I would like to point out is that mandating support
> for 64-bit integer fields does not necessarily mean that you cannot
> easily represent such kind of fields when using the programming
> languages without 64-bit integers.

It really depends on whether all bits of the field are always needed
in general. If it's just a size, anyone can decide that limiting their
implementation to 32 bits is OK for their purpose.

> This is because there is no need to store an integer field using
> integers. Decoders of Structured Headers can retain the representation
> as a string (i.e. series of digits), and applications can convert them
> to numbers when they want to use the value for calculation.

It can indeed be an option as well. A punishment, I would say.
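
To illustrate the idea (a rough sketch with hypothetical names,
assuming canonical integers with no sign and no leading zeros): the
decoder keeps the digits verbatim, and comparisons can still be done
on the strings themselves without any 64-bit arithmetic:

  #include <string.h>

  /* Compare two non-negative integer fields kept as digit strings,
   * without converting them: more digits means a larger value, equal
   * lengths fall back to a byte-wise comparison.
   */
  static int cmp_digits(const char *a, const char *b)
  {
      size_t la = strlen(a), lb = strlen(b);

      if (la != lb)
          return la < lb ? -1 : 1;
      return memcmp(a, b, la);
  }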

> Since the size of the files transmitted today do not exceed 1PB, such
> approach will not have any issues today. As they start handling files
> larger than 1PB, they will figure out how to support 64-bit integers
> anyways. Otherwise they cannot access the file! Considering that, I
> would argue that we are unlikely to see issues in the future as well,
> with programming languages that do not support 64-bit integers _now_.

I totally agree with this. I like to optimize for valid use cases, and
in general use cases vary with implementations. Similarly, I want my
code to be fast on fast machines (because people buy fast machines for
performance) and small on resource-constrained machines. People adapt
their hardware and software to their needs; the design must scale, not
necessarily the implementations.

> To summarize, 49¾ bits limit is scary considering the expected
> lifetime of a standard, and we can expect programming languages that
> do not support 64-bit integers to start supporting them as we start
> using files of petabyte size.

I think we can address such issues by letting some protocol limits
depend on the implementation. Not doing this is what has caused some
issues in the past: Content-Length values larger than 2^32, for
example, made some implementations wrap and were used to mount request
smuggling attacks. But by insisting on boundary checking for the
critical parts of the protocol, based on the storage type (for
well-known types), we can at least help implementers remain safe.
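
As a rough sketch of what such a boundary check might look like
(hypothetical names and limits, and a real header parser would also
reject the signs and leading whitespace that strtoull() tolerates):

  #include <errno.h>
  #include <stdint.h>
  #include <stdlib.h>

  /* Example limit: a build using 32-bit internal byte counters. */
  #define MAX_BODY_SIZE UINT32_MAX

  /* Parse a decimal length, rejecting anything that does not fit the
   * local storage type instead of letting the value wrap.
   */
  static int parse_body_len(const char *s, uint64_t *out)
  {
      char *end;
      unsigned long long v;

      errno = 0;
      v = strtoull(s, &end, 10);
      if (errno == ERANGE || end == s || *end != '\0')
          return -1;            /* syntax error or beyond 64 bits */
      if (v > MAX_BODY_SIZE)
          return -1;            /* valid number, but beyond our limit */
      *out = (uint64_t)v;
      return 0;
  }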

> > If your 64 bit number is an identifier, the only valid operation
> > on it is "check for identity", and taking the detour over a decimal
> > representation is not only uncalled for, but also very inefficient
> > in terms of CPU cycles.
> >
> > The natural and most efficient format for such an identifier would
> > be base64 binary, but if for some reason it has to be decimal, say
> > convenience for human debuggers, one could prefix it with a "i" and
> > send it as a label.
> 
> Requiring the use of base64 goes against the merit of using a textual
> representation. The reason we use textual representation is because it
> is easy for us to read and use. On most systems, 64-bit IDs are
> represented as numbers. So people would want to transmit them in the
> same representation over HTTP as well. So to me it seems that it is
> whether we want 64-bit integers to be sent as numbers or strings (or
> labels). That is the reason why I only compared the two options in my
> previous mail.

There is also the option, for implementers, of treating such
identifiers as arrays of 32-bit words, since they don't need to
perform arithmetic on them. This is something we can explain in a spec
(i.e. how to parse identifiers in general).
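
As a rough sketch (hypothetical names): an implementation without
native 64-bit integers can keep such an identifier as two 32-bit
words, since identity checking is the only operation it ever needs:

  #include <stdint.h>

  /* A 64-bit identifier kept as two 32-bit words: no arithmetic is
   * ever performed on it, so equality is all we have to provide.
   */
  struct id64 {
      uint32_t hi;
      uint32_t lo;
  };

  static int id64_equal(const struct id64 *a, const struct id64 *b)
  {
      return a->hi == b->hi && a->lo == b->lo;
  }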

> In this respect, another issue we should consider is that we can more
> effectively compress the data if we know that it is a number
> (comparing to compressing it as a text or a label).

Yep, I like the principle of variable-length integers. It's not very
efficient in CPU cycles, but it's still much more efficient than any
text-based representation, since all code points are valid and no
syntax validation is required. And the benefits are huge for all small
elements. We did this in haproxy's peers protocol (used to synchronize
internal tables between multiple nodes) because the most common types
exchanged were server identifiers (typically values lower than 10 for
95% of deployments) and incremental counters (up to 64-bit byte
counts), and we didn't want to use multiple types. By proceeding like
this we can ensure that implementations are not very difficult to
write and can accept limitations depending on their targeted use
cases. And the bandwidth remains as small as possible.
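
For illustration, here is a rough sketch of the general idea (a
classic base-128 encoding with hypothetical names, not haproxy's
exact wire format): small values such as server identifiers fit in a
single byte, while the full 64-bit range remains representable:

  #include <stddef.h>
  #include <stdint.h>

  /* Encode an unsigned integer as a base-128 varint: 7 payload bits
   * per byte, the high bit marking that more bytes follow. Values
   * below 128 take one byte; a full 64-bit value takes at most 10.
   * Returns the number of bytes written.
   */
  static size_t varint_encode(uint64_t v, uint8_t *out)
  {
      size_t n = 0;

      while (v >= 0x80) {
          out[n++] = (uint8_t)(v | 0x80);
          v >>= 7;
      }
      out[n++] = (uint8_t)v;
      return n;
  }

  /* Decode; returns bytes consumed, or 0 on truncated input. */
  static size_t varint_decode(const uint8_t *in, size_t len, uint64_t *v)
  {
      uint64_t acc = 0;
      unsigned shift = 0;
      size_t n = 0;

      while (n < len && shift < 64) {
          acc |= (uint64_t)(in[n] & 0x7f) << shift;
          if (!(in[n++] & 0x80)) {
              *v = acc;
              return n;
          }
          shift += 7;
      }
      return 0;
  }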

By the way, it is important to keep in mind that a data type is not
necessarily related to the programming language's internal
representation. IP addresses are not numbers, identifiers are not
numbers, even though they can often be represented as such for
convenience. Numbers have a fairly different distribution, with far
more small values than large ones. Identifiers (and addresses), on the
contrary, are more or less uniformly distributed and do not benefit
from variable-length compression.

Cheers,
Willy
