Re: New Version Notification for draft-nottingham-structured-headers-00.txt

Just a thought; maybe we shouldn't be defining "numbers" here, but instead "i32" or similar. 

The intent of Structured Headers -- in my mind -- is not to address every conceivable use case for HTTP headers; rather it's to address the 80% case. There will always be people who need/want bespoke formats. If we can hit the 80% case (or more) with n bits, people who need more can use another format -- and if there's enough demand, we can add a different structure for that.

Cheers,


> On 1 Nov 2017, at 5:41 pm, Willy Tarreau <w@1wt.eu> wrote:
> 
> Hi Kazuho,
> 
> On Wed, Nov 01, 2017 at 10:52:53AM +0900, Kazuho Oku wrote:
>> How long is the expected lifetime of Structured Headers? Assuming that
>> it would be used for 20 years (HTTP has been used for 20+ years, TCP
>> is used for 40+ years), there is fair chance that the 49¾ bits limit
>> is too small. Note that even if we switch to transferring headers in
>> binary-encoded forms, we might continue using Structured Headers for
>> textual representation.
>> 
>> Do we want to risk making _all_ our future implementations complex in
>> exchange for being friendly to _some_ programming languages without
>> 64-bit integers?
> 
> That's an interesting question that cannot be answered with just a yes
> or a no. Making a language totally unable to implement a protocol
> (especially HTTP) is a no-go, and may even spark proposals for
> alternatives to some parts. So we must at least ensure that it is
> reasonably possible to implement the protocol even if that requires a
> little bit of effort and even if performance suffers, because people
> choosing such languages despite such limitations do so for convenience,
> and the languages will evolve to make their lives easier in the future.
> What must really be avoided is anything requiring a full-range 64-bit
> internal representation all the time. But if 64 bits are needed only for
> large files, most developers will consider their implementation
> sufficient for *their* use cases (even if it only handles 31 or 32 bits).
> 
> This is what the text-based integer representation has brought us over
> the last two decades: good interoperability between implementations with
> very different limits. The ESP8266 in my alarm clock with 50kB of RAM
> might very well be using 16-bit integers for content-length and despite
> this it's compatible with the rest of the world. Similarly haproxy's
> chunk size parser used to be limited to 28 bits for a while and was
> only recently raised to 32 after hitting this limit once.
> 
>> The other thing I would like to point out is that mandating support
>> for 64-bit integer fields does not necessarily mean that you cannot
>> easily represent such kinds of fields when using programming
>> languages without 64-bit integers.
> 
> It only depends on whether all bits of the fields are always needed in
> general. If it's just a size, anyone can decide that limiting their
> implementation to 32 bits is OK for their purpose.
> 
>> This is because there is no need to store an integer field using
>> integers. Decoders of Structured Headers can retain the representation
>> as a string (i.e. series of digits), and applications can convert them
>> to numbers when they want to use the value for calculation.
> 
> It can indeed be an option as well. A punishment, I would say.
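> 
> A rough sketch of what such a decoder could look like, in TypeScript for
> the sake of the argument (the class and method names here are invented
> purely for illustration):
> 
>   // Keep the wire form as a string; convert on demand, refusing values
>   // that the language's number type cannot represent exactly.
>   class IntegerField {
>     constructor(private readonly text: string) {}
>     asString(): string { return this.text; }
>     asNumber(): number {
>       const n = Number(this.text);
>       if (!Number.isSafeInteger(n))
>         throw new RangeError("value exceeds exact integer range");
>       return n;
>     }
>   }
> 
> Callers that only compare or log the value never pay the conversion
> cost, nor do they ever hit the precision limit.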
> 
>> Since the sizes of the files transmitted today do not exceed 1PB, such
>> an approach will not have any issues today. As implementations start
>> handling files larger than 1PB, they will figure out how to support
>> 64-bit integers anyway -- otherwise they cannot access the file!
>> Considering that, I would argue that we are unlikely to see issues in
>> the future either, with programming languages that do not support
>> 64-bit integers _now_.
> 
> I totally agree with this. I like to optimize for valid use cases, and
> in general use cases vary with implementations. Similarly, I want my
> code to be fast on fast machines (because people buy fast machines for
> performance) and small on resource-constrained machines. People adapt
> their hardware and software to their needs; the design must scale, not
> necessarily the implementations.
> 
>> To summarize, the 49¾-bit limit is scary considering the expected
>> lifetime of a standard, and we can expect programming languages that
>> do not support 64-bit integers to start supporting them as we start
>> using files of petabyte size.
> 
> I think we can solve such issues by specifying protocol limits that
> implementations may narrow to fit their needs. Not doing this is what
> has caused some issues in the past: content-length values larger than
> 2^32, for example, caused some implementations to wrap, which has been
> used for request smuggling attacks. But by insisting on boundary
> checking for the critical parts of the protocol depending on the
> storage type (for well-known types), we can at least help implementers
> remain safe.
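> 
> For instance, a sketch of such a boundary check (the limit here is just
> whatever the implementation can represent exactly; nothing mandates this
> particular value):
> 
>   // Strict content-length parsing with an explicit bound check, so that
>   // oversized values are rejected instead of silently wrapping.
>   function parseContentLength(v: string): number {
>     if (!/^[0-9]+$/.test(v)) throw new Error("malformed content-length");
>     const n = Number(v);
>     if (!Number.isSafeInteger(n))
>       throw new Error("content-length beyond this implementation's limit");
>     return n;
>   }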
> 
>>> If your 64 bit number is an identifier, the only valid operation
>>> on it is "check for identity", and taking the detour over a decimal
>>> representation is not only uncalled for, but also very inefficient
>>> in terms of CPU cycles.
>>> 
>>> The natural and most efficient format for such an identifier would
>>> be base64 binary, but if for some reason it has to be decimal, say
>>> convenience for human debuggers, one could prefix it with an "i" and
>>> send it as a label.
>> 
>> Requiring the use of base64 goes against the merit of using a textual
>> representation. The reason we use a textual representation is that it
>> is easy for us to read and use. On most systems, 64-bit IDs are
>> represented as numbers, so people would want to transmit them in the
>> same representation over HTTP as well. So to me the question is
>> whether we want 64-bit integers to be sent as numbers or as strings
>> (or labels). That is the reason why I only compared the two options in
>> my previous mail.
> 
> There is also the option of considering such identifiers as arrays of
> 32-bit values, since implementers don't need to perform operations on
> them. This is something we can explain in a spec (i.e. how to parse
> identifiers in general).
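> 
> For instance, identity checks need no native 64-bit arithmetic at all.
> A sketch of decoding a decimal identifier into two 32-bit halves, digit
> by digit, so that no intermediate value ever comes close to 2^53 (the
> function name is invented for illustration):
> 
>   // Parse a decimal 64-bit identifier into [hi, lo] 32-bit halves using
>   // only arithmetic that stays far below 2^53.
>   function parseId64(s: string): [number, number] {
>     let hi = 0, lo = 0;
>     for (const ch of s) {
>       const d = ch.charCodeAt(0) - 48;
>       if (d < 0 || d > 9) throw new Error("malformed identifier");
>       lo = lo * 10 + d;                            // < 2^36
>       hi = hi * 10 + Math.floor(lo / 0x100000000); // < 2^36
>       lo = lo % 0x100000000;
>       if (hi >= 0x100000000) throw new RangeError("identifier overflow");
>     }
>     return [hi, lo];
>   }
> 
> Comparing two identifiers then reduces to comparing the two halves.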
> 
>> In this respect, another issue we should consider is that we can more
>> effectively compress the data if we know that it is a number
>> (compared to compressing it as text or a label).
> 
> Yep, I like the principle of variable-length integers. It's not very
> efficient in CPU cycles but it is still much more efficient than any
> text-based representation, and since all code points are valid it
> doesn't require syntax validation. The benefits are huge for all small
> elements. We did this in haproxy's peers protocol (used to synchronize
> internal tables between multiple nodes) because the most common types
> exchanged were server identifiers (typically values lower than 10 for
> 95% of deployments) and incremental counters (up to 64-bit byte
> counts), and we didn't want to use multiple types. By proceeding like
> this we can ensure that implementations are not very difficult and can
> accept limitations depending on their targeted use cases. And the
> bandwidth remains as small as possible.
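> 
> As a sketch, the idea looks like this (plain LEB128-style encoding for
> illustration; haproxy's actual peers encoding differs in its details):
> 
>   // Variable-length integer: 7 payload bits per byte, high bit set
>   // while more bytes follow. Small values (the common case) take one byte.
>   function encodeVarint(n: number): Uint8Array {
>     if (!Number.isSafeInteger(n) || n < 0) throw new RangeError("out of range");
>     const out: number[] = [];
>     do {
>       let b = n % 128;
>       n = Math.floor(n / 128);
>       if (n > 0) b |= 0x80; // continuation bit
>       out.push(b);
>     } while (n > 0);
>     return Uint8Array.from(out);
>   }
> 
> A server identifier below 10 costs a single byte on the wire, while a
> large counter still fits (up to 2^53 - 1 with plain JS numbers).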
> 
> By the way, it is important to keep in mind that a data type is not
> necessarily related to the programming language's internal
> representation. IP addresses are not numbers, identifiers are not
> numbers, even though they can often be represented as such for
> convenience. Numbers have a fairly different distribution, with far more
> small values than large ones. Identifiers (and addresses), by contrast,
> are more or less uniformly distributed and do not benefit from
> variable-length compression: a uniformly random 64-bit value almost
> always has its high bits set, so it almost always takes the maximum
> encoded length anyway.
> 
> Cheers,
> Willy

--
Mark Nottingham   https://www.mnot.net/
