- From: Kazuho Oku <kazuhooku@gmail.com>
- Date: Thu, 2 Nov 2017 10:23:20 +0900
- To: Mark Nottingham <mnot@mnot.net>
- Cc: Willy Tarreau <w@1wt.eu>, Poul-Henning Kamp <phk@phk.freebsd.dk>, HTTP Working Group <ietf-http-wg@w3.org>
Hi Mark,

Thank you for the response.

2017-11-02 8:42 GMT+09:00 Mark Nottingham <mnot@mnot.net>:
> Just a thought; maybe we shouldn't be defining "numbers" here, but
> instead "i32" or similar.

I am not sure that is a good solution to the issue, even granting that
the intent of Structured Headers is to address the 80% case.

IMO, we need at least one representation that can carry the size of a
file, which cannot always be represented as an i32 value. So the
introduction of sized types (i.e. i32, i64) means that we would need
_two_ decoders for numbers, instead of one.

My question here is what the merit of having two decoders is.

Consider the case of a memory-constrained HTTP client that can only
handle int32_t. When it sees a content-length value in i64 form (e.g.
`content-length: 1234567890123u64`), it will fail to handle the
response. That is exactly the same as what we see now with the use of
numbers without type specifiers (e.g. `content-length: 1234567890123`).
So I do not see why we would want multiple number types (with
different limits).

Am I missing something here; e.g., a possibility of more graceful
error handling that can only be achieved through the introduction of
sized types?

> The intent of Structured Headers -- in my mind -- is not to address
> every conceivable use case for HTTP headers; rather, it's to address
> the 80% case. There will always be people who need/want bespoke
> formats. If we can hit the 80% case (or more) with n bits, people
> who need more can use another format -- and if there's enough
> demand, we can add a different structure for that.
>
> Cheers,
>
>
>> On 1 Nov 2017, at 5:41 pm, Willy Tarreau <w@1wt.eu> wrote:
>>
>> Hi Kazuho,
>>
>> On Wed, Nov 01, 2017 at 10:52:53AM +0900, Kazuho Oku wrote:
>>> How long is the expected lifetime of Structured Headers?
>>> Assuming that it would be used for 20 years (HTTP has been used for
>>> 20+ years, TCP for 40+ years), there is a fair chance that the 49¾
>>> bit limit is too small. Note that even if we switch to transferring
>>> headers in binary-encoded form, we might continue using Structured
>>> Headers for the textual representation.
>>>
>>> Do we want to risk making _all_ our future implementations complex
>>> in exchange for being friendly to _some_ programming languages
>>> without 64-bit integers?
>>
>> That's an interesting question that cannot be answered with a simple
>> yes or no. Making a language totally unable to implement a protocol
>> (especially HTTP) is a no-go, and may even ignite proposals of
>> alternatives for some parts. So we must at least ensure that it is
>> reasonably possible to implement the protocol, even if that requires
>> a bit of effort and performance suffers, because people choosing
>> such languages despite these limitations do so for convenience, and
>> the languages will evolve to make their lives easier in the future.
>> What must really be avoided is anything requiring a full-range
>> 64-bit internal representation all the time. But if 64 bits are
>> needed only for large files, most developers will consider their
>> implementation sufficient for *their* use cases (even if limited to
>> 31 or 32 bits).
>>
>> This is what the text-based integer representation has brought us
>> over the last two decades: good interoperability between
>> implementations with very different limits. The ESP8266 in my alarm
>> clock, with 50kB of RAM, might very well be using 16-bit integers
>> for content-length, and despite this it's compatible with the rest
>> of the world. Similarly, haproxy's chunk size parser was limited to
>> 28 bits for a while, and was only recently raised to 32 after
>> hitting this limit once.
>>
>>> The other thing I would like to point out is that mandating
>>> support for 64-bit integer fields does not necessarily mean that
>>> you cannot easily represent such fields in programming languages
>>> without 64-bit integers.
>>
>> That only depends on whether all bits of the fields are always
>> needed in general. If it's just a size, anyone can decide that
>> limiting their implementation to 32 bits is OK for their purpose.
>>
>>> This is because there is no need to store an integer field using
>>> integers. Decoders of Structured Headers can retain the
>>> representation as a string (i.e. a series of digits), and
>>> applications can convert it to a number when they want to use the
>>> value for calculation.
>>
>> It can indeed be an option as well. A punishment, I would say.
>>
>>> Since the sizes of the files transmitted today do not exceed 1PB,
>>> such an approach will not have any issues today. As
>>> implementations start handling files larger than 1PB, they will
>>> figure out how to support 64-bit integers anyway. Otherwise they
>>> cannot access the file! Considering that, I would argue that we
>>> are unlikely to see issues in the future either, with programming
>>> languages that do not support 64-bit integers _now_.
>>
>> I totally agree with this. I like to optimize for valid use cases,
>> and in general use cases vary with implementations. Similarly, I
>> want my code to be fast on fast machines (because people buy fast
>> machines for performance) and small on resource-constrained
>> machines. People adapt their hardware and software to their needs;
>> the design must scale, not necessarily the implementations.
>>
>>> To summarize, the 49¾ bit limit is scary considering the expected
>>> lifetime of a standard, and we can expect programming languages
>>> that do not support 64-bit integers to start supporting them as we
>>> start using files of petabyte size.
>>
>> I think we can solve such issues by specifying protocol limits that
>> depend on implementations. Not doing this is what has caused issues
>> in the past: content-length values larger than 2^32 causing some
>> implementations to wrap have, for example, been used to mount
>> request smuggling attacks. But by insisting on boundary checking
>> for the critical parts of the protocol depending on the storage
>> type (for well-known types), we can at least help implementers
>> remain safe.
>>
>>>> If your 64 bit number is an identifier, the only valid operation
>>>> on it is "check for identity", and taking the detour over a
>>>> decimal representation is not only uncalled for, but also very
>>>> inefficient in terms of CPU cycles.
>>>>
>>>> The natural and most efficient format for such an identifier
>>>> would be base64 binary, but if for some reason it has to be
>>>> decimal, say, convenience for human debuggers, one could prefix
>>>> it with an "i" and send it as a label.
>>>
>>> Requiring the use of base64 goes against the merit of using a
>>> textual representation. The reason we use a textual representation
>>> is that it is easy for us to read and use. On most systems, 64-bit
>>> IDs are represented as numbers, so people would want to transmit
>>> them in the same representation over HTTP as well. So to me the
>>> question is whether we want 64-bit integers to be sent as numbers
>>> or as strings (or labels). That is the reason why I only compared
>>> the two options in my previous mail.
>>
>> There is also the option of treating such identifiers as arrays of
>> 32-bit values for implementers, since they don't need to perform
>> operations on them. This is something we can explain in a spec
>> (i.e. how to parse identifiers in general).
>>
>>> In this respect, another issue we should consider is that we can
>>> compress the data more effectively if we know that it is a number
>>> (compared to compressing it as text or a label).
>>
>> Yep, I like the principle of variable-length integers. It's not
>> very efficient in CPU cycles, but it is still much more so than any
>> text-based representation, since all code points are valid and no
>> syntax validation is required. And the benefits are huge for all
>> small elements. We did this in haproxy's peers protocol (used to
>> synchronize internal tables between multiple nodes) because the
>> most common types exchanged were server identifiers (typically
>> values lower than 10 for 95% of deployments) and incremental
>> counters (up to 64-bit byte counts), and we didn't want to use
>> multiple types. By proceeding like this we ensure that
>> implementations are not very difficult and can accept limitations
>> depending on their targeted use cases, while the bandwidth remains
>> as small as possible.
>>
>> By the way, it is important to keep in mind that a data type is not
>> necessarily related to the programming language's internal
>> representation. IP addresses are not numbers, and identifiers are
>> not numbers, even though they can often be represented as such for
>> convenience. Numbers have a fairly different distribution, with far
>> more small values than large ones. Identifiers (and addresses), on
>> the contrary, are more or less uniformly distributed and do not
>> benefit from variable-length compression.
>>
>> Cheers,
>> Willy

> --
> Mark Nottingham   https://www.mnot.net/

--
Kazuho Oku
Received on Thursday, 2 November 2017 01:23:44 UTC