- From: Mark Nottingham <mnot@mnot.net>
- Date: Thu, 2 Nov 2017 10:42:53 +1100
- To: Willy Tarreau <w@1wt.eu>
- Cc: Kazuho Oku <kazuhooku@gmail.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, HTTP Working Group <ietf-http-wg@w3.org>
Just a thought; maybe we shouldn't be defining "numbers" here, but instead "i32" or similar.

The intent of Structured Headers -- in my mind -- is not to address every conceivable use case for HTTP headers; rather, it's to address the 80% case. There will always be people who need/want bespoke formats. If we can hit the 80% case (or more) with n bits, people who need more can use another format -- and if there's enough demand, we can add a different structure for that.

Cheers,

> On 1 Nov 2017, at 5:41 pm, Willy Tarreau <w@1wt.eu> wrote:
>
> Hi Kazuho,
>
> On Wed, Nov 01, 2017 at 10:52:53AM +0900, Kazuho Oku wrote:
>> How long is the expected lifetime of Structured Headers? Assuming
>> that it would be used for 20 years (HTTP has been used for 20+
>> years, TCP for 40+ years), there is a fair chance that the 49¾-bit
>> limit is too small. Note that even if we switch to transferring
>> headers in binary-encoded forms, we might continue using Structured
>> Headers for the textual representation.
>>
>> Do we want to risk making _all_ our future implementations complex
>> in exchange for being friendly to _some_ programming languages
>> without 64-bit integers?
>
> That's an interesting question that cannot be solved just by a yes or
> a no. Making a language totally unable to implement a protocol
> (especially HTTP) is a no-go, and may even ignite proposals for
> alternatives to some parts. So we must at least ensure that it is
> reasonably possible to implement the protocol, even if that requires
> a little bit of effort and performance sucks, because people choosing
> such languages despite such limitations do it for convenience, and
> the languages will evolve to make their lives easier in the future.
> What must really be avoided is anything requiring a full-range 64-bit
> internal representation all the time. But if 64 bits are needed only
> for large files, most developers will consider that their
> implementation is sufficient for *their* use cases (even if it only
> handles 31 or 32 bits).
>
> This is what the text-based integer representation has brought us
> over the last two decades: good interoperability between
> implementations with very different limits. The ESP8266 in my alarm
> clock, with 50kB of RAM, might very well be using 16-bit integers for
> content-length, and despite this it's compatible with the rest of the
> world. Similarly, haproxy's chunk size parser was limited to 28 bits
> for a while and was only recently raised to 32 after hitting that
> limit once.
>
>> The other thing I would like to point out is that mandating support
>> for 64-bit integer fields does not necessarily mean that you cannot
>> easily represent such fields when using programming languages
>> without 64-bit integers.
>
> That only depends on whether all bits of the field are needed in
> general. If it's just a size, anyone can decide that limiting their
> implementation to 32 bits is OK for their purpose.
>
>> This is because there is no need to store an integer field using
>> integers. Decoders of Structured Headers can retain the
>> representation as a string (i.e. a series of digits), and
>> applications can convert it to a number when they want to use the
>> value for calculation.
>
> It can indeed be an option as well. A punishment, I would say.
>
>> Since the sizes of the files transmitted today do not exceed 1PB,
>> such an approach will not have any issues today. As people start
>> handling files larger than 1PB, they will figure out how to support
>> 64-bit integers anyway. Otherwise they cannot access the file!
>> Considering that, I would argue that we are unlikely to see issues
>> in the future as well, with programming languages that do not
>> support 64-bit integers _now_.
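As a concrete illustration of the approach Kazuho describes, here is a minimal C sketch; the sh_integer type and function names are hypothetical, not taken from any draft or implementation. The decoder keeps the validated digit string and converts on demand, rejecting values beyond its native range:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical decoded field: the parser has already validated that
 * `digits` is a non-empty run of ASCII digits, but never converted it. */
typedef struct {
    const char *digits;
    size_t      len;
} sh_integer;

/* Convert only when the application needs the value, refusing anything
 * beyond this implementation's native range (32 bits here; a 16-bit
 * platform would simply narrow the type). Returns 0 on success, -1 if
 * the value does not fit. */
static int sh_integer_to_u32(const sh_integer *f, uint32_t *out)
{
    uint32_t v = 0;

    for (size_t i = 0; i < f->len; i++) {
        uint32_t d = (uint32_t)(f->digits[i] - '0');

        /* Would v * 10 + d overflow 32 bits? */
        if (v > (UINT32_MAX - d) / 10)
            return -1;
        v = v * 10 + d;
    }
    *out = v;
    return 0;
}
```

An implementation that only ever meets small values pays the conversion cost once per use, and out-of-range values are detected instead of silently wrapping.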
> I totally agree with this. I like to optimize for valid use cases,
> and in general use cases vary with implementations. Similarly, I want
> my code to be fast on fast machines (because people buy fast machines
> for performance) and small on resource-constrained machines. People
> adapt their hardware and software to their needs; the design must
> scale, not necessarily the implementations.
>
>> To summarize, the 49¾-bit limit is scary considering the expected
>> lifetime of a standard, and we can expect programming languages that
>> do not support 64-bit integers to start supporting them as we start
>> using files of petabyte size.
>
> I think we can solve such issues by specifying some protocol limits
> depending on implementations. Not doing this is what has caused
> issues in the past: content-lengths larger than 2^32 causing some
> implementations to wrap, for example, have been used to mount request
> smuggling attacks. By insisting on boundary checking for the critical
> parts of the protocol depending on the storage type (for well-known
> types), we can at least help implementers remain safe.
>
>>> If your 64-bit number is an identifier, the only valid operation on
>>> it is "check for identity", and taking a detour through a decimal
>>> representation is not only uncalled for, but also very inefficient
>>> in terms of CPU cycles.
>>>
>>> The natural and most efficient format for such an identifier would
>>> be base64 binary, but if for some reason it has to be decimal, say
>>> for the convenience of human debuggers, one could prefix it with an
>>> "i" and send it as a label.
>>
>> Requiring the use of base64 goes against the merit of using a
>> textual representation. The reason we use a textual representation
>> is that it is easy for us to read and use. On most systems, 64-bit
>> IDs are represented as numbers, so people would want to transmit
>> them in the same representation over HTTP as well. So to me the
>> question is whether we want 64-bit integers to be sent as numbers or
>> as strings (or labels). That is the reason why I only compared those
>> two options in my previous mail.
>
> There is also the option of treating such identifiers as arrays of
> 32-bit words, since implementers don't need to perform arithmetic on
> them. This is something we can explain in a spec (i.e. how to parse
> identifiers in general).
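As a sketch of what such guidance could look like (the id64 names below are invented for illustration), the decimal digits can be accumulated into 16-bit limbs using only 32-bit intermediates, and identity can be checked without any 64-bit arithmetic:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* A 64-bit identifier held as four 16-bit limbs (least significant
 * first), so every intermediate result fits comfortably in 32 bits. */
typedef struct { uint16_t limb[4]; } id64;

/* Parse a decimal string without ever forming a 64-bit integer:
 * multiply the limbs by 10 and add the next digit, propagating the
 * carry. Returns 0 on success, -1 on bad input or overflow. */
static int id64_parse(const char *s, size_t len, id64 *out)
{
    memset(out, 0, sizeof(*out));
    for (size_t i = 0; i < len; i++) {
        if (s[i] < '0' || s[i] > '9')
            return -1;
        uint32_t carry = (uint32_t)(s[i] - '0');
        for (int j = 0; j < 4; j++) {
            uint32_t t = (uint32_t)out->limb[j] * 10 + carry;
            out->limb[j] = (uint16_t)(t & 0xFFFF);
            carry = t >> 16;
        }
        if (carry)
            return -1;          /* value exceeds 64 bits */
    }
    return 0;
}

/* The only operation an identifier needs: identity. */
static int id64_eq(const id64 *a, const id64 *b)
{
    return memcmp(a->limb, b->limb, sizeof(a->limb)) == 0;
}
```

Since identity is the only required operation, the limbs never need to be recombined into a single wide integer.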
>> In this respect, another issue we should consider is that we can
>> more effectively compress the data if we know that it is a number
>> (compared to compressing it as text or a label).
>
> Yep, I like the principle of variable-length integers. It's not very
> efficient in CPU cycles, but it is still much more efficient than any
> text-based representation, because when all code points are valid it
> doesn't require syntax validation. And the benefits are huge for all
> small elements. We did this in haproxy's peers protocol (used to
> synchronize internal tables between multiple nodes) because the most
> common types exchanged were server identifiers (typically values
> lower than 10 for 95% of deployments) and incremental counters (up to
> 64-bit byte counts), and we didn't want to use multiple types. By
> proceeding like this we can ensure that implementations are not very
> difficult and can accept limitations depending on their targeted use
> cases. And the bandwidth remains as small as possible.
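For illustration, a minimal sketch of the general idea, using the common LEB128/protobuf-style layout (7 payload bits per byte, with the high bit as a continuation flag); haproxy's peers protocol uses its own variant, so this is not its actual wire format:

```c
#include <stdint.h>
#include <stddef.h>

/* Encode v as a variable-length integer; buf needs at most 10 bytes
 * for a 64-bit value. Returns the number of bytes written. */
static size_t varint_encode(uint64_t v, uint8_t *buf)
{
    size_t n = 0;
    while (v >= 0x80) {
        buf[n++] = (uint8_t)(v | 0x80);  /* low 7 bits + continuation */
        v >>= 7;
    }
    buf[n++] = (uint8_t)v;               /* final byte, high bit clear */
    return n;
}

/* Decode; returns bytes consumed, or -1 if truncated or too long. */
static int varint_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        v |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
        if (!(buf[i] & 0x80)) {
            *out = v;
            return (int)(i + 1);
        }
    }
    return -1;
}
```

Values below 128 cost a single byte, which is exactly why such a scheme pays off for small server identifiers while still reaching 64-bit counters.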
> By the way, it is important to keep in mind that a data type is not
> necessarily related to the programming language's internal
> representation. IP addresses are not numbers, and identifiers are not
> numbers, even though they can often be represented as such for
> convenience. Numbers have a fairly different distribution, with far
> more small values than large ones. Identifiers (and addresses), by
> contrast, are more or less uniformly distributed and do not benefit
> from variable-length compression.
>
> Cheers,
> Willy

--
Mark Nottingham   https://www.mnot.net/

Received on Wednesday, 1 November 2017 23:43:23 UTC