- From: Kazuho Oku <kazuhooku@gmail.com>
- Date: Thu, 2 Nov 2017 10:23:20 +0900
- To: Mark Nottingham <mnot@mnot.net>
- Cc: Willy Tarreau <w@1wt.eu>, Poul-Henning Kamp <phk@phk.freebsd.dk>, HTTP Working Group <ietf-http-wg@w3.org>
Hi Mark,

Thank you for the response.

2017-11-02 8:42 GMT+09:00 Mark Nottingham <mnot@mnot.net>:
> Just a thought; maybe we shouldn't be defining "numbers" here, but
> instead "i32" or similar.

I am not sure that is a good solution to the issue, even granting that
the intent of Structured Headers is to address the 80% case.

IMO, we need at least one representation that can carry the size of a
file, which cannot always be represented as an i32 value. So the
introduction of sized types (i.e. i32, i64) means that we would need
_two_ decoders for numbers, instead of one.

My question here is what the merit of having two decoders is.

Consider the case of a memory-constrained HTTP client that can only
handle int32_t. When it sees a content-length value in i64 form (e.g.
`content-length: 1234567890123u64`), it will fail to handle the
response. That is exactly the same as what we see now with the use of
numbers without type specifiers (e.g. `content-length: 1234567890123`).
So I do not see why we would want multiple number types (with
different limits).

Am I missing something here; e.g., a possibility of more graceful
error handling that can only be achieved through the introduction of
sized types?

> The intent of Structured Headers -- in my mind -- is not to address
> every conceivable use case for HTTP headers; rather, it's to address
> the 80% case. There will always be people who need/want bespoke
> formats. If we can hit the 80% case (or more) with n bits, people
> who need more can use another format -- and if there's enough
> demand, we can add a different structure for that.
>
> Cheers,
>
>
>> On 1 Nov 2017, at 5:41 pm, Willy Tarreau <w@1wt.eu> wrote:
>>
>> Hi Kazuho,
>>
>> On Wed, Nov 01, 2017 at 10:52:53AM +0900, Kazuho Oku wrote:
>>> How long is the expected lifetime of Structured Headers?
>>> Assuming that it would be used for 20 years (HTTP has been used for
>>> 20+ years, TCP for 40+ years), there is a fair chance that the 49¾
>>> bit limit is too small. Note that even if we switch to transferring
>>> headers in binary-encoded form, we might continue using Structured
>>> Headers for the textual representation.
>>>
>>> Do we want to risk making _all_ our future implementations complex
>>> in exchange for being friendly to _some_ programming languages
>>> without 64-bit integers?
>>
>> That's an interesting question that cannot be answered with a simple
>> yes or no. Making a language totally unable to implement a protocol
>> (especially HTTP) is a no-go, and may even ignite proposals of
>> alternatives for some parts. So we must at least ensure that it is
>> reasonably possible to implement the protocol, even if that requires
>> a bit of effort and performance suffers, because people choosing
>> such languages despite these limitations do so for convenience, and
>> the languages will evolve to make their lives easier in the future.
>> What must really be avoided is anything requiring a full-range
>> 64-bit internal representation all the time. But if 64 bits are
>> needed only for large files, most developers will consider their
>> implementation sufficient for *their* use cases (even if limited to
>> 31 or 32 bits).
>>
>> This is what the text-based integer representation has brought us
>> over the last two decades: good interoperability between
>> implementations with very different limits. The ESP8266 in my alarm
>> clock, with 50kB of RAM, might very well be using 16-bit integers
>> for content-length, and despite this it's compatible with the rest
>> of the world. Similarly, haproxy's chunk size parser was limited to
>> 28 bits for a while, and was only recently raised to 32 after
>> hitting this limit once.
>>
>>> The other thing I would like to point out is that mandating
>>> support for 64-bit integer fields does not necessarily mean that
>>> you cannot easily represent such fields in programming languages
>>> without 64-bit integers.
>>
>> That only depends on whether all bits of the fields are always
>> needed in general. If it's just a size, anyone can decide that
>> limiting their implementation to 32 bits is OK for their purpose.
>>
>>> This is because there is no need to store an integer field using
>>> integers. Decoders of Structured Headers can retain the
>>> representation as a string (i.e. a series of digits), and
>>> applications can convert it to a number when they want to use the
>>> value for calculation.
>>
>> It can indeed be an option as well. A punishment, I would say.
>>
>>> Since the sizes of the files transmitted today do not exceed 1PB,
>>> such an approach will not have any issues today. As
>>> implementations start handling files larger than 1PB, they will
>>> figure out how to support 64-bit integers anyway. Otherwise they
>>> cannot access the file! Considering that, I would argue that we
>>> are unlikely to see issues in the future either, with programming
>>> languages that do not support 64-bit integers _now_.
>>
>> I totally agree with this. I like to optimize for valid use cases,
>> and in general use cases vary with implementations. Similarly, I
>> want my code to be fast on fast machines (because people buy fast
>> machines for performance) and small on resource-constrained
>> machines. People adapt their hardware and software to their needs;
>> the design must scale, not necessarily the implementations.
>>
>>> To summarize, the 49¾ bit limit is scary considering the expected
>>> lifetime of a standard, and we can expect programming languages
>>> that do not support 64-bit integers to start supporting them as we
>>> start using files of petabyte size.
>>
>> I think we can solve such issues by specifying protocol limits that
>> depend on implementations. Not doing this is what has caused issues
>> in the past: content-length values larger than 2^32 causing some
>> implementations to wrap have, for example, been used to mount
>> request smuggling attacks. But by insisting on boundary checking
>> for the critical parts of the protocol depending on the storage
>> type (for well-known types), we can at least help implementers
>> remain safe.
>>
>>>> If your 64 bit number is an identifier, the only valid operation
>>>> on it is "check for identity", and taking the detour over a
>>>> decimal representation is not only uncalled for, but also very
>>>> inefficient in terms of CPU cycles.
>>>>
>>>> The natural and most efficient format for such an identifier
>>>> would be base64 binary, but if for some reason it has to be
>>>> decimal, say, convenience for human debuggers, one could prefix
>>>> it with an "i" and send it as a label.
>>>
>>> Requiring the use of base64 goes against the merit of using a
>>> textual representation. The reason we use a textual representation
>>> is that it is easy for us to read and use. On most systems, 64-bit
>>> IDs are represented as numbers, so people would want to transmit
>>> them in the same representation over HTTP as well. So to me the
>>> question is whether we want 64-bit integers to be sent as numbers
>>> or as strings (or labels). That is the reason why I only compared
>>> the two options in my previous mail.
>>
>> There is also the option of treating such identifiers as arrays of
>> 32-bit values for implementers, since they don't need to perform
>> operations on them. This is something we can explain in a spec
>> (i.e. how to parse identifiers in general).
>>
>>> In this respect, another issue we should consider is that we can
>>> compress the data more effectively if we know that it is a number
>>> (compared to compressing it as text or a label).
>>
>> Yep, I like the principle of variable-length integers. It's not
>> very efficient in CPU cycles, but it is still much more so than any
>> text-based representation, since all code points are valid and no
>> syntax validation is required. And the benefits are huge for all
>> small elements. We did this in haproxy's peers protocol (used to
>> synchronize internal tables between multiple nodes) because the
>> most common types exchanged were server identifiers (typically
>> values lower than 10 for 95% of deployments) and incremental
>> counters (up to 64-bit byte counts), and we didn't want to use
>> multiple types. By proceeding like this we ensure that
>> implementations are not very difficult and can accept limitations
>> depending on their targeted use cases, while the bandwidth remains
>> as small as possible.
>>
>> By the way, it is important to keep in mind that a data type is not
>> necessarily related to the programming language's internal
>> representation. IP addresses are not numbers, and identifiers are
>> not numbers, even though they can often be represented as such for
>> convenience. Numbers have a fairly different distribution, with far
>> more small values than large ones. Identifiers (and addresses), on
>> the contrary, are more or less uniformly distributed and do not
>> benefit from variable-length compression.
>>
>> Cheers,
>> Willy

> --
> Mark Nottingham   https://www.mnot.net/

--
Kazuho Oku
Received on Thursday, 2 November 2017 01:23:44 UTC