Re: New Version Notification for draft-nottingham-structured-headers-00.txt

Hi Mark,

2017-11-02 12:21 GMT+09:00 Mark Nottingham <mnot@mnot.net>:
> Hi Kazuho,
>
>> On 2 Nov 2017, at 12:23 pm, Kazuho Oku <kazuhooku@gmail.com> wrote:
>>
>> Hi Mark,
>>
>> Thank you for the response.
>>
>> 2017-11-02 8:42 GMT+09:00 Mark Nottingham <mnot@mnot.net>:
>>> Just a thought; maybe we shouldn't be defining "numbers" here, but instead "i32" or similar.
>>
>> I am not sure if that is a good solution to the issue, even given
>> that the intent of Structured Headers is to address the 80% case.
>>
>> IMO, we need to at least have a representation that can carry the size
>> of a file, which cannot always be represented as an i32 value. So the
>> introduction of a sized type (e.g., i32, i64) means that we would need
>> to have _two_ decoders for numbers, instead of one.
>>
>> My question here is what the merit of having two decoders is.
>
> If you need to support integers of such large size, I think we're already talking about two decoders, because we need to support floats too.

I can understand your point that we would be having at least two
parsers, one for integers and one for floats (or three, if we consider
positive and negative integers as different types). But still, I am
not sure that this justifies having multiple parsers for different
sizes of integer type, since, as I stated, I do not see any practical
merit in doing so (e.g., an interoperability improvement).

>> Consider the case of a memory constrained HTTP client that can only
>> handle int32_t. When it sees a content-length value in i64 form (e.g:
>> `content-length: 1234567890123u64`), it would fail to handle the
>> response. That's exactly the same as what we see now with the use of
>> numbers without type specifiers (e.g. `content-length:
>> 1234567890123`).
>>
>> So, I do not see why you want to have multiple number types (with
>> different limits).
>>
>> Am I missing something here; e.g., a possibility of having a more
>> graceful error handling, that can only be achieved through the
>> introduction of sized types?
>
> It's an interesting question. We started down this path because there are interoperability problems in JSON with large numbers, and we wanted to avoid those. We're not JSON-based any more (although in some use cases, I suspect JSON will be used as an alternative serialisation).

I can understand the hesitation to require support for integer values
exceeding 2^52, due to the fact that they cannot be handled correctly
using the native number type of JavaScript (or possibly of some other
programming languages; are there any?).

But as Alex pointed out [1], JavaScript is likely going to support
bigint. I would also assume that programming languages that lack
64-bit integer support now will start supporting it as (or before) we
start seeing files above 2^52 bits in size.
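For concreteness, the precision cliff can be demonstrated directly (sketch using Python floats, which are the same IEEE-754 doubles JavaScript uses for all numbers):

```python
# IEEE-754 doubles represent integers exactly only up to 2**53;
# beyond that, adjacent integers collapse onto the same value.
EXACT_LIMIT = 2 ** 53

assert float(EXACT_LIMIT - 1) == EXACT_LIMIT - 1       # still exact
assert float(EXACT_LIMIT + 1) == float(EXACT_LIMIT)    # precision lost

# A large Content-Length would silently change value in such a language:
big = 12345678901234567                                # odd, > 2**53
assert int(float(big)) != big
```

This is the interop hazard behind the hesitation above; bigint support removes it entirely.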

> Most of the interop problems come around the edges of floats. Straw-man proposal: we could define float as 15 digits of precision (as we do now), and i64 for integers. That would give us good interop and a good range of capability. WDYT?

I'd be fine with the limit for integers becoming 64 bits (though I am
not sure we even want a limit here).

For floats, let me first state that I do not know if we even need to
support them. To me it seems worth considering dropping them, as PHK
suggested [2].

That said, let me point out that you might not need to define the
maximum number of digits permitted in the fraction part. You are in
any case expected to have some error in floating-point math. The
reason we had issues with JSON Header Field Values is that they
permitted the use of exponents (i.e. "eNN") [3]. Without support for
exponents, such issues would not arise in Structured Headers.
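A decimal grammar without exponents is also trivial to enforce; a minimal sketch (hypothetical names, not the draft's ABNF):

```python
import re

# Sketch: a decimal parser that rejects exponent forms ("eNN"),
# which were the source of the JSON interop issues discussed above.
DECIMAL = re.compile(r'^-?\d+(?:\.\d+)?$')

def parse_decimal(s: str) -> float:
    if DECIMAL.match(s) is None:
        raise ValueError("invalid decimal: %r" % s)
    return float(s)
```

With this grammar, values like `1e300` never reach the float converter, so the magnitude extremes where implementations disagree are simply unrepresentable on the wire.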

[1] https://lists.w3.org/Archives/Public/ietf-http-wg/2017OctDec/0134.html
[2] https://lists.w3.org/Archives/Public/ietf-http-wg/2017OctDec/0140.html
[3] https://lists.w3.org/Archives/Public/ietf-http-wg/2016JulSep/0154.html

>>
>>> The intent of Structured Headers -- in my mind -- is not to address every conceivable use case for HTTP headers; rather it's to address the 80% case. There will always be people who need/want bespoke formats. If we can hit the 80% case (or more) with n bits, people who need more can use another format -- and if there's enough demand, we can add a different structure for that.
>>>
>>> Cheers,
>>>
>>>
>>>> On 1 Nov 2017, at 5:41 pm, Willy Tarreau <w@1wt.eu> wrote:
>>>>
>>>> Hi Kazuho,
>>>>
>>>> On Wed, Nov 01, 2017 at 10:52:53AM +0900, Kazuho Oku wrote:
>>>>> How long is the expected lifetime of Structured Headers? Assuming that
>>>>> it would be used for 20 years (HTTP has been used for 20+ years, TCP
>>>>> is used for 40+ years), there is fair chance that the 49¾ bits limit
>>>>> is too small. Note that even if we switch to transferring headers in
>>>>> binary-encoded forms, we might continue using Structured Headers for
>>>>> textual representation.
>>>>>
>>>>> Do we want to risk making _all_ our future implementations complex in
>>>>> exchange of being friendly to _some_ programming languages without
>>>>> 64-bit integers?
>>>>
>>>> That's an interesting question that cannot be solved just by a Yes or a
>>>> No. Making a language totally unable to implement a protocol (especially
>>>> HTTP) is a no-go, and may even ignite the proposal of alternatives for
>>>> some parts. So we must at least ensure that it is reasonably possible
>>>> to implement the protocol even if that requires a little bit of effort
>>>> and if performance sucks, because people choosing such languages despite
>>>> such limitations will do it only for convenience and the languages will
>>>> evolve to make their lives easier in the future. What must really be
>>>> avoided is everything requiring a full-range 64-bit internal
>>>> representation all the time. But if 64-bit are needed only for large
>>>> files, most developers will consider that their implementation is
>>>> sufficient for *their* use cases (even if only 31 or 32 bits).
>>>>
>>>> This is what the text-based integer representation has brought us over
>>>> the last two decades: good interoperability between implementations with
>>>> very different limits. The ESP8266 in my alarm clock with 50kB of RAM
>>>> might very well be using 16-bit integers for content-length and despite
>>>> this it's compatible with the rest of the world. Similarly haproxy's
>>>> chunk size parser used to be limited to 28 bits for a while and was
>>>> only recently raised to 32 after hitting this limit once.
>>>>
>>>>> The other thing I would like to point out is that mandating support
>>>>> for 64-bit integer fields does not necessarily mean that you cannot
>>>>> easily represent such kind of fields when using the programming
>>>>> languages without 64-bit integers.
>>>>
>>>> It only depends if all bits of the fields are always needed or not in
>>>> general. If it's just a size, anyone can decide that limiting their
>>>> implementation to 32-bit can be OK for their purpose.
>>>>
>>>>> This is because there is no need to store an integer field using
>>>>> integers. Decoders of Structured Headers can retain the representation
>>>>> as a string (i.e. series of digits), and applications can convert them
>>>>> to numbers when they want to use the value for calculation.
>>>>
>>>> It can indeed be an option as well. A punishment I would say.
>>>>
>>>>> Since the sizes of the files transmitted today do not exceed 1PB, such
>>>>> approach will not have any issues today. As they start handling files
>>>>> larger than 1PB, they will figure out how to support 64-bit integers
>>>>> anyways. Otherwise they cannot access the file! Considering that, I
>>>>> would argue that we are unlikely to see issues in the future as well,
>>>>> with programming languages that do not support 64-bit integers _now_.
>>>>
>>>> I totally agree with this. I like to optimize for valid use cases, and
>>>> in general use cases vary with implementations. Similarly I want my code
>>>> to be fast on fast machines (because people buy fast machines for
>>>> performance) and small on resource constrained machines. People adapt
>>>> their hardware and software to their needs, the design must scale, not
>>>> necessarily the implementations.
>>>>
>>>>> To summarize, 49¾ bits limit is scary considering the expected
>>>>> lifetime of a standard, and we can expect programming languages that
>>>>> do not support 64-bit integers to start supporting them as we start
>>>>> using files of petabyte size.
>>>>
>>>> I think we can solve such issues by specifying some protocol limits
>>>> depending on implementations. Not doing this is what has caused some
>>>> issues in the past. Content-lengths larger than 2^32 causing some
>>>> implementations to wrap for example have been used to cause request
>>>> smuggling attacks. But by insisting on boundary checking for the
>>>> critical parts of the protocol depending on the storage type (for
>>>> well known types), we can at least help implementers remain safe.
>>>>
>>>>>> If your 64 bit number is an identifier, the only valid operation
>>>>>> on it is "check for identity", and taking the detour over a decimal
>>>>>> representation is not only uncalled for, but also very inefficient
>>>>>> in terms of CPU cycles.
>>>>>>
>>>>>> The natural and most efficient format for such an identifier would
>>>>>> be base64 binary, but if for some reason it has to be decimal, say
>>>>>> convenience for human debuggers, one could prefix it with a "i" and
>>>>>> send it as a label.
>>>>>
>>>>> Requiring the use of base64 goes against the merit of using a textual
>>>>> representation. The reason we use textual representation is because it
>>>>> is easy for us to read and use. On most systems, 64-bit IDs are
>>>>> represented as numbers. So people would want to transmit them in the
>>>>> same representation over HTTP as well. So to me it seems that it is
>>>>> whether we want 64-bit integers to be sent as numbers or strings (or
>>>>> labels). That is the reason why I only compared the two options in my
>>>>> previous mail.
>>>>
>>>> There is also the option of considering such identifiers as arrays of
>>>> 32-bit for implementers, since they don't need to perform operations
>>>> on them. This is something we can explain in a spec (ie: how to parse
>>>> identifiers in general).
>>>>
>>>>> In this respect, another issue we should consider is that we can more
>>>>> effectively compress the data if we know that it is a number
>>>>> (comparing to compressing it as a text or a label).
>>>>
>>>> Yep, I like the principle of variable length integers. It's not very
>>>> efficient in CPU cycles but is still much more than any text-based
>>>> representation when all code points are valid as it doesn't require
>>>> syntax validation. But the benefits are huge for all small elements.
>>>> We did this in haproxy's peers protocol (used to synchronize internal
>>>> tables between multiple nodes) because the most common types exchanged
>>>> were server identifiers (typically values lower than 10 for 95% of
>>>> deployments), and incremental counters (up to 64-bit byte counts) and
>>>> we didn't want to use multiple types. By proceeding like this we can
>>>> ensure that implementations are not very difficult and can accept
>>>> limitations depending on their targeted use cases. And the bandwidth
>>>> remains as small as possible.
>>>>
>>>> By the way it is important to keep in mind that a data type is not
>>>> necessarily related to the programming language's internal representation.
>>>> IP addresses are not numbers, identifiers are not numbers, even though
>>>> they can often be represented as such for convenience. Numbers have a
>>>> fairly different distribution with far more small values than large ones.
>>>> Identifiers (and addresses) on the opposite are more or less uniformly
>>>> distributed and do not benefit from variable length compression.
>>>>
>>>> Cheers,
>>>> Willy
>>>
>>> --
>>> Mark Nottingham   https://www.mnot.net/
>>>
>>
>>
>>
>> --
>> Kazuho Oku
>
> --
> Mark Nottingham   https://www.mnot.net/
>



-- 
Kazuho Oku

Received on Thursday, 2 November 2017 05:18:51 UTC