Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis

Hello Willy, Julian, others,

There was a time (way back) when only the basic multilingual plane (i.e. 
a 16-bit space) had characters assigned. That turned out to not be 
enough, but it had the desirable side effect of keeping things compact. 
In UTF-8, that space can be covered by 3 bytes max per character, and 
there may have been some implementations limited to 3 bytes max because 
their authors thought there wouldn't be any characters in the rest of 
the codespace.

UTF-8 itself was defined to use up to 6 bytes per character, because it 
was covering the full 32-bit space of the early ISO 10646 drafts. There 
were definitely implementations that covered all that space.

After some years, it became clear that a 16-bit space was not enough, 
but a 32-bit space was way too much. ISO and Unicode agreed on 17 planes 
of 16 bits, leading to an overall code space from U+0000 to U+10FFFF. As 
a result, the definition of UTF-8 was restricted to 4 bytes max per 
character (see RFC 3629, e.g. 
https://datatracker.ietf.org/doc/html/rfc3629#section-4, or your 
favorite Unicode version, or ISO 10646).
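
To make the ranges concrete, here is a minimal Python sketch of how many 
bytes UTF-8, as restricted by RFC 3629, needs for a given code point. The 
function name utf8_len is made up for illustration only; it is not taken 
from any library or from the draft.

```python
def utf8_len(cp: int) -> int:
    """Bytes needed to encode code point cp in RFC 3629 UTF-8.

    Hypothetical helper for illustration, not a library API.
    """
    if cp < 0 or cp > 0x10FFFF:
        # Beyond the 17 planes: no longer encodable at all.
        raise ValueError("outside the code space U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and excluded by RFC 3629.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        # The whole basic multilingual plane fits in 3 bytes.
        return 3
    # Planes 1 to 16 (U+10000..U+10FFFF) take 4 bytes.
    return 4
```

For any valid code point this should agree with 
len(chr(cp).encode("utf-8")) in Python.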

On 2023-05-28 14:05, Willy Tarreau wrote:
> On Sun, May 28, 2023 at 05:51:49AM +0200, Julian Reschke wrote:

>> AFAIU, the UTF-8 encoding/decoding function (sequence of code points to
>> octets and vice versa) never has changed (see
>> https://datatracker.ietf.org/doc/html/rfc3629#section-3). Am I missing
>> something here?

The actual mapping function, at the places where it matters, indeed 
hasn't changed. But the domain and range have changed from the early 
max 6 bytes to the current max 4 bytes.
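
As a rough sketch of what this means for a decoder, the Python below 
derives the sequence length from the lead byte. The function name 
lead_len and the legacy flag are assumptions made up for this 
illustration; the flag only mimics the older, wider definition and is not 
something a current decoder should accept.

```python
def lead_len(b: int, legacy: bool = False) -> int:
    """Sequence length implied by lead byte b (hypothetical helper).

    With legacy=False, only lead bytes valid under RFC 3629 are accepted.
    With legacy=True, the lead bytes of the older definition (up to
    6 bytes, 31 bits) are accepted as well, purely for comparison.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte")
    if b <= 0x7F:
        return 1
    if 0xC2 <= b <= 0xDF:
        return 2          # 0xC0 and 0xC1 could only start overlong forms
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF4:
        return 4          # 0xF4 is the last lead byte that can reach U+10FFFF
    if legacy:
        if 0xF5 <= b <= 0xF7:
            return 4      # 4-byte forms above U+10FFFF
        if 0xF8 <= b <= 0xFB:
            return 5
        if 0xFC <= b <= 0xFD:
            return 6
    # Continuation bytes (0x80..0xBF) and everything else land here.
    raise ValueError(f"invalid UTF-8 lead byte 0x{b:02X}")
```

The 1-byte to 4-byte branches are identical in both modes; only the set 
of accepted lead bytes shrinks, which is the sense in which the mapping 
stayed the same while its domain and range changed.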

Regards,   Martin.

> No you're indeed right. But I have clear memories of this "common"
> approach of iterating over a string as long as (c & 0xc0) == 0x80
> (which was the main concern) as well as the possibility of larger
> code sequences they didn't want to support (that was in early
> 2000/2001). I'm still seeing traces of this in the FSS-UTF proposal:
> 
>    https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
> 
>       Bits  Hex Min  Hex Max  Byte Sequence in Binary
>    1    7  00000000 0000007f 0vvvvvvv
>    2   11  00000080 000007FF 110vvvvv 10vvvvvv
>    3   16  00000800 0000FFFF 1110vvvv 10vvvvvv 10vvvvvv
>    4   21  00010000 001FFFFF 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
>    5   26  00200000 03FFFFFF 111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
>    6   31  04000000 7FFFFFFF 1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
> 
> So maybe back then I only had to implement the 16-bit one and they
> later wanted to support the 21-bit one as well, I don't remember the
> exact details. But there's less risk if the standardized codes have
> a fixed maximum length, I agree. I just don't want to have to validate
> them when forwarding header fields ;-)
> 
> Regards,
> willy
> 

Received on Sunday, 28 May 2023 07:29:02 UTC