- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Sun, 28 May 2023 16:28:52 +0900
- To: Willy Tarreau <w@1wt.eu>, Julian Reschke <julian.reschke@gmx.de>
- Cc: ietf-http-wg@w3.org
Hello Willy, Julian, others,

There was a time (way back) when only the Basic Multilingual Plane (i.e. a 16-bit space) had characters assigned. That turned out not to be enough, but it had the desirable side effect of keeping things compact. In UTF-8, that space can be covered by at most 3 bytes per character, and there may have been implementations limited to 3 bytes because they assumed there would never be characters in the rest of the codespace.

UTF-8 itself was originally defined to use up to 6 bytes per character, because it covered the full 32-bit space of the early ISO 10646 drafts. There were definitely implementations that covered all of that space.

After some years, it became clear that a 16-bit space was not enough, but a 32-bit space was way too much. ISO and Unicode agreed on 17 planes of 16 bits each, leading to an overall code space from U+0000 to U+10FFFF. As a result, the definition of UTF-8 was restricted to at most 4 bytes per character (see RFC 3629, e.g. https://datatracker.ietf.org/doc/html/rfc3629#section-4, or your favorite Unicode version, or ISO 10646).

On 2023-05-28 14:05, Willy Tarreau wrote:
> On Sun, May 28, 2023 at 05:51:49AM +0200, Julian Reschke wrote:
>> AFAIU, the UTF-8 encoding/decoding function (sequence of code points to
>> octets and vice versa) never has changed (see
>> https://datatracker.ietf.org/doc/html/rfc3629#section-3). Am I missing
>> something here?

The actual mapping function, at the places it matters, indeed hasn't changed. But the domain and range have changed, from the early maximum of 6 bytes to the current maximum of 4 bytes.

Regards,   Martin.

> No you're indeed right. But I have clear memories of this "common"
> approach of iterating over a string as long as (c & 0xc0) == 0x80
> (which was the main concern) as well as the possibility of larger
> code sequences they didn't want to support (that was in early
> 2000/2001).
> I'm still seeing traces of this in the FSS-UTF proposal:
>
> https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
>
> Bits  Hex Min   Hex Max   Byte Sequence in Binary
> 1   7  00000000  0000007f  0vvvvvvv
> 2  11  00000080  000007FF  110vvvvv 10vvvvvv
> 3  16  00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
> 4  21  00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
> 5  26  00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
> 6  31  04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
>
> So maybe back then I only had to implement the 16-bit one and they
> later wanted to support the 21-bit one as well, I don't remember the
> exact details. But there's less risk if the standardized codes have
> a fixed maximum length, I agree. I just don't want to have to validate
> them when forwarding header fields ;-)
>
> Regards,
> willy
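[Not part of the original thread: a minimal decoder sketch illustrating the two points discussed above, namely the RFC 3629 restriction to at most 4 bytes and U+10FFFF, and the continuation-byte test (c & 0xc0) == 0x80 that Willy mentions. The function name `decode_utf8` is illustrative, not from any of the messages.]

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode bytes to code points per RFC 3629 (max 4 bytes, max U+10FFFF)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 1 byte:  U+0000..U+007F
            cp, extra = b, 0
        elif 0xC0 <= b <= 0xDF:      # 2 bytes: U+0080..U+07FF
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b <= 0xEF:      # 3 bytes: U+0800..U+FFFF (the BMP)
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b <= 0xF4:      # 4 bytes: U+10000..U+10FFFF
            cp, extra = b & 0x07, 3
        else:                        # 0xF5..0xFF: includes the old 5/6-byte lead bytes
            raise ValueError(f"invalid lead byte 0x{b:02x} at offset {i}")
        if i + extra >= len(data):
            raise ValueError("truncated sequence at end of input")
        for j in range(1, extra + 1):
            c = data[i + j]
            if (c & 0xC0) != 0x80:   # the continuation-byte test from the thread
                raise ValueError(f"bad continuation byte at offset {i + j}")
            cp = (cp << 6) | (c & 0x3F)
        # Reject overlong forms, surrogates, and anything past U+10FFFF.
        if cp < (0, 0x80, 0x800, 0x10000)[extra] or cp > 0x10FFFF \
                or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point U+{cp:04X}")
        out.append(cp)
        i += extra + 1
    return out
```

Under this restricted definition, the 5- and 6-byte rows of the FSS-UTF table above are simply invalid lead bytes, so a decoder never needs to handle them.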
Received on Sunday, 28 May 2023 07:29:02 UTC