- From: Willy Tarreau <w@1wt.eu>
- Date: Sun, 28 May 2023 10:03:09 +0200
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: Julian Reschke <julian.reschke@gmx.de>, ietf-http-wg@w3.org
On Sun, May 28, 2023 at 04:28:52PM +0900, Martin J. Dürst wrote:
> Hello Willy, Julian, others,
>
> There was a time (way back) when only the basic multilingual plane (i.e. a
> 16-bit space) had characters assigned. That turned out to not be enough, but
> it had the desirable side effect of keeping things compact. In UTF-8, that
> space can be covered by 3 bytes max per character, and it may have been that
> there were some implementations limited to 3 bytes max because they thought
> there wouldn't be any characters in the rest of the codespace.
>
> UTF-8 itself was defined to use up to 6 bytes per character, because it was
> covering the full 32-bit space of the early ISO-10646 drafts. There were
> definitely implementations that covered all that space.
>
> After some years, it became clear that a 16-bit space was not enough, but a
> 32-bit space was way too much. ISO and Unicode agreed on 17 planes of 16
> bits, leading to an overall code space from U+0000 to U+10FFFF. As a result,
> the definition of UTF-8 was restricted to 4 bytes max per character (see RFC
> 3629, e.g. https://datatracker.ietf.org/doc/html/rfc3629#section-4, or your
> favorite Unicode version, or ISO 10646).

Thanks for the background Martin ;-)

Willy
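
For readers of the archive, the byte counts mentioned in the quoted message can be illustrated with a small sketch. It assumes the RFC 3629 definition of UTF-8 (code space U+0000 to U+10FFFF, surrogates excluded); the function name `utf8_length` is hypothetical and chosen only for this illustration, not taken from the thread.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes RFC 3629 UTF-8 needs for one code point.

    Illustrative helper only; the name and error messages are assumptions.
    """
    if not 0 <= code_point <= 0x10FFFF:
        # Beyond the 17 planes agreed on by ISO and Unicode.
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0x80:
        return 1  # ASCII range
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # rest of the basic multilingual plane
    return 4      # supplementary planes, up to U+10FFFF


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0xFFFF, 0x10000, 0x10FFFF):
        # Cross-check against Python's built-in encoder.
        assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

An implementation limited to 3 bytes, as described in the quoted message, would correspond to rejecting everything at or above U+10000 in this sketch.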
Received on Sunday, 28 May 2023 08:03:21 UTC