- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Sun, 28 May 2023 13:32:33 +0200
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>, Willy Tarreau <w@1wt.eu>
- Cc: ietf-http-wg@w3.org
On 28.05.2023 09:28, Martin J. Dürst wrote:
> Hello Willy, Julian, others,
>
> There was a time (way back) when only the basic multilingual plane (i.e.
> a 16-bit space) had characters assigned. That turned out to not be
> enough, but it had the desirable side effect of keeping things compact.
> In UTF-8, that space can be covered by 3 bytes max per character, and it
> may have been that there were some implementations limited to 3 bytes
> max because they thought there wouldn't be any characters in the rest of
> the codespace.
>
> UTF-8 itself was defined to use up to 6 bytes per character, because it
> was covering the full 32-bit space of the early ISO-10646 drafts. There
> were definitely implementations that covered all that space.
>
> After some years, it became clear that a 16-bit space was not enough,
> but a 32-bit space was way too much. ISO and Unicode agreed on 17 planes
> of 16 bits, leading to an overall code space from U+0000 to U+10FFFF. As
> a result, the definition of UTF-8 was restricted to 4 bytes max per
> character (see RFC 3629, e.g.
> https://datatracker.ietf.org/doc/html/rfc3629#section-4, or your
> favorite Unicode version, or ISO 10646).

Martin,

thanks for the wonderful explanation!

Best regards, Julian
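
The byte counts mentioned above (3 bytes max for the basic multilingual plane, 4 bytes max for the code space up to U+10FFFF) can be checked with a minimal sketch along these lines. The helper name `utf8_length` is hypothetical and appears neither in this thread nor in RFC 3629. It only mirrors the range boundaries of the RFC 3629 form of UTF-8 and compares them against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes needed for one code point in UTF-8 as restricted by RFC 3629.

    Hypothetical helper for illustration only.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the U+0000..U+10FFFF code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable in UTF-8")
    if code_point <= 0x7F:
        return 1  # ASCII range
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3  # rest of the basic multilingual plane
    return 4      # planes 1 to 16, up to U+10FFFF


# Cross-check the boundaries against Python's own UTF-8 encoder.
for cp in (0x41, 0xFC, 0x20AC, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

An implementation limited to 3 bytes, as described in the quoted text, would correspond to rejecting everything past the `0xFFFF` branch.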
Received on Sunday, 28 May 2023 11:33:08 UTC