- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Tue, 16 Aug 2022 06:56:06 +0000
- To: ixml <public-ixml@w3.org>
Received on Tuesday, 16 August 2022 06:56:23 UTC
> And then yesterday, I had a brainwave. There are only 256 bytes. 128 of
> them are ASCII, and they just represent themselves (that's the reason UTF-8
> exists).
>
> Of the other non-ASCII characters, they all play a single role in any
> UTF-8 string:
>
> [#C0-#DF] are leading bytes of a 2 byte character
> [#E0-#EF] are leading bytes of a 3 byte character
> [#F0-#F7] are leading bytes of a 4 byte character.
> [#80-#BF] are continuation bytes of the multibyte characters,
> and [#F8-#FF] are illegal.
I realised in the shower this morning that this means that UTF-8-encoded
Unicode is context-free. Using a byte-oriented input stream, you can
describe it as:
unicode: char*.
char: ascii;
      h2, c;
      h3, c, c;
      h4, c, c, c.
ascii: [#0-#7f].
h2: [#c0-#df].
h3: [#e0-#ef].
h4: [#f0-#f7].
c: [#80-#bf].
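
For anyone who wants to try the idea without an ixml processor, the grammar above can be sketched as a byte-level recogniser in Python. This is only an illustration under my own assumptions: the names `lead_length` and `split_chars` are made up for this example, and, like the grammar, it checks only the leading-byte/continuation-byte structure. It does not reject overlong encodings, surrogates, or code points above #10FFFF, so it is looser than a strict UTF-8 validator.

```python
def lead_length(b: int) -> int:
    """Return how many bytes the character starting with byte b occupies,
    or 0 if b cannot start a character (continuation or illegal byte)."""
    if b <= 0x7F:            # ascii: [#0-#7f]
        return 1
    if 0xC0 <= b <= 0xDF:    # h2: [#c0-#df]
        return 2
    if 0xE0 <= b <= 0xEF:    # h3: [#e0-#ef]
        return 3
    if 0xF0 <= b <= 0xF7:    # h4: [#f0-#f7]
        return 4
    return 0                 # [#80-#bf] or [#f8-#ff]


def split_chars(data: bytes) -> list:
    """Split data into per-character byte sequences following the grammar:
    unicode: char*.  Raises ValueError if the bytes do not match."""
    chars = []
    i = 0
    while i < len(data):
        n = lead_length(data[i])
        if n == 0:
            raise ValueError(f"byte {data[i]:#04x} at offset {i} cannot start a character")
        chunk = data[i:i + n]
        # Every byte after the leading one must be a continuation byte c: [#80-#bf].
        if len(chunk) != n or any(not 0x80 <= c <= 0xBF for c in chunk[1:]):
            raise ValueError(f"bad or truncated {n}-byte character at offset {i}")
        chars.append(chunk)
        i += n
    return chars
```

For example, `split_chars("aé€".encode("utf-8"))` yields one 1-byte, one 2-byte, and one 3-byte sequence, while a stray continuation byte such as `b"\x80"` raises `ValueError`.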
Steven