Re: ixampl goes Unicode from Steven Pemberton on 2022-08-16 (public-ixml@w3.org from August 2022)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Tue, 16 Aug 2022 07:44:05 +0000
To: ixml <public-ixml@w3.org>
Message-Id: <1660635827927.801030448.418227927@cwi.nl>

s/Unicode/UTF-8/G

On Tuesday 16 August 2022 08:56:06 (+02:00), Steven Pemberton wrote:



 > And then yesterday, I had a brainwave. There are only 256 bytes. 128 of
 > them are ASCII, and they just represent themselves (that's the reason
 > UTF-8 exists).
 >
 > The other, non-ASCII bytes each play a single role in any UTF-8 string:
 >
 > [#C0-#DF] are leading bytes of a 2-byte character,
 > [#E0-#EF] are leading bytes of a 3-byte character,
 > [#F0-#F7] are leading bytes of a 4-byte character,
 > [#80-#BF] are continuation bytes of the multibyte characters,
 > and [#F8-#FF] are illegal.
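
As a rough illustration of the byte roles listed above, here is a minimal Python sketch. The function name byte_role and the role labels are made up for this example; the ranges are exactly the ones quoted, so it does not apply the stricter rules of RFC 3629, which additionally rules out #C0, #C1 and #F5-#F7.

```python
def byte_role(b: int) -> str:
    """Return the single role a byte value plays in a UTF-8 string.

    Hypothetical helper: the ranges follow the list in the message,
    not the stricter well-formedness rules of RFC 3629.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"          # [#00-#7F] represents itself
    if b <= 0xBF:
        return "continuation"   # [#80-#BF]
    if b <= 0xDF:
        return "lead2"          # [#C0-#DF] starts a 2-byte character
    if b <= 0xEF:
        return "lead3"          # [#E0-#EF] starts a 3-byte character
    if b <= 0xF7:
        return "lead4"          # [#F0-#F7] starts a 4-byte character
    return "illegal"            # [#F8-#FF]


# "é" encodes as C3 A9 and "€" as E2 82 AC.
roles = [byte_role(b) for b in "é€".encode("utf-8")]
print(roles)
```

Because every byte value falls in exactly one range, the role can be decided without looking at neighbouring bytes.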


I realised in the shower this morning that this means that UTF-8 is 
context-free. Using a byte-oriented input stream, you can describe it 
as:
 unicode: char*.
 char: ascii;
       h2, c;
       h3, c, c;
       h4, c, c, c.
 ascii: [#0-#7f].
 h2: [#c0-#df].
 h3: [#e0-#ef].
 h4: [#f0-#f7].
 c: [#80-#bf].

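
For anyone who wants to try the grammar without an ixml processor, here is a hand-written recogniser in Python that is meant to accept the same byte strings. It is only a sketch: the name matches_grammar is invented for this example, and like the grammar it checks only the shape of each character, so it accepts overlong forms such as #C0 #80 that a strict UTF-8 decoder would reject.

```python
def matches_grammar(data: bytes) -> bool:
    """Check data against: unicode: char*.  char: ascii; h2, c; h3, c, c; h4, c, c, c.

    Hypothetical helper mirroring the grammar's byte ranges only.
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:              # ascii
            need = 0
        elif 0xC0 <= b <= 0xDF:    # h2
            need = 1
        elif 0xE0 <= b <= 0xEF:    # h3
            need = 2
        elif 0xF0 <= b <= 0xF7:    # h4
            need = 3
        else:                      # a stray c, or [#F8-#FF]
            return False
        tail = data[i + 1 : i + 1 + need]
        # Each lead byte must be followed by exactly `need` continuation bytes (c).
        if len(tail) != need or any(not 0x80 <= c <= 0xBF for c in tail):
            return False
        i += 1 + need
    return True


print(matches_grammar("héllo €😀".encode("utf-8")))  # well-formed input
print(matches_grammar(b"\xe2\x82"))                  # truncated 3-byte character
```

The lead byte alone fixes how many continuation bytes follow, so one pass with no backtracking is enough.
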
Steven

Received on Tuesday, 16 August 2022 07:44:18 UTC