- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Mon, 15 Aug 2022 16:07:44 +0000
- To: ixml <public-ixml@w3.org>
A weird thing happened yesterday, quite unexpected (to me): I got ixampl working with Unicode characters.

I'd never thought it possible, because ABC has only 8-bit characters, and they are atomic: no bit operations, no conversion functions. And UTF-8 is always described in terms of bit patterns.

And then yesterday, I had a brainwave. There are only 256 possible byte values. 128 of them are ASCII, and they just represent themselves (that's the reason UTF-8 exists). Each of the other, non-ASCII bytes plays a single role in any UTF-8 string:

    [#C0-#DF] are leading bytes of a 2-byte character,
    [#E0-#EF] are leading bytes of a 3-byte character,
    [#F0-#F7] are leading bytes of a 4-byte character,
    [#80-#BF] are continuation bytes of the multibyte characters, and
    [#F8-#FF] are illegal.

What this meant was that I could make a 256-entry array, start, where each entry describes that role: 0 for continuations, 1 for ASCII, 2 for the leading byte of a 2-byte character, and so on for 3 and 4.

In ABC the | operator delivers the first n bytes of a string:

    "dishonest" | 4 = "dish"

so to extract the next Unicode character from a string s, all I have to do is

    s|start[s|1]

Bingo!

The new ixml is not online yet: just running the regression tests.

Steven
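
For readers without ABC to hand, the table-lookup trick described above can be sketched in Python. This is an illustration only, not the ixampl code: the names build_start, START, next_char and split_chars are hypothetical, and giving the illegal bytes #F8-#FF the value 0 is an assumption (the message only says what the other roles map to).

```python
def build_start() -> list[int]:
    """Build the 256-entry role table indexed by byte value."""
    # 0 marks continuation bytes (0x80-0xBF). Assumption: the illegal
    # bytes 0xF8-0xFF are also left at 0, which the message does not specify.
    start = [0] * 256
    for b in range(0x00, 0x80):   # ASCII: a 1-byte character
        start[b] = 1
    for b in range(0xC0, 0xE0):   # leading byte of a 2-byte character
        start[b] = 2
    for b in range(0xE0, 0xF0):   # leading byte of a 3-byte character
        start[b] = 3
    for b in range(0xF0, 0xF8):   # leading byte of a 4-byte character
        start[b] = 4
    return start


START = build_start()


def next_char(s: bytes) -> bytes:
    """Return the bytes of the first character of s, like s|start[s|1] in ABC."""
    if not s:
        return b""
    # s[:n] plays the role of ABC's  s | n  (the first n bytes of s).
    return s[:START[s[0]]]


def split_chars(s: bytes) -> list[bytes]:
    """Split a UTF-8 byte string into characters by repeated lookup."""
    chars = []
    while s:
        c = next_char(s)
        if not c:
            # The string starts with a continuation or illegal byte.
            raise ValueError("not at the start of a UTF-8 character")
        chars.append(c)
        s = s[len(c):]
    return chars
```

As in the ABC version, no bit operations are involved: the length of each character comes from a single table lookup on its first byte. The sketch does not check that the bytes following a leading byte really are continuation bytes.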
Received on Monday, 15 August 2022 16:07:58 UTC