Re: ixampl goes Unicode from Steven Pemberton on 2022-08-17 (public-ixml@w3.org from August 2022)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Wed, 17 Aug 2022 09:53:08 +0000
To: ixml <public-ixml@w3.org>
Message-Id: <1660729893663.3345529151.4134707160@cwi.nl>

Initial timing experiments suggest it slows the implementation down by 
about 3%.
Steven

On Monday 15 August 2022 18:07:44 (+02:00), Steven Pemberton wrote:

 > A weird thing happened yesterday, quite unexpected (to me): I got ixampl
 > working with Unicode characters.
 > I'd never thought it possible, because ABC has only 8-bit characters,
 > and they are atomic: no bit operations, no conversion functions, and UTF-8
 > is always described in terms of bit patterns.
 >
 > And then yesterday, I had a brainwave. There are only 256 possible byte
 > values. 128 of them are ASCII, and they just represent themselves (that's
 > the reason UTF-8 exists).
 >
 > The other, non-ASCII bytes each play a single role in any UTF-8 string:
 >
 > [#C0-#DF] are leading bytes of a 2-byte character,
 > [#E0-#EF] are leading bytes of a 3-byte character,
 > [#F0-#F7] are leading bytes of a 4-byte character,
 > [#80-#BF] are continuation bytes of the multibyte characters,
 > and [#F8-#FF] are illegal.
 >
 > What this meant was that I could make a 256-entry array, start, indexed
 > by byte, where each entry describes that role: 0 for continuations, 1 for
 > ASCII, 2 for the leading byte of a 2-byte character, and so on for 3 and 4.
 >
 > In ABC the | operator delivers the first n bytes of a string:
 >
 > "dishonest" | 4 = "dish"
 >
 > so to extract the next Unicode character from a string s, all I have to
 > do is
 >
 > s|start[s|1]
 >
 > Bingo!
 >
 > The new ixml is not online yet: just running the regression tests.
 >
 > Steven
 >
 >
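
For readers who don't know ABC, here is a rough Python sketch of the table-lookup idea in the quoted message. It is not the ixampl code: the names build_start_table, START, first_char and chars are invented for illustration, and Python byte slicing stands in for ABC's | operator. The table follows the byte ranges exactly as listed in the message; strictly speaking, #C0, #C1 and #F5-#F7 never occur in valid UTF-8 either, but the sketch does no validation beyond rejecting bytes whose entry is 0.

```python
def build_start_table():
    """Map each byte value 0..255 to the length of the UTF-8 sequence it starts.

    Entries left at 0 are continuation bytes (0x80-0xBF) and the
    illegal bytes (0xF8-0xFF), as in the quoted message.
    """
    start = [0] * 256
    for b in range(0x00, 0x80):   # ASCII: a character on its own
        start[b] = 1
    for b in range(0xC0, 0xE0):   # leading byte of a 2-byte character
        start[b] = 2
    for b in range(0xE0, 0xF0):   # leading byte of a 3-byte character
        start[b] = 3
    for b in range(0xF0, 0xF8):   # leading byte of a 4-byte character
        start[b] = 4
    return start


START = build_start_table()


def first_char(s: bytes) -> bytes:
    """Rough analogue of ABC's  s|start[s|1]  for a non-empty byte string."""
    return s[:START[s[0]]]


def chars(s: bytes) -> list:
    """Split a UTF-8 byte string into one bytes object per character."""
    out = []
    while s:
        n = START[s[0]]
        if n == 0:
            # A continuation or illegal byte where a character should start.
            raise ValueError("unexpected byte 0x%02X" % s[0])
        out.append(s[:n])
        s = s[n:]
    return out
```

No bit operations are needed anywhere: the only work per character is one table lookup and one prefix operation, which is what makes the approach workable in a language with atomic 8-bit characters.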

Received on Wednesday, 17 August 2022 09:53:24 UTC