Re: ixampl goes Unicode

Steven,

If this hasn't been written up anywhere, it would be great as a very 
short paper. :)

Do you have a separate check for the illegal characters?

j


P.S. I'd love to see the Unicode classes (and generally, the entire ABC 
implementation) when you have a chance.


On 8/15/22 5:39 PM, Steven Pemberton wrote:
> It is now live.
> I haven't yet updated the Unicode character classes though.
>
> Steven
>
> On Monday 15 August 2022 18:07:44 (+02:00), Steven Pemberton wrote:
>
> > A weird thing happened yesterday, quite unexpected (to me): I got
> > ixampl working with Unicode characters.
> > I'd never thought it possible, because ABC has only 8-bit
> > characters, and they are atomic: no bit operations, no conversion
> > functions, and UTF-8 is always described in terms of bit patterns.
> >
> > And then yesterday, I had a brainwave. There are only 256 bytes. 128
> > of them are ASCII, and they just represent themselves (that's the
> > reason UTF-8 exists).
> >
> > The other, non-ASCII bytes each play a single role in
> > any UTF-8 string:
> >
> > [#C0-#DF] are leading bytes of a 2 byte character
> > [#E0-#EF] are leading bytes of a 3 byte character
> > [#F0-#F7] are leading bytes of a 4 byte character.
> > [#80-#BF] are continuation bytes of the multibyte characters,
> > and [#F8-#FF] are illegal.
> >
> > What this meant was that I could make a 256-entry byte array, start,
> > where each entry describes that role: 0 for continuations, 1 for
> > ASCII, 2 for the leading byte of 2-byte characters, and so on for 3 and 4.
> >
> > In ABC the | operator delivers the first n bytes of a string:
> >
> > "dishonest" | 4 = "dish"
> >
> > so to extract the next Unicode character from a string s, all I have
> > to do is
> >
> > s|start[s|1]
> >
> > Bingo!
> >
> > The new ixml is not online yet: just running the regression tests.
> >
> > Steven
> >
> >
>
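
For readers who don't know ABC, the table-lookup idea quoted above can be sketched in Python. This is only an illustration under stated assumptions, not the ixampl code: the names START, next_char and chars are made up here, and mapping the illegal bytes [#F8-#FF] to 0 is this sketch's choice, since the message does not say what the table holds for them.

```python
# Role table indexed by byte value, following the ranges in the quoted message:
#   0 = continuation byte, 1 = ASCII, 2/3/4 = leading byte of a 2/3/4-byte character.
# Assumption: the illegal bytes #F8-#FF are also given 0 (not stated in the message).
START = bytes(
    1 if b < 0x80 else   # #00-#7F: ASCII
    0 if b < 0xC0 else   # #80-#BF: continuation bytes
    2 if b < 0xE0 else   # #C0-#DF: leading byte, 2-byte character
    3 if b < 0xF0 else   # #E0-#EF: leading byte, 3-byte character
    4 if b < 0xF8 else   # #F0-#F7: leading byte, 4-byte character
    0                    # #F8-#FF: illegal
    for b in range(256)
)


def next_char(s: bytes) -> bytes:
    """Python analogue of s|start[s|1]: the bytes of the first character of s.

    s must be non-empty. Only the first byte is inspected.
    """
    return s[:START[s[0]]]


def chars(s: bytes) -> list:
    """Split a UTF-8 byte string into one bytes object per character."""
    out = []
    while s:
        c = next_char(s)
        if not c:
            # First byte was a continuation or illegal byte: no progress possible.
            raise ValueError("string does not start with a leading or ASCII byte")
        out.append(c)
        s = s[len(c):]
    return out


text = "a\u00e9\u20ac".encode("utf-8")
print(next_char(text))
print([c.decode("utf-8") for c in chars(text)])
```

As in the quoted one-liner, only the first byte of each character is looked at. The sketch does not check that the following bytes really are continuation bytes or that the string is not truncated, so a separate validity check would still be needed.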

Received on Tuesday, 16 August 2022 02:45:46 UTC