- From: M Joel Dubinko <micah@dubinko.info>
- Date: Mon, 15 Aug 2022 22:45:30 -0400
- To: public-ixml@w3.org
Steven,

If this hasn't been written up anywhere, it would be great as a very short paper. :)

Do you have a separate check for the illegal characters?

j

P.S. I'd love to see the Unicode classes (and generally, the entire ABC implementation) when you have a chance.

On 8/15/22 5:39 PM, Steven Pemberton wrote:
> It is now live.
> I haven't yet updated the Unicode character classes though.
>
> Steven
>
> On Monday 15 August 2022 18:07:44 (+02:00), Steven Pemberton wrote:
>
> > A weird thing happened yesterday, quite unexpected (to me): I got ixampl working with Unicode characters.
> > I'd never thought it possible, because ABC has only 8 bit characters, and they are atomic: no bit operations, no conversion functions, and UTF-8 is always described in terms of bit patterns.
> >
> > And then yesterday, I had a brainwave. There are only 256 bytes. 128 of them are ASCII, and they just represent themselves (that's the reason UTF-8 exists).
> >
> > Of the other non-ASCII characters, they all play a single role in any UTF-8 string:
> >
> > [#C0-#DF] are leading bytes of a 2 byte character
> > [#E0-#EF] are leading bytes of a 3 byte character
> > [#F0-#F7] are leading bytes of a 4 byte character.
> > [#80-#BF] are continuation bytes of the multibyte characters,
> > and [#F8-#FF] are illegal.
> >
> > What this meant was that I could make a 256 long byte array, start, where each entry describes that role: 0 for continuations, 1 for ASCII, 2 for leading byte of 2 byte characters and so on for 3 and 4.
> >
> > In ABC the | operator delivers the first n bytes of a string
> >
> > "dishonest" | 4 = "dish"
> >
> > so to extract the next Unicode character from a string s, all I have to do is
> >
> > s|start[s|1]
> >
> > Bingo!
> >
> > The new ixml is not online yet: just running the regression tests.
> >
> > Steven
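
For readers following the thread, the table-lookup idea in the quoted message can be sketched outside ABC. The Python below is only an illustration under stated assumptions, not the ixampl code: the names `build_start_table`, `first_char` and `split_chars` are hypothetical, and giving the illegal bytes [#F8-#FF] the entry 0 is an assumption, since the message does not say what their entry is.

```python
def build_start_table():
    """Build the 256-entry role table described in the message.

    0 = continuation byte (and, by assumption here, illegal byte),
    1 = ASCII, 2/3/4 = leading byte of a 2/3/4 byte character.
    """
    start = [0] * 256                 # default covers #80-#BF and #F8-#FF
    for b in range(0x00, 0x80):
        start[b] = 1                  # ASCII represents itself
    for b in range(0xC0, 0xE0):
        start[b] = 2
    for b in range(0xE0, 0xF0):
        start[b] = 3
    for b in range(0xF0, 0xF8):
        start[b] = 4
    return start


START = build_start_table()


def first_char(s: bytes) -> bytes:
    """Analogue of ABC's  s|start[s|1] : the bytes of the first character."""
    if not s:
        return b""
    return s[:START[s[0]]]            # slicing plays the role of ABC's | operator


def split_chars(s: bytes) -> list:
    """Repeatedly take the first character; stop on a byte with role 0."""
    chars = []
    while s:
        n = START[s[0]]
        if n == 0:
            # Assumption: treat a stray continuation or illegal byte as an error.
            raise ValueError("unexpected byte 0x%02X" % s[0])
        chars.append(s[:n])
        s = s[n:]
    return chars
```

No bit operations are needed, only indexing and slicing. The sketch looks only at the first byte of each character: it does not check that the following bytes are continuation bytes, and it accepts lead bytes such as #C0, #C1 and #F5-#F7 that never occur in well-formed UTF-8, so it assumes the input is already valid.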
Received on Tuesday, 16 August 2022 02:45:46 UTC