Re: ixampl goes Unicode from Steven Pemberton on 2022-08-24 (public-ixml@w3.org from August 2022)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Wed, 24 Aug 2022 18:01:27 +0000
To: M Joel Dubinko <micah@dubinko.info>, public-ixml@w3.org
Message-Id: <1661363984017.3219672388.3253413855@cwi.nl>

BTW, this is the code for the next version, which isn't live yet, so if you 
try my live version out, it will behave slightly differently.
Steven

On Wednesday 24 August 2022 16:42:48 (+02:00), Steven Pemberton wrote:

 > Hey Joel,
 > Enclosed are the sources for the bootstrap parser, and the ixml parser
 > in ABC.
 >
 > I'm leaving on travels tomorrow: I didn't quite get the sources into a
 > state that I am willing to publish, but they are very close.
 >
 > The main things still needing to be fixed: Since the introduction of
 > Unicode, some error messages give the wrong position in the line for the
 > error; and I need to clean up the handling of newlines in the serialisation
 > code. And I still have to add the Unicode character classes of course.
 >
 > Anyway, if you have any questions, don't hesitate to ask.
 >
 > Best wishes,
 >
 > Steven
 >
 >
 > On Tuesday 16 August 2022 04:45:30 (+02:00), M Joel Dubinko wrote:
 >
 > > Steven,
 > >
 > > If this hasn't been written up anywhere, it would be great as a very
 > > short paper. :)
 > >
 > > Do you have a separate check for the illegal characters?
 > >
 > > j
 > >
 > >
 > > P.S. I'd love to see the Unicode classes (and generally, the entire
 > > ABC implementation) when you have a chance.
 > >
 > >
 > > On 8/15/22 5:39 PM, Steven Pemberton wrote:
 > > > It is now live.
 > > > I haven't yet updated the Unicode character classes though.
 > > >
 > > > Steven
 > > >
 > > > On Monday 15 August 2022 18:07:44 (+02:00), Steven Pemberton wrote:
 > > >
 > > > > A weird thing happened yesterday, quite unexpected (to me): I got
 > > > > ixampl working with Unicode characters.
 > > > > I'd never thought it possible, because ABC has only 8 bit
 > > > > characters, and they are atomic: no bit operations, no conversion
 > > > > functions, and UTF-8 is always described in terms of bit patterns.
 > > > >
 > > > > And then yesterday, I had a brainwave. There are only 256 bytes.
 > > > > 128 of them are ASCII, and they just represent themselves (that's the
 > > > > reason UTF-8 exists).
 > > > >
 > > > > The other, non-ASCII bytes each play a single role in any UTF-8
 > > > > string:
 > > > >
 > > > > [#C0-#DF] are leading bytes of a 2 byte character
 > > > > [#E0-#EF] are leading bytes of a 3 byte character
 > > > > [#F0-#F7] are leading bytes of a 4 byte character.
 > > > > [#80-#BF] are continuation bytes of the multibyte characters,
 > > > > and [#F8-#FF] are illegal.
 > > > >
 > > > > What this meant was that I could make a 256 long byte array,
 > > > > start, where each entry describes that role: 0 for continuations, 1 for
 > > > > ASCII, 2 for leading byte of 2 byte characters and so on for 3 and 4.
 > > > >
 > > > > In ABC the | operator delivers the first n bytes of a string
 > > > >
 > > > > "dishonest" | 4 = "dish"
 > > > >
 > > > > so to extract the next Unicode character from a string s, all I
 > > > > have to do is
 > > > >
 > > > > s|start[s|1]
 > > > >
 > > > > Bingo!
 > > > >
 > > > > The new ixml is not online yet: just running the regression tests.
 > > > >
 > > > > Steven
 > > > >
 > > > >
 > > >
 > >
 > >
 >
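
The byte-role table described in the quoted message of 15 August can be
sketched in Python, as an illustration only and not the ABC source. The names
START, first_char and split_chars are hypothetical. The message does not say
what the table holds for the illegal bytes [#F8-#FF], so mapping them to 0
(the same as continuation bytes) and rejecting a 0 at the start of a character
are assumptions made here. Strict UTF-8 also never uses #C0, #C1 or #F5-#F7;
the sketch follows the ranges as given in the message and does not check for
those.

```python
def build_start_table():
    """Map each byte value 0..255 to the length of the character it starts."""
    # 0 marks continuation bytes (#80-#BF); using 0 for the illegal
    # bytes (#F8-#FF) as well is an assumption, not stated in the message.
    start = [0] * 256
    for b in range(0x00, 0x80):   # ASCII: one byte, represents itself
        start[b] = 1
    for b in range(0xC0, 0xE0):   # leading byte of a 2 byte character
        start[b] = 2
    for b in range(0xE0, 0xF0):   # leading byte of a 3 byte character
        start[b] = 3
    for b in range(0xF0, 0xF8):   # leading byte of a 4 byte character
        start[b] = 4
    return start


START = build_start_table()


def first_char(s: bytes) -> bytes:
    """Python analogue of the ABC expression s|start[s|1]."""
    return s[:START[s[0]]]


def split_chars(s: bytes) -> list:
    """Split a UTF-8 byte string into one byte string per character."""
    chars = []
    while s:
        n = START[s[0]]
        if n == 0:
            # Assumed handling: a continuation or illegal byte cannot
            # start a character.
            raise ValueError("byte %#x cannot start a character" % s[0])
        chars.append(s[:n])
        s = s[n:]
    return chars
```

For example, split_chars("a€".encode("utf-8")) yields the one-byte "a"
followed by the three bytes of the euro sign, with only a table lookup and a
slice per character and no bit operations.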

Received on Wednesday, 24 August 2022 18:01:47 UTC