Re: ixampl goes Unicode from Steven Pemberton on 2022-08-24 (public-ixml@w3.org from August 2022)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Wed, 24 Aug 2022 14:42:48 +0000
To: M Joel Dubinko <micah@dubinko.info>, public-ixml@w3.org
Message-Id: <1661351805024.1142190254.2697893319@cwi.nl>

Hey Joel,
Enclosed are the sources for the bootstrap parser, and the ixml parser in 
ABC.

I'm leaving on travels tomorrow: I didn't quite get the sources into a 
state that I am willing to publish, but they are very close.

The main things still needing to be fixed: Since the introduction of 
Unicode, some error messages give the wrong position in the line for the 
error; and I need to clean up the handling of newlines in the serialisation 
code. And I still have to add the Unicode character classes of course.

Anyway, if you have any questions, don't hesitate to ask.

Best wishes,

Steven


On Tuesday 16 August 2022 04:45:30 (+02:00), M Joel Dubinko wrote:

 > Steven,
 >
 > If this hasn't been written up anywhere, it would be great as a very
 > short paper. :)
 >
 > Do you have a separate check for the illegal characters?
 >
 > j
 >
 >
 > P.S. I'd love to see the Unicode classes (and generally, the entire ABC
 > implementation) when you have a chance.
 >
 >
 > On 8/15/22 5:39 PM, Steven Pemberton wrote:
 > > It is now live.
 > > I haven't yet updated the Unicode character classes though.
 > >
 > > Steven
 > >
 > > On Monday 15 August 2022 18:07:44 (+02:00), Steven Pemberton wrote:
 > >
 > > > A weird thing happened yesterday, quite unexpected (to me): I got
 > > > ixampl working with Unicode characters.
 > > > I'd never thought it possible, because ABC has only 8 bit
 > > > characters, and they are atomic: no bit operations, no conversion
 > > > functions, and UTF-8 is always described in terms of bit patterns.
 > > >
 > > > And then yesterday, I had a brainwave. There are only 256 bytes. 128
 > > > of them are ASCII, and they just represent themselves (that's the reason
 > > > UTF-8 exists).
 > > >
 > > > The other, non-ASCII bytes each play a single role in any UTF-8
 > > > string:
 > > >
 > > > [#C0-#DF] are leading bytes of a 2 byte character
 > > > [#E0-#EF] are leading bytes of a 3 byte character
 > > > [#F0-#F7] are leading bytes of a 4 byte character.
 > > > [#80-#BF] are continuation bytes of the multibyte characters,
 > > > and [#F8-#FF] are illegal.
 > > >
 > > > What this meant was that I could make a 256 long byte array, start,
 > > > where each entry describes that role: 0 for continuations, 1 for ASCII, 2
 > > > for leading byte of 2 byte characters and so on for 3 and 4.
 > > >
 > > > In ABC the | operator delivers the first n bytes of a string
 > > >
 > > > "dishonest" | 4 = "dish"
 > > >
 > > > so to extract the next Unicode character from a string s, all I have
 > > > to do is
 > > >
 > > > s|start[s|1]
 > > >
 > > > Bingo!
 > > >
 > > > The new ixml is not online yet: just running the regression tests.
 > > >
 > > > Steven
 > > >
 > > >
 > >
 >
 >
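
The byte-role table described in the quoted message of 15 August can be
sketched outside ABC as well. What follows is a rough Python analogue, not
the ABC source from the attachments: the names build_start_table, head,
next_char and split_chars are made up for illustration, head(s, n) stands in
for ABC's s|n, and the entry used for the illegal bytes #F8-#FF (0 here) is
an assumption, since the message does not say what they hold. Like the
original one-liner, it only looks at the first byte and does not validate
the continuation bytes.

```python
def build_start_table():
    """Map each byte value 0..255 to its role, using the ranges from the message."""
    start = [0] * 256                 # default 0: continuation bytes #80-#BF
    for b in range(0x00, 0x80):
        start[b] = 1                  # ASCII: a 1-byte character
    for b in range(0xC0, 0xE0):
        start[b] = 2                  # leading byte of a 2-byte character
    for b in range(0xE0, 0xF0):
        start[b] = 3                  # leading byte of a 3-byte character
    for b in range(0xF0, 0xF8):
        start[b] = 4                  # leading byte of a 4-byte character
    # #F8-#FF are illegal; leaving them at 0 is an assumption of this sketch.
    return start


START = build_start_table()


def head(s: bytes, n: int) -> bytes:
    """Stand-in for ABC's  s|n : the first n bytes of s."""
    return s[:n]


def next_char(s: bytes) -> bytes:
    """Analogue of  s|start[s|1] : the bytes of the first character of s."""
    if not s:
        return b""
    return head(s, START[s[0]])


def split_chars(s: bytes) -> list:
    """Cut a UTF-8 byte string into per-character byte strings."""
    chars = []
    while s:
        c = next_char(s)
        if not c:
            # The first byte was a continuation or illegal byte (table entry 0).
            raise ValueError("byte %#x cannot start a character" % s[0])
        chars.append(c)
        s = s[len(c):]
    return chars
```

For example, split_chars("aé€".encode("utf-8")) yields three byte strings of
lengths 1, 2 and 3, one per character, without any bit operations on the
bytes themselves.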

Attachments

application/octet-stream attachment: ixml-parser.out
application/octet-stream attachment: abc-parser.out

Received on Wednesday, 24 August 2022 14:43:10 UTC