- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Wed, 24 Aug 2022 18:01:27 +0000
- To: M Joel Dubinko <micah@dubinko.info>, public-ixml@w3.org
BTW, this is the code for the next version, which isn't live yet, so if you try my live version out, it will behave slightly differently.

Steven

On Wednesday 24 August 2022 16:42:48 (+02:00), Steven Pemberton wrote:

> Hey Joel,
>
> Enclosed are the sources for the bootstrap parser, and the ixml parser in ABC.
>
> I'm leaving on travels tomorrow: I didn't quite get the sources into a state that I am willing to publish, but they are very close.
>
> The main things still needing to be fixed: Since the introduction of Unicode, some error messages give the wrong position in the line for the error; and I need to clean up the handling of newlines in the serialisation code. And I still have to add the Unicode character classes of course.
>
> Anyway, if you have any questions, don't hesitate to ask.
>
> Best wishes,
>
> Steven
>
> On Tuesday 16 August 2022 04:45:30 (+02:00), M Joel Dubinko wrote:
>
> > Steven,
> >
> > If this hasn't been written up anywhere, it would be great as a very short paper. :)
> >
> > Do you have a separate check for the illegal characters?
> >
> > j
> >
> > P.S. I'd love to see the Unicode classes (and generally, the entire ABC implementation) when you have a chance.
> >
> > On 8/15/22 5:39 PM, Steven Pemberton wrote:
> > > It is now live.
> > > I haven't yet updated the Unicode character classes though.
> > >
> > > Steven
> > >
> > > On Monday 15 August 2022 18:07:44 (+02:00), Steven Pemberton wrote:
> > >
> > > > A weird thing happened yesterday, quite unexpected (to me): I got ixampl working with Unicode characters.
> > > >
> > > > I'd never thought it possible, because ABC has only 8 bit characters, and they are atomic: no bit operations, no conversion functions, and UTF-8 is always described in terms of bit patterns.
> > > >
> > > > And then yesterday, I had a brainwave. There are only 256 bytes. 128 of them are ASCII, and they just represent themselves (that's the reason UTF-8 exists).
> > > >
> > > > Of the other non-ASCII characters, they all play a single role in any UTF-8 string:
> > > >
> > > > [#C0-#DF] are leading bytes of a 2 byte character
> > > > [#E0-#EF] are leading bytes of a 3 byte character
> > > > [#F0-#F7] are leading bytes of a 4 byte character.
> > > > [#80-#BF] are continuation bytes of the multibyte characters,
> > > > and [#F8-#FF] are illegal.
> > > >
> > > > What this meant was that I could make a 256 long byte array, start, where each entry describes that role: 0 for continuations, 1 for ASCII, 2 for leading byte of 2 byte characters and so on for 3 and 4.
> > > >
> > > > In ABC the | operator delivers the first n bytes of a string
> > > >
> > > > "dishonest" | 4 = "dish"
> > > >
> > > > so to extract the next Unicode character from a string s, all I have to do is
> > > >
> > > > s|start[s|1]
> > > >
> > > > Bingo!
> > > >
> > > > The new ixml is not online yet: just running the regression tests.
> > > >
> > > > Steven
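For readers without ABC to hand, the lookup-table idea in the oldest message above can be sketched in Python. This is an illustrative transliteration, not the ABC source mentioned in the thread. The names START, build_start_table, next_char and split_chars are hypothetical. Two details are assumptions, because the thread does not say how they are handled: the illegal bytes #F8-#FF are given width 1, and a stray continuation byte (width 0) is consumed as a single byte so that the scan always advances.

```python
def build_start_table():
    """Map each of the 256 byte values to its role, following the email's scheme."""
    start = [0] * 256                 # 0x80-0xBF stay 0: continuation bytes
    for b in range(0x00, 0x80):
        start[b] = 1                  # ASCII bytes represent themselves
    for b in range(0xC0, 0xE0):
        start[b] = 2                  # leading byte of a 2 byte character
    for b in range(0xE0, 0xF0):
        start[b] = 3                  # leading byte of a 3 byte character
    for b in range(0xF0, 0xF8):
        start[b] = 4                  # leading byte of a 4 byte character
    for b in range(0xF8, 0x100):
        start[b] = 1                  # ASSUMPTION: illegal bytes are taken one at a time
    return start


START = build_start_table()


def next_char(s):
    """Python analogue of the ABC expression s|start[s|1]; s must be non-empty bytes."""
    return s[:START[s[0]]]


def split_chars(s):
    """Split a UTF-8 byte string into its characters using only table lookups."""
    chars = []
    while s:
        c = next_char(s)
        if not c:                     # ASSUMPTION: stray continuation byte, take one byte
            c = s[:1]
        chars.append(c)
        s = s[len(c):]
    return chars
```

For example, split_chars("na\u00efve".encode("utf-8")) yields five byte strings, the third of which is the two byte encoding of the i with diaeresis. Note that no bit operations are needed anywhere, which is the point of the trick. The table does not validate its input: it trusts each leading byte and does not check that the bytes that follow are really continuation bytes.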
Received on Wednesday, 24 August 2022 18:01:47 UTC