Re: ixampl goes Unicode

Because UTF-8 can be validated as a stream, you need a buffer of at most
byte[3], although I suspect that for the purposes of pure validation you can
get away with a buffer of byte[1]. The validation itself is a simple bit
pattern match and is trivial, e.g. GitHub.com/digitalpreservation/utf8-validator
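
A minimal sketch of that kind of bit-pattern check, in Python. The class and
method names are made up for illustration and this is not the API of the
utf8-validator project linked above. It assumes "pure validation" means the
lead/continuation patterns only, so it does not reject overlong encodings,
surrogates, or code points above U+10FFFF. The only state carried between
bytes is a count of continuation bytes still expected.

```python
class Utf8PatternValidator:
    """Streaming check of UTF-8 bit patterns (hypothetical helper)."""

    def __init__(self):
        self.pending = 0   # continuation bytes still expected
        self.ok = True     # becomes False at the first bad byte

    def feed(self, chunk: bytes) -> bool:
        """Consume a chunk of any size; return False once the stream is bad."""
        for b in chunk:
            if not self.ok:
                break
            if self.pending:
                if b & 0xC0 == 0x80:      # 10xxxxxx: continuation byte
                    self.pending -= 1
                else:
                    self.ok = False
            elif b < 0x80:                # 0xxxxxxx: ASCII
                pass
            elif b & 0xE0 == 0xC0:        # 110xxxxx: leads a 2 byte character
                self.pending = 1
            elif b & 0xF0 == 0xE0:        # 1110xxxx: leads a 3 byte character
                self.pending = 2
            elif b & 0xF8 == 0xF0:        # 11110xxx: leads a 4 byte character
                self.pending = 3
            else:                         # stray continuation or #F8-#FF
                self.ok = False
        return self.ok

    def complete(self) -> bool:
        """True if the stream was good and did not end mid-character."""
        return self.ok and self.pending == 0
```

Chunks can be split anywhere, including in the middle of a character, since
nothing but the counter is kept between calls.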

On Tue, 16 Aug 2022, 08:20 Steven Pemberton, <steven.pemberton@cwi.nl>
wrote:

>  > If this hasn't been written up anywhere, it would be great as a very short paper. :)
> Any suggestion of where it would be suitable to submit to?
>
> BTW, one extra nice tidbit: the bytes of a Unicode character are just a
> base-64 encoding of its codepoint. Each byte is one digit, and the value of
> each digit is just its position in its range. So for instance #C1 is in the
> range [#C0-#DF], and so has value 1.
>
>  > Do you have a separate check for the illegal characters?
>
> I don't, and I wouldn't know what to do with one either. I don't have a
> check for a string beginning with a continuation byte either. I just assume
> the stream is already good.
>
>  > P.S. I'd love to see the Unicode classes
>
> https://www.fileformat.info/info/unicode/category/index.htm
>
>  > (and generally, the entire ABC implementation) when you have a chance.
>
> Actually this whole exercise resulted from you asking for the sources, and
> me tidying them up for release... Give me a day or so.
>
> Best wishes,
>
> Steven
>
>  >
>  >
>  > On 8/15/22 5:39 PM, Steven Pemberton wrote:
>  > > It is now live.
>  > > I haven't yet updated the Unicode character classes though.
>  > >
>  > > Steven
>  > >
>  > > On Monday 15 August 2022 18:07:44 (+02:00), Steven Pemberton wrote:
>  > >
>  > > > A weird thing happened yesterday, quite unexpected (to me): I got ixampl working with Unicode characters.
>  > > > I'd never thought it possible, because ABC has only 8 bit characters, and they are atomic: no bit operations, no conversion functions, and UTF-8 is always described in terms of bit patterns.
>  > > >
>  > > > And then yesterday, I had a brainwave. There are only 256 bytes. 128 of them are ASCII, and they just represent themselves (that's the reason UTF-8 exists).
>  > > >
>  > > > The other, non-ASCII bytes each play a single role in any UTF-8 string:
>  > > >
>  > > > [#C0-#DF] are leading bytes of a 2 byte character
>  > > > [#E0-#EF] are leading bytes of a 3 byte character
>  > > > [#F0-#F7] are leading bytes of a 4 byte character.
>  > > > [#80-#BF] are continuation bytes of the multibyte characters,
>  > > > and [#F8-#FF] are illegal.
>  > > >
>  > > > What this meant was that I could make a 256-entry byte array, start, where each entry describes that role: 0 for continuations, 1 for ASCII, 2 for the leading byte of 2 byte characters, and so on for 3 and 4.
>  > > >
>  > > > In ABC the | operator delivers the first n bytes of a string
>  > > >
>  > > > "dishonest" | 4 = "dish"
>  > > >
>  > > > so to extract the next Unicode character from a string s, all I have to do is
>  > > >
>  > > > s|start[s|1]
>  > > >
>  > > > Bingo!
>  > > >
>  > > > The new ixml is not online yet: just running the regression tests.
>  > > >
>  > > > Steven
>  > > >
>  > > >
>  > >
>  >
>  >
>
>
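
For anyone who wants to try the two ideas quoted above (the 256-entry start
table and the base-64 digit reading of the bytes), here is a rough Python
sketch. It is not the ABC code from ixampl. The names START, RANGES, chars,
digit and codepoint are made up for illustration, and a Python slice s[:n]
stands in for ABC's s|n. Giving [#F8-#FF] a length of 1 is my own assumption
so that a scan always advances; the original message only says those bytes
are illegal.

```python
# Role of each byte value: 0 = continuation, 1 = ASCII,
# 2/3/4 = leading byte of a 2/3/4 byte character.
# Assumption: illegal bytes #F8-#FF get 1 so a scan still moves forward.
START = bytes(
    1 if b < 0x80 else
    0 if b < 0xC0 else
    2 if b < 0xE0 else
    3 if b < 0xF0 else
    4 if b < 0xF8 else
    1
    for b in range(256)
)

# The byte ranges from the quoted message (ASCII added as its own range).
RANGES = [(0x00, 0x7F), (0x80, 0xBF), (0xC0, 0xDF), (0xE0, 0xEF), (0xF0, 0xF7)]


def chars(s: bytes):
    """Yield each character of s as bytes, i.e. s|start[s|1] repeated."""
    while s:
        n = START[s[0]]
        if n == 0:
            # The original assumes a good stream; this sketch just stops.
            raise ValueError("string begins with a continuation byte")
        yield s[:n]
        s = s[n:]


def digit(b: int) -> int:
    """Value of a byte = its position within its range."""
    for lo, hi in RANGES:
        if lo <= b <= hi:
            return b - lo
    raise ValueError("illegal byte")


def codepoint(char: bytes) -> int:
    """Read the bytes of one character as base-64 digits, high digit first."""
    value = 0
    for b in char:
        value = value * 64 + digit(b)
    return value
```

For example, the euro sign is the bytes #E2 #82 #AC, which have digit values
2, 2 and 44, and (2 * 64 + 2) * 64 + 44 = 8364 = #20AC.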

Received on Wednesday, 17 August 2022 21:55:22 UTC