- From: Adam Retter <adam@exist-db.org>
- Date: Thu, 18 Aug 2022 12:36:34 +0100
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: M Joel Dubinko <micah@dubinko.info>, ixml <public-ixml@w3.org>
- Message-ID: <CAJKLP9Y2B_rhUKDY00o-zDzaaz7zATTM5ru-bwwYK5x-dCsrWg@mail.gmail.com>
Doh! Of course!

On Thu, 18 Aug 2022, 12:22 Steven Pemberton, <steven.pemberton@cwi.nl> wrote:

> On Wednesday 17 August 2022 23:54:56 (+02:00), Adam Retter wrote:
>
> > As it's streamable, to validate you need a maximum buffer of byte[3],
> > although I suspect that for the purposes of pure validation you can get
> > away with a buffer of byte[1]. The validation itself is a simple
> > bit-pattern match and is trivial, e.g.
> > GitHub.com/digitalpreservation/utf8-validator
>
> That was exactly the problem: I'm using a programming language that
> doesn't have the concept of 'bit pattern'.
> That's why I used my solution (and needed to discover it first, because
> all solutions talk in terms of bit patterns).
>
> Best wishes,
>
> Steven
>
> On Tue, 16 Aug 2022, 08:20 Steven Pemberton, <steven.pemberton@cwi.nl> wrote:
>
>> > If this hasn't been written up anywhere, it would be great as a very
>> > short paper. :)
>>
>> Any suggestion of where it would be suitable to submit to?
>>
>> BTW, one extra nice tidbit: the bytes of a Unicode character are just a
>> base-64 encoding of its codepoint. Each byte is one digit, and the value
>> of each digit is just its position in its range. So for instance #C1 is
>> in the range [#C0-#DF], and so has value 1.
>>
>> > Do you have a separate check for the illegal characters?
>>
>> I don't, and I wouldn't know what to do with one either. I don't have a
>> check for a string beginning with a continuation byte either. I just
>> assume the stream is already good.
>>
>> > P.S. I'd love to see the Unicode classes
>> > https://www.fileformat.info/info/unicode/category/index.htm
>> > (and generally, the entire ABC implementation) when you have a chance.
>>
>> Actually this whole exercise resulted from you asking for the sources,
>> and me tidying them up for release... Give me a day or so.
>>
>> Best wishes,
>>
>> Steven
>>
>> > On 8/15/22 5:39 PM, Steven Pemberton wrote:
>> > > It is now live.
>> > > I haven't yet updated the Unicode character classes though.
>> > >
>> > > Steven
>> > >
>> > > On Monday 15 August 2022 18:07:44 (+02:00), Steven Pemberton wrote:
>> > >
>> > > > A weird thing happened yesterday, quite unexpected (to me): I got
>> > > > ixampl working with Unicode characters.
>> > > > I'd never thought it possible, because ABC has only 8 bit
>> > > > characters, and they are atomic: no bit operations, no conversion
>> > > > functions, and UTF-8 is always described in terms of bit patterns.
>> > > >
>> > > > And then yesterday, I had a brainwave. There are only 256 bytes.
>> > > > 128 of them are ASCII, and they just represent themselves (that's
>> > > > the reason UTF-8 exists).
>> > > >
>> > > > Of the other, non-ASCII characters, each plays a single role in
>> > > > any UTF-8 string:
>> > > >
>> > > > [#C0-#DF] are leading bytes of a 2 byte character,
>> > > > [#E0-#EF] are leading bytes of a 3 byte character,
>> > > > [#F0-#F7] are leading bytes of a 4 byte character,
>> > > > [#80-#BF] are continuation bytes of the multibyte characters,
>> > > > and [#F8-#FF] are illegal.
>> > > >
>> > > > What this meant was that I could make a 256 long byte array,
>> > > > start, where each entry describes that role: 0 for continuations,
>> > > > 1 for ASCII, 2 for leading byte of 2 byte characters and so on
>> > > > for 3 and 4.
>> > > >
>> > > > In ABC the | operator delivers the first n bytes of a string
>> > > >
>> > > >     "dishonest" | 4 = "dish"
>> > > >
>> > > > so to extract the next Unicode character from a string s, all I
>> > > > have to do is
>> > > >
>> > > >     s|start[s|1]
>> > > >
>> > > > Bingo!
>> > > >
>> > > > The new ixml is not online yet: just running the regression tests.
>> > > >
>> > > > Steven
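
For readers who do not know ABC, the table lookup and the "one digit per byte" observation from the thread can be sketched in Python roughly as follows. This is an illustration, not the ixampl source: the names START, next_char, split_chars, digit and codepoint are made up for the example, the entry 0 for the illegal bytes #F8-#FF is an assumption (the thread does not say what the table holds there), and the ValueError guard is an addition, since the original simply assumes the stream is already good. Only comparisons, subtraction and multiplication by 64 are used, no bit operations.

```python
# Role table indexed by byte value, using the ranges given in the thread:
# 0 = continuation byte, 1 = ASCII, 2/3/4 = leading byte of a 2/3/4 byte character.
# Assumption: the illegal bytes #F8-#FF are also left at 0.
START = [0] * 256
for b in range(0x00, 0x80):
    START[b] = 1
for b in range(0xC0, 0xE0):
    START[b] = 2
for b in range(0xE0, 0xF0):
    START[b] = 3
for b in range(0xF0, 0xF8):
    START[b] = 4


def next_char(s):
    """Bytes of the first character of s; the analogue of s|start[s|1] in ABC."""
    return s[:START[s[0]]]


def split_chars(s):
    """Split a UTF-8 byte string into one byte string per character."""
    chars = []
    while s:
        n = START[s[0]]
        if n == 0:
            # Guard added for this sketch only; the original does no such check.
            raise ValueError("string does not start with a leading byte")
        chars.append(s[:n])
        s = s[n:]
    return chars


def digit(b):
    """Position of byte b within its range (its value as a base-64 digit)."""
    if b < 0x80:
        return b          # ASCII represents itself
    if b < 0xC0:
        return b - 0x80   # continuation byte
    if b < 0xE0:
        return b - 0xC0   # leading byte of a 2 byte character
    if b < 0xF0:
        return b - 0xE0   # leading byte of a 3 byte character
    return b - 0xF0       # leading byte of a 4 byte character


def codepoint(ch):
    """Codepoint of a single character, reading its bytes as base-64 digits."""
    value = 0
    for b in ch:
        value = value * 64 + digit(b)
    return value
```

For example, split_chars applied to the UTF-8 bytes of a mixed ASCII and non-ASCII string yields one byte string per character, and codepoint applied to each piece recovers its Unicode codepoint.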
Received on Thursday, 18 August 2022 11:37:00 UTC