Re: ixampl goes Unicode from C. M. Sperberg-McQueen on 2022-08-16 (public-ixml@w3.org from August 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Tue, 16 Aug 2022 10:54:31 -0600
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: public-ixml@w3.org
Message-ID: <87k078dsx1.fsf@blackmesatech.com>

Steven Pemberton <steven.pemberton@cwi.nl> writes:

>> And then yesterday, I had a brainwave. There are only 256 bytes. 128 of them are ASCII, and they just represent themselves (that's the reason UTF-8 exists).
>> 
>> Of the other non-ASCII characters, they all play a single role in any UTF-8 string:
>> 
>> [#C0-#DF] are leading bytes of a 2 byte character
>> [#E0-#EF] are leading bytes of a 3 byte character
>> [#F0-#F7] are leading bytes of a 4 byte character.
>> [#80-#BF] are continuation bytes of the multibyte characters,
>> and [#F8-#FF] are illegal.
>
> I realised under the shower this morning that this means that Unicode
> is context-free.

I think you mean UTF-8.

Actually, since there are a finite number of UCS characters, and only one
legal UTF-8 encoding per character, the language of UTF-8 is presumably
regular.
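
One way to make the regularity concrete is a small finite-state recognizer
over the byte classes quoted above. The following Python is only a sketch
under that assumption: the function name `accepts` is made up for
illustration, and the only state it keeps is how many continuation bytes
are still expected (0 to 3). Like the byte classes themselves, it is looser
than strict UTF-8; for example, it accepts overlong forms such as #C0 #80.

```python
def accepts(data: bytes) -> bool:
    """Recognize the byte-class pattern with a four-state automaton.

    The state is the number of continuation bytes still expected (0..3).
    This checks only the lead/continuation structure, not strict UTF-8.
    """
    pending = 0
    for b in data:
        if pending:
            # Inside a multibyte character: only [#80-#BF] may follow.
            if 0x80 <= b <= 0xBF:
                pending -= 1
            else:
                return False
        elif b <= 0x7F:
            pending = 0          # ASCII byte, a complete character
        elif 0xC0 <= b <= 0xDF:
            pending = 1          # lead byte of a 2-byte character
        elif 0xE0 <= b <= 0xEF:
            pending = 2          # lead byte of a 3-byte character
        elif 0xF0 <= b <= 0xF7:
            pending = 3          # lead byte of a 4-byte character
        else:
            return False         # stray continuation byte, or [#F8-#FF]
    # Accept only if no character was left unfinished.
    return pending == 0
```

Because the state never needs more than those four values, no stack or
counter beyond a fixed bound is required, which is what one would expect
of a regular language.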

> Using a byte-oriented input stream, you can describe Unicode as:

> 	unicode: char*.
> 	char: ascii; 
> 	      h2, c;
> 	      h3, c, c;
> 	      h4, c, c, c.
> 	ascii: [#0-#7f].
> 	h2: [#c0-#df].
> 	h3: [#e0-#ef].
> 	h4: [#f0-#f7].
> 	c:  [#80-#bf].

Or, in a single expression:

  utf8: ( [#0-#7f]
        ; [#c0-#df], [#80-#bf]
        ; [#e0-#ef], [#80-#bf], [#80-#bf]
        ; [#f0-#f7], [#80-#bf], [#80-#bf], [#80-#bf]
        )*.

I think this is a nice example to show that it will often be simpler to
describe regular languages using ixml than using regular expressions,
and the result can often be easier to understand.
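
For comparison, here is roughly how the same expression might look as a
conventional regular expression, written as a sketch with Python's `re`
module over a bytes pattern. The names `UTF8_LOOSE` and `matches` are
invented for this illustration. It mirrors the ixml rule alternative by
alternative, so it inherits the same looseness: it checks only the
lead-byte and continuation-byte structure, and would still accept overlong
forms, surrogates, and code points above U+10FFFF.

```python
import re

# One alternative per line, in the same order as the ixml rule.
UTF8_LOOSE = re.compile(
    rb"(?:[\x00-\x7f]"                 # ASCII
    rb"|[\xc0-\xdf][\x80-\xbf]"        # 2-byte character
    rb"|[\xe0-\xef][\x80-\xbf]{2}"     # 3-byte character
    rb"|[\xf0-\xf7][\x80-\xbf]{3}"     # 4-byte character
    rb")*"
)


def matches(data: bytes) -> bool:
    """Return True if the whole byte string fits the pattern."""
    return UTF8_LOOSE.fullmatch(data) is not None
```

Something like `matches("caf\u00e9".encode("utf-8"))` should succeed,
while a lone continuation byte such as `matches(b"\x80")` should not.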

Michael



-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Tuesday, 16 August 2022 17:01:35 UTC