- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Tue, 16 Aug 2022 10:54:31 -0600
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: public-ixml@w3.org
Steven Pemberton <steven.pemberton@cwi.nl> writes:
>> And then yesterday, I had a brainwave. There are only 256 bytes. 128 of them are ASCII, and they just represent themselves (that's the reason UTF-8 exists).
>>
>> Of the other non-ASCII characters, they all play a single role in any UTF-8 string:
>>
>> [#C0-#DF] are leading bytes of a 2 byte character
>> [#E0-#EF] are leading bytes of a 3 byte character
>> [#F0-#F7] are leading bytes of a 4 byte character.
>> [#80-#BF] are continuation bytes of the multibyte characters,
>> and [#F8-#FF] are illegal.
>
> I realised under the shower this morning that this means that Unicode
> is context-free.
I think you mean UTF-8.
Actually, since there are a finite number of UCS characters, and only one
legal UTF-8 encoding per character, the language of UTF-8 is presumably
regular.
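One way to make the regularity claim concrete is a recognizer with a fixed, finite set of states. The Python sketch below is only an illustration: the function name is hypothetical, and it uses the byte classes quoted above as they stand, so it is looser than strict UTF-8 (it accepts overlong forms such as #C0 #80, for instance). The only state it keeps is the number of continuation bytes still expected, which is always 0, 1, 2 or 3.

```python
def accepts_loose_utf8(data: bytes) -> bool:
    """Finite-state check using the byte classes quoted above.

    Hypothetical helper, not strict UTF-8 validation: overlong forms
    and surrogate encodings pass, just as they do in the byte classes.
    """
    pending = 0  # the whole state: continuation bytes still expected (0..3)
    for b in data:
        if pending:
            if 0x80 <= b <= 0xBF:      # continuation byte
                pending -= 1
            else:
                return False           # expected a continuation byte
        elif b <= 0x7F:                # ASCII, represents itself
            pending = 0
        elif 0xC0 <= b <= 0xDF:        # leading byte of a 2-byte character
            pending = 1
        elif 0xE0 <= b <= 0xEF:        # leading byte of a 3-byte character
            pending = 2
        elif 0xF0 <= b <= 0xF7:        # leading byte of a 4-byte character
            pending = 3
        else:
            return False               # stray continuation byte, or #F8-#FF
    return pending == 0                # input must not end mid-character
```

Four states and no stack are enough, which is what one would expect of a regular language.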
> Using a byte-oriented input stream, you can describe Unicode as:
> unicode: char*.
> char: ascii;
>       h2, c;
>       h3, c, c;
>       h4, c, c, c.
> ascii: [#0-#7f].
> h2: [#c0-#df].
> h3: [#e0-#ef].
> h4: [#f0-#f7].
> c: [#80-#bf].
Or, in a single expression:
utf8: ( [#0-#7f]
      ; [#c0-#df], [#80-#bf]
      ; [#e0-#ef], [#80-#bf], [#80-#bf]
      ; [#f0-#f7], [#80-#bf], [#80-#bf], [#80-#bf]
      )*.
I think this is a nice example to show that it will often be simpler to
describe regular languages using ixml than using regular expressions,
and the result can often be easier to understand.
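For comparison, here is a rough sketch of the same language written as a regular expression over bytes, using Python's `re` module. The names `LOOSE_UTF8` and `matches_loose_utf8` are invented for this illustration, and the pattern simply transcribes the four alternatives of the grammar, so it shares the grammar's looseness and is not a strict UTF-8 validator.

```python
import re

# One alternative per line, in the same order as the grammar's alternatives.
LOOSE_UTF8 = re.compile(
    rb"(?:[\x00-\x7f]"                 # ASCII
    rb"|[\xc0-\xdf][\x80-\xbf]"        # 2-byte character
    rb"|[\xe0-\xef][\x80-\xbf]{2}"     # 3-byte character
    rb"|[\xf0-\xf7][\x80-\xbf]{3}"     # 4-byte character
    rb")*"
)


def matches_loose_utf8(data: bytes) -> bool:
    # fullmatch requires the pattern to cover the entire input.
    return LOOSE_UTF8.fullmatch(data) is not None
```

Readers can judge for themselves which of the two notations is easier to read.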
Michael
--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Tuesday, 16 August 2022 17:01:35 UTC