- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Tue, 16 Aug 2022 10:54:31 -0600
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: public-ixml@w3.org
Steven Pemberton <steven.pemberton@cwi.nl> writes:

>> And then yesterday, I had a brainwave. There are only 256 bytes. 128
>> of them are ASCII, and they just represent themselves (that's the
>> reason UTF-8 exists).
>>
>> Of the other non-ASCII characters, they all play a single role in any
>> UTF-8 string:
>>
>> [#C0-#DF] are leading bytes of a 2 byte character
>> [#E0-#EF] are leading bytes of a 3 byte character
>> [#F0-#F7] are leading bytes of a 4 byte character.
>> [#80-#BF] are continuation bytes of the multibyte characters,
>> and [#F8-#FF] are illegal.
>
> I realised under the shower this morning that this means that Unicode
> is context-free.

I think you mean UTF-8.

Actually, since there are a finite number of UCS characters, and only
one legal UTF-8 encoding per character, the language of UTF-8 is
presumably regular.

> Using a byte-oriented input stream, you can describe Unicode as:
>
> unicode: char*.
> char: ascii;
>       h2, c;
>       h3, c, c;
>       h4, c, c, c.
> ascii: [#0-#7f].
> h2: [#c0-#df].
> h3: [#e0-#ef].
> h4: [#f0-#f7].
> c: [#80-#bf].

Or, in a single expression:

  utf8: ( [#0-#7f]
        ; [#c0-#df], [#80-#bf]
        ; [#e0-#ef], [#80-#bf], [#80-#bf]
        ; [#f0-#f7], [#80-#bf], [#80-#bf], [#80-#bf]
        )*.

I think this is a nice example to show that it will often be simpler
to describe regular languages using ixml than using regular
expressions, and the result can often be easier to understand.

Michael

--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
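
For comparison, the same byte-class grammar can be sketched as a regular expression over bytes, here in Python. The names UTF8_SHAPE and matches_utf8_shape are invented for this illustration and do not come from the thread. The sketch checks only the lead-byte and continuation-byte shape described in the message, so it accepts some sequences (for example the overlong b"\xc0\x80") that a strict UTF-8 decoder would reject.

```python
import re

# Byte classes copied from the ixml grammar in the message:
#   ascii [#0-#7f], h2 [#c0-#df], h3 [#e0-#ef], h4 [#f0-#f7], c [#80-#bf].
# UTF8_SHAPE is a hypothetical name used only for this sketch.
UTF8_SHAPE = re.compile(
    rb"(?:[\x00-\x7f]"                # ascii
    rb"|[\xc0-\xdf][\x80-\xbf]"       # h2, c
    rb"|[\xe0-\xef][\x80-\xbf]{2}"    # h3, c, c
    rb"|[\xf0-\xf7][\x80-\xbf]{3})*"  # h4, c, c, c
)


def matches_utf8_shape(data: bytes) -> bool:
    """Return True if every byte of data fits the grammar's byte classes."""
    return UTF8_SHAPE.fullmatch(data) is not None


if __name__ == "__main__":
    print(matches_utf8_shape("héllo €".encode("utf-8")))
    print(matches_utf8_shape(b"\x80"))
```

Even in this small case the regular expression needs escapes, non-capturing groups and repetition counts, where the ixml version names each byte class directly.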
Received on Tuesday, 16 August 2022 17:01:35 UTC