Re: ixampl goes Unicode from Steven Pemberton on 2022-08-18 (public-ixml@w3.org from August 2022)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Thu, 18 Aug 2022 18:12:59 +0000
To: ixml <public-ixml@w3.org>
Message-Id: <1660841415533.2827106937.3009567804@cwi.nl>

> It is now live.
> I haven't yet updated the Unicode character classes though.
Well, I'm slowly adding them, giving priority to classes L and Mn, which are both used in the ixml grammar.

What a pain though! It's as if the Unicode design committee put no thought into it at all. For instance c0-ff are all letters EXCEPT they've stuck the multiply sign × in the middle, and the divide sign ÷ somewhere else in the middle.

And then the Roman alphabet (in ASCII) has the lowercase letters in one range, and the uppercase in another. But the Latin range #100-#17E has them alternating (upper, lower)* EXCEPT at #138 they stick an orphaned character, and then at #149 they do it again.

ʭ! (That's the IPA letter for audible teeth gnashing)

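Trying to minimise the encoding of this madness I've decided to use a trio of sets: ranges, exceptions (code points inside a range that don't belong to the class), and additions (isolated code points outside the ranges that do).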

For instance, the lower end of class L is

 {(192, 705)}, {215; 247}, {170; 181; 186}

and the lower end of Mn is

 {(768, 879); (1155, 1159); (1425, 1469)}, {}, {}
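
For anyone wanting to play with the idea, here is a minimal Python sketch of how a membership test over such a trio might look. It is not the ixampl code; the names CharClass, in_class, L_LOW and MN_LOW are made up for illustration, and the data is only the two fragments quoted above (in decimal, so 215 and 247 are the × and ÷ at #D7 and #F7).

```python
from typing import FrozenSet, NamedTuple, Tuple


class CharClass(NamedTuple):
    """Hypothetical container for the (ranges, exceptions, additions) trio."""
    ranges: Tuple[Tuple[int, int], ...]   # inclusive (low, high) code point pairs
    exceptions: FrozenSet[int]            # code points inside a range that are NOT in the class
    additions: FrozenSet[int]             # isolated code points outside the ranges that ARE in the class


def in_class(cp: int, cls: CharClass) -> bool:
    """Return True if code point cp belongs to the class described by cls."""
    if cp in cls.additions:
        return True
    if cp in cls.exceptions:
        return False
    return any(low <= cp <= high for low, high in cls.ranges)


# The two fragments from the message above, nothing more.
L_LOW = CharClass(
    ranges=((192, 705),),
    exceptions=frozenset({215, 247}),
    additions=frozenset({170, 181, 186}),
)
MN_LOW = CharClass(
    ranges=((768, 879), (1155, 1159), (1425, 1469)),
    exceptions=frozenset(),
    additions=frozenset(),
)

if __name__ == "__main__":
    for ch in "é×µ":
        print(repr(ch), ord(ch), in_class(ord(ch), L_LOW))
```

A linear scan over the ranges is enough for a handful of entries; with the full tables a binary search over sorted range starts would presumably be the next step.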


Anybody else had to deal with this nightmare, and have a better encoding?

Steven

Received on Thursday, 18 August 2022 18:13:21 UTC