- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Thu, 18 Aug 2022 18:12:59 +0000
- To: ixml <public-ixml@w3.org>
> It is now live.
> I haven't yet updated the Unicode character classes though.

Well, I'm slowly adding them, with the priority being classes L and Mn, which are both used in the ixml grammar.

What a pain though! It's as if the Unicode design committee put no thought into it at all.

For instance, c0-ff are all letters EXCEPT they've stuck the multiply sign × in the middle, and the divide sign ÷ somewhere else in the middle.

And then the Roman alphabet (in ASCII) has the lowercase letters in one range, and the upper case in another. But the Latin range 100-17E has them alternating (upper, lower)* EXCEPT at #138 they stick an orphaned character, and then at #149 they do it again.

ʭ! (That's the IPA letter for audible teeth gnashing)

Trying to minimise the encoding of this madness, I've decided to use a trio of sets: ranges, exceptions, and additions.

For instance, the lower end of class L is

    {(192, 705)}, {215; 247}, {170; 181; 186}

and the lower end of Mn is

    {(768, 879); (1155, 1159); (1425, 1469)}, {}, {}

Anybody else had to deal with this nightmare, and have a better encoding?

Steven
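
The ranges/exceptions/additions trio described above could be queried roughly as follows. This is only a sketch in Python, which is an assumption, since the message does not say what language the implementation uses. The names `in_class`, `L_LOWER` and `MN_LOWER` are hypothetical, and the data are just the excerpts quoted in the message, not complete Unicode classes.

```python
from bisect import bisect_right

# Each class is a trio: (sorted inclusive ranges, exceptions, additions).
# Values are the "lower end" excerpts quoted in the message, nothing more.
L_LOWER = ([(192, 705)], {215, 247}, {170, 181, 186})
MN_LOWER = ([(768, 879), (1155, 1159), (1425, 1469)], set(), set())


def in_class(cp, char_class):
    """Return True if code point cp belongs to the class described by the trio."""
    ranges, exceptions, additions = char_class
    if cp in additions:      # isolated members outside any range
        return True
    if cp in exceptions:     # holes punched in a range (e.g. the × and ÷ signs)
        return False
    # Binary search for the last range whose start is <= cp.
    i = bisect_right(ranges, (cp, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]
```

For example, `in_class(0xC0, L_LOWER)` is true (À falls in the range), `in_class(215, L_LOWER)` is false (× is an exception), and `in_class(170, L_LOWER)` is true (ª is an addition). Checking additions and exceptions first keeps the range list short, which seems to be the point of the encoding.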
Received on Thursday, 18 August 2022 18:13:21 UTC