- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Thu, 18 Aug 2022 12:41:51 -0600
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: public-ixml@w3.org
Steven Pemberton <steven.pemberton@cwi.nl> writes:

>> It is now live.
>> I haven't yet updated the Unicode character classes though.
>
> Well, I'm slowly adding them, with the priority being classes L and Mn
> which are both used in the ixml grammar.
>
> What a pain though! It's as if the Unicode design committee put no
> thought into it at all. For instance c0-ff are all letters EXCEPT
> they've stuck the multiply sign × in the middle, and the divide sign ÷
> somewhere else in the middle.

To be fair, those choices were made long before ISO 10646 and Unicode
existed: those positions are assigned in ISO 8859-1.

> And then the Roman alphabet (in ASCII) has the lowercase letters in
> one range, and the upper case in another. But the Latin range 100-17E
> has them alternating (upper, lower)* EXCEPT at #138 they stick an
> orphaned character, and then at #149 they do it again.
>
> ʭ! (That's the IPA letter for audible teeth gnashing)
>
> Trying to minimise the encoding of this madness I've decided to use a
> trio of sets: ranges, exceptions, and additions.
>
> For instance, the lower end of class L is
>
>    {(192, 705)}, {215; 247}, {170; 181; 186}
>
> and the lower end of Mn is
>
>    {(768, 879); (1155, 1159); (1425, 1469)}, {}, {}
>
> Anybody else had to deal with this nightmare, and have a better encoding?

Whenever I have done anything of this kind I have simply loaded a copy
of some version of the Unicode Character Database and looked. But I
like your model of range checks plus exception checks.

I suppose one could view it as an optimization problem: given a
particular distribution of properties, what formulation as ranges +
subtractions + additions will minimize (a) the overall size of the
representation, or (b) the expected cost of lookup?

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
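
The ranges + exceptions + additions model discussed in the message could be sketched roughly as follows in Python. This is only an illustration, not code from either correspondent: the names `CharClass`, `contains`, and `size` are hypothetical, ranges are assumed to be inclusive and non-overlapping, and the data are just the "lower end" fragments quoted above (so ASCII letters, for instance, are not covered).

```python
from bisect import bisect_right


class CharClass:
    """Hypothetical encoding of a character class as three sets:
    inclusive ranges, exceptions (holes in the ranges), and additions
    (isolated code points outside every range)."""

    def __init__(self, ranges, exceptions=(), additions=()):
        # Assumption: ranges are inclusive (lo, hi) pairs that do not overlap.
        self.ranges = sorted(ranges)
        self.starts = [lo for lo, _ in self.ranges]
        self.exceptions = frozenset(exceptions)
        self.additions = frozenset(additions)

    def contains(self, cp):
        """Return True if code point cp belongs to the class."""
        if cp in self.additions:
            return True
        if cp in self.exceptions:
            return False
        # Binary search for the last range starting at or before cp.
        i = bisect_right(self.starts, cp) - 1
        return i >= 0 and cp <= self.ranges[i][1]

    def size(self):
        """Crude measure of representation size: number of stored entries."""
        return len(self.ranges) + len(self.exceptions) + len(self.additions)


# The fragments quoted in the message (decimal code points).
L_LOW = CharClass([(192, 705)], {215, 247}, {170, 181, 186})
MN_LOW = CharClass([(768, 879), (1155, 1159), (1425, 1469)])
```

With this layout, `size()` gives one possible reading of measure (a), and the lookup in `contains` costs at most two set probes plus one binary search over the range starts, which is one way to think about measure (b).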
Received on Thursday, 18 August 2022 18:49:35 UTC