Re: ixampl goes Unicode

Steven Pemberton <steven.pemberton@cwi.nl> writes:

>> It is now live.
>> I haven't yet updated the Unicode character classes though.

> Well, I'm slowly adding them, with the priority being classes L and Mn
> which are both used in the ixml grammar.

> What a pain though! It's as if the Unicode design committee put no
> thought into it at all. For instance c0-ff are all letters EXCEPT
> they've stuck the multiply sign × in the middle, and the divide sign ÷
> somewhere else in the middle.

To be fair, those choices were made long before ISO 10646 and Unicode
existed:  those positions are assigned in ISO 8859-1.

> And then the Roman alphabet (in ASCII) has the lowercase letters in
> one range, and the upper case in another. But the Latin range 100-17E
> has them alternating (upper, lower)* EXCEPT at #138 they stick an
> orphaned character, and then at #149 they do it again.
>
> ʭ! (That's the IPA letter for audible teeth gnashing)
>
> Trying to minimise the encoding of this madness I've decided to use a
> trio of sets: ranges, exceptions, and additions.
>
> For instance, the lower end of class L is
>
>  {(192, 705)}, {215; 247}, {170; 181; 186}
>
> and the lower end of Mn is
>
>  {(768, 879); (1155, 1159); (1425, 1469)}, {}, {}
>
>
> Anybody else had to deal with this nightmare, and have a better encoding?

Whenever I have done anything of this kind I have simply loaded a copy
of some version of the Unicode Character Database and looked.  But I
like your model of range checks plus exception checks.
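
For concreteness, here is a minimal Python sketch of how a lookup against
such a trio might work. The class name CharClass and the names L_LOW and
MN_LOW are hypothetical; the numbers are copied from the quoted message
and not checked against the Unicode Character Database, and they cover
only the "lower end" of each class (so ASCII letters, for instance, are
not in this fragment). Binary search over the ranges is one possible
choice of lookup, not necessarily what ixampl does.

```python
from bisect import bisect_right


class CharClass:
    """A character class stored as ranges, exceptions, and additions."""

    def __init__(self, ranges, exceptions=(), additions=()):
        # Inclusive (lo, hi) pairs, sorted so they can be binary-searched.
        self.ranges = sorted(ranges)
        self.starts = [lo for lo, _ in self.ranges]
        # Code points inside some range that are NOT members.
        self.exceptions = frozenset(exceptions)
        # Isolated code points outside every range that ARE members.
        self.additions = frozenset(additions)

    def __contains__(self, cp):
        if cp in self.additions:
            return True
        if cp in self.exceptions:
            return False
        # Find the last range whose start is <= cp, then check its end.
        i = bisect_right(self.starts, cp) - 1
        return i >= 0 and cp <= self.ranges[i][1]


# Data as given in the quoted message (decimal code points).
L_LOW = CharClass([(192, 705)], exceptions={215, 247},
                  additions={170, 181, 186})
MN_LOW = CharClass([(768, 879), (1155, 1159), (1425, 1469)])

print(ord("É") in L_LOW)   # 201 lies inside (192, 705)
print(ord("×") in L_LOW)   # 215 is listed as an exception
print(0x0301 in MN_LOW)    # 769 lies inside (768, 879)
```

With the additions and exceptions held in sets, each test is two hash
lookups plus one binary search over the ranges.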

I suppose one could view it as an optimization problem:  given a
particular distribution of properties, what formulation as ranges +
subtractions + additions will minimize

  (a) the overall size of the representation, or
  (b) the expected cost of lookup

?
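
As a rough illustration of measure (a), the following Python sketch
builds a trio from a plain set of code points with a simple greedy rule
and counts the numbers stored. Everything here is an assumption made for
the example: the function names (runs, encode, size), the cost model
(two numbers per range, one per exception or addition), and the merge
rule (bridge a gap of at most max_gap code points). It is a heuristic,
not an optimal solver, and it does not address measure (b).

```python
def runs(cps):
    """Maximal runs of consecutive code points, as inclusive (lo, hi) pairs."""
    out = []
    for cp in sorted(set(cps)):
        if out and cp == out[-1][1] + 1:
            out[-1] = (out[-1][0], cp)
        else:
            out.append((cp, cp))
    return out


def encode(cps, max_gap=1):
    """Greedy trio: bridge small gaps with exceptions, turn
    one-element ranges into additions."""
    merged, exceptions, additions = [], [], []
    for lo, hi in runs(cps):
        if merged and lo - merged[-1][1] - 1 <= max_gap:
            # Extend the previous range over the gap; the skipped
            # code points become exceptions.
            prev_lo, prev_hi = merged[-1]
            exceptions.extend(range(prev_hi + 1, lo))
            merged[-1] = (prev_lo, hi)
        else:
            merged.append((lo, hi))
    ranges = []
    for lo, hi in merged:
        if lo == hi:
            additions.append(lo)   # one number instead of two
        else:
            ranges.append((lo, hi))
    return ranges, exceptions, additions


def size(ranges, exceptions, additions):
    """Count of numbers stored under the assumed cost model."""
    return 2 * len(ranges) + len(exceptions) + len(additions)


# Rebuild the member set described by the quoted lower end of class L.
members = (set(range(192, 706)) - {215, 247}) | {170, 181, 186}

trio = encode(members)
print(trio)
print(size(*trio), "versus", size(runs(members), [], []), "for plain runs")
```

On this input the greedy rule reproduces the trio from the quoted
message, storing 7 numbers where plain runs would need 12. It can do
worse than necessary on other inputs (two singletons one code point
apart, for example, become a range plus an exception), so a real
minimizer would need to compare alternatives rather than merge greedily.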


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Thursday, 18 August 2022 18:49:35 UTC