Re: ixampl goes Unicode

Steven Pemberton <steven.pemberton@cwi.nl> writes:

>> Whenever I have done anything of this kind I have simply loaded a copy
>> of some version of the Unicode Character Database and looked. But I
>> like your model of range checks plus exception checks.

> But with more than 130,000 characters in class L, I am little inclined
> to load a whole database if it can be encoded more frugally.

Understood; when I have done things of this kind it has generally been
in programs to be run once every few months at most, and possibly not
more than once every couple of years.

But from the wording of your mail, one might get the impression that you
are constructing your data structures by hand (I'm not sure, but to me
it sounded that way); I hope that in fact you are extracting it
automatically from the UCD.  (I did construct the definitions of
NameStartChar and NameChar in the XML spec by hand from the printed
version of the Unicode 2.0 spec, since I had no electronic form of the
information; it was a miserable experience I would not recommend to
anyone.  James Clark wisely acquired an electronic copy of the
information and constructed a more reliable set of ranges.)

>> I suppose one could view it as an optimization problem: given a
>> particular distribution of properties, what formulation as ranges +
>> subtractions + additions will minimize
>>
>> (a) the overall size of the representation, or
>> (b) the expected cost of lookup
>
> This is indeed what I am trying to achieve, and wondered if anyone
> else had attempted it before I put the work in myself...

Thinking about it a bit just now, I think you may do better with a
simpler data structure just listing the ranges.  Where the beginning of
class L in the ranges + subtractions + additions data structure is

	{(192, 705)}, {215; 247}, {170; 181; 186}

a ranges-only version would be

        {(170, 170); (181, 181); (186, 186); (192, 214); (216, 246);
        (248, 705)}

Rationale: First, for space reasons, we can't (or don't want to) use a
simple table lookup or a has table for this.


Three linked lists

One possible implementation of the ranges/subtractions/additions idea
would be three sorted lists R, S, and A.  If we assume an equiprobable
random distribution of characters in the input, then finding out whether
the character is in one of the ranges in R will require examining |R|/2
entries in R if C is in some range, and |R| entries if C is not in a
range. Each examination must compare the character being looked up (I'll
call it C) to the min and max of the range, so the expected number of
comparisons is |R| or 2|R|.  If an enclosing range is found, we must
then examine S, with one comparison per list item, so an expected cost
of |S|/2 comparisons if C is present in S and |S| otherwise. Since in a
random case C will be absent more frequently than it's present, the
expected cost will be closer to |S| than to |S|/2.  Total expected cost
for this case:  |R|/2 + |S|.  For the alternative case (no range found,
so A is searched) the expected cost is |R| + |A|.


Three binary trees

Better performance would follow from making R, S, and A balanced binary
trees.  For N nodes, the tree will have a depth of log2(N), a failed
search will go all the way to the leaves and cost O(log2(N)) and a
successful search will go halfway to the leaves and cost O(log2(N)/2) =
O(log2(N)).  (Unless I am mistaken, you can get the same performance
from using three sorted arrays.)

The expected number of comparisons for the first case is then

    2 * log2(|R|)/2 + log2(|S|)
  = log2(|R|) + log2(|S|)

The expected number of comparisons for the second case would be

    2 * log2(|R|) + log2(|A|)

I am assuming here that A and S will each have at least one entry.


Single binary tree of ranges

The alternative I would propose is not to maintain R, S, and A as
separate trees but to maintain just one tree with an expanded set of
ranges.  Call it X.

If we combine R and A, then X will contain every range in R plus one
range for every individual character in A, so |X| = |R| + |A|.  If we
then integrate S into S, every individual character in S will split some
range in R and add exactly one range.  (Possible exception: if two
member of S are adjacent, they won't add two ranges.)

So |X| = |R| + |A| + |S|.

For a successful search (i.e. (C not in R and C in A) or (C in R and C
not in S)), the expected cost will be

    log2(|X|)/2
  = log2(|R| + |A| + |S|)/2
  = log2( (|R| + |A| + |S|) ** 0.5)

compared to

    2 * log2(|R|) + log2(|A|)/2
  = log2(|R|**2) + log2(|A|**0.5)
  = log2(|R|**2 * |A|**0.5)

or

    log2(|R|) + log2(|S|)
  = log2(|R| * |S|)

The three-tree version will be faster than the single-tree version just
in case either

    |R|+|A|+|S| > |R|**4 * |A|

or

    |R|+|A|+|S| > |R|**2 * |S|**2

I think this is arithmetically possible, but unlikely.

For an unsuccessful search ((C in R and C in S) or (C not in R and C not
in A), the expected cost is

    log2(|X|)
  = log2(|R| + |A| + |S|)

compared to

    2*log2(|R|)/2 + log2(|S|)/2
  = log2(|R|) + log2(|S|**0.5)
  = log2(|R| * |S|**0.5)

or

    2*log2(|R|) + log2(|A|)
  = log2(|R|**2 * |A|)

The three-tree version will be faster than the single-tree version just
in case |R|+|A|+|S| is greater than either |R| * sqrt(|S|) or |R|**2 *
|A|.


Weight-balanced trees

In practice, the code points of the Universal Character Set are probably
not equiprobable in your input; if you can estimate the relative
probabilities of the different characters, then you can make what my
book on algorithms and data structures calls a weight-balanced tree
(basic idea: the nodes in the left branch have the same cumulative
probability as the nodes in the right branch).




-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Friday, 19 August 2022 00:24:39 UTC