- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Thu, 18 Aug 2022 16:40:48 -0600
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: public-ixml@w3.org

Steven Pemberton <steven.pemberton@cwi.nl> writes: >> Whenever I have done anything of this kind I have simply loaded a copy >> of some version of the Unicode Character Database and looked. But I >> like your model of range checks plus exception checks. > But with more than 130,000 characters in class L, I am little inclined > to load a whole database if it can be encoded more frugally. Understood; when I have done things of this kind it has generally been in programs to be run once every few months at most, and possibly not more than once every couple of years. But from the wording of your mail, one might get the impression that you are constructing your data structures by hand (I'm not sure, but to me it sounded that way); I hope that in fact you are extracting it automatically from the UCD. (I did construct the definitions of NameStartChar and NameChar in the XML spec by hand from the printed version of the Unicode 2.0 spec, since I had no electronic form of the information; it was a miserable experience I would not recommend to anyone. James Clark wisely acquired an electronic copy of the information and constructed a more reliable set of ranges.) >> I suppose one could view it as an optimization problem: given a >> particular distribution of properties, what formulation as ranges + >> subtractions + additions will minimize >> >> (a) the overall size of the representation, or >> (b) the expected cost of lookup > > This is indeed what I am trying to achieve, and wondered if anyone > else had attempted it before I put the work in myself... Thinking about it a bit just now, I think you may do better with a simpler data structure just listing the ranges. Where the beginning of class L in the ranges + subtractions + additions data structure is {(192, 705)}, {215; 247}, {170; 181; 186} a ranges-only version would be {(170, 170); (181, 181); (186, 186); (192, 214); (216, 246); (248, 705)} Rationale: First, for space reasons, we can't (or don't want to) use a simple table lookup or a has table for this. Three linked lists One possible implementation of the ranges/subtractions/additions idea would be three sorted lists R, S, and A. If we assume an equiprobable random distribution of characters in the input, then finding out whether the character is in one of the ranges in R will require examining |R|/2 entries in R if C is in some range, and |R| entries if C is not in a range. Each examination must compare the character being looked up (I'll call it C) to the min and max of the range, so the expected number of comparisons is |R| or 2|R|. If an enclosing range is found, we must then examine S, with one comparison per list item, so an expected cost of |S|/2 comparisons if C is present in S and |S| otherwise. Since in a random case C will be absent more frequently than it's present, the expected cost will be closer to |S| than to |S|/2. Total expected cost for this case: |R|/2 + |S|. For the alternative case (no range found, so A is searched) the expected cost is |R| + |A|. Three binary trees Better performance would follow from making R, S, and A balanced binary trees. For N nodes, the tree will have a depth of log2(N), a failed search will go all the way to the leaves and cost O(log2(N)) and a successful search will go halfway to the leaves and cost O(log2(N)/2) = O(log2(N)). (Unless I am mistaken, you can get the same performance from using three sorted arrays.) The expected number of comparisons for the first case is then 2 * log2(|R|)/2 + log2(|S|) = log2(|R|) + log2(|S|) The expected number of comparisons for the second case would be 2 * log2(|R|) + log2(|A|) I am assuming here that A and S will each have at least one entry. Single binary tree of ranges The alternative I would propose is not to maintain R, S, and A as separate trees but to maintain just one tree with an expanded set of ranges. Call it X. If we combine R and A, then X will contain every range in R plus one range for every individual character in A, so |X| = |R| + |A|. If we then integrate S into S, every individual character in S will split some range in R and add exactly one range. (Possible exception: if two member of S are adjacent, they won't add two ranges.) So |X| = |R| + |A| + |S|. For a successful search (i.e. (C not in R and C in A) or (C in R and C not in S)), the expected cost will be log2(|X|)/2 = log2(|R| + |A| + |S|)/2 = log2( (|R| + |A| + |S|) ** 0.5) compared to 2 * log2(|R|) + log2(|A|)/2 = log2(|R|**2) + log2(|A|**0.5) = log2(|R|**2 * |A|**0.5) or log2(|R|) + log2(|S|) = log2(|R| * |S|) The three-tree version will be faster than the single-tree version just in case either |R|+|A|+|S| > |R|**4 * |A| or |R|+|A|+|S| > |R|**2 * |S|**2 I think this is arithmetically possible, but unlikely. For an unsuccessful search ((C in R and C in S) or (C not in R and C not in A), the expected cost is log2(|X|) = log2(|R| + |A| + |S|) compared to 2*log2(|R|)/2 + log2(|S|)/2 = log2(|R|) + log2(|S|**0.5) = log2(|R| * |S|**0.5) or 2*log2(|R|) + log2(|A|) = log2(|R|**2 * |A|) The three-tree version will be faster than the single-tree version just in case |R|+|A|+|S| is greater than either |R| * sqrt(|S|) or |R|**2 * |A|. Weight-balanced trees In practice, the code points of the Universal Character Set are probably not equiprobable in your input; if you can estimate the relative probabilities of the different characters, then you can make what my book on algorithms and data structures calls a weight-balanced tree (basic idea: the nodes in the left branch have the same cumulative probability as the nodes in the right branch). -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com

Received on Friday, 19 August 2022 00:24:39 UTC