- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Thu, 18 Aug 2022 18:28:17 -0600
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: public-ixml@w3.org
Steven Pemberton <steven.pemberton@cwi.nl> writes:

>> Whenever I have done anything of this kind I have simply loaded a copy
>> of some version of the Unicode Character Database and looked. But I
>> like your model of range checks plus exception checks.

> But with more than 130,000 characters in class L, I am little inclined
> to load a whole database if it can be encoded more frugally.
>
>> I suppose one could view it as an optimization problem: given a
>> particular distribution of properties, what formulation as ranges +
>> subtractions + additions will minimize
>>
>> (a) the overall size of the representation, or
>> (b) the expected cost of lookup
>
> This is indeed what I am trying to achieve, and wondered if anyone
> else had attempted it before I put the work in myself...

Hmm. Wait a second ...

I did do some work a while back on extracting lists of ranges and
characters for different character classes. I did not use the Unicode
character tables, relying instead on the support for regex matching
based on character classes built into XQuery engines.

I also did not use the ranges + subtractions + additions mechanism you
are working with, but what I did could be used for a ranges + additions
approach, if you're not persuaded by my earlier mail that a ranges-only
approach will produce faster results, since log2(a) + log2(b) =
log2(a * b), which is almost certain to be more than log2(a + b).

See [1] for an XQuery module that generates an XML document with
information about selected classes, and [2] for its output on a
selection of eight such classes (the ones I needed when I wrote it, I
guess).
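To make the lookup-cost comparison concrete, here is a minimal Python
sketch. It is not taken from the XQuery module in [1]. The data is a toy
fragment chosen for illustration, not extracted from the Unicode
Character Database, and the names RANGES, ADDITIONS, in_ranges,
in_ranges_plus_additions, and merged are all hypothetical. It assumes
the ranges are stored as a flat, sorted sequence of inclusive lo/hi
integers and that lookup is done by binary search.

```python
from bisect import bisect_left
from math import log2

# Toy data for illustration only (assumed, not extracted from the UCD).
RANGES = [0x41, 0x5A, 0x61, 0x7A, 0xC0, 0xD6]  # flat: lo, hi, lo, hi, ...
ADDITIONS = [0xAA, 0xB5, 0xBA]                 # sorted single code points


def in_ranges(flat, cp):
    """Binary search over a flat, sorted sequence of inclusive lo/hi pairs."""
    lo, hi = 0, len(flat) // 2 - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if cp < flat[2 * mid]:
            hi = mid - 1
        elif cp > flat[2 * mid + 1]:
            lo = mid + 1
        else:
            return True
    return False


def in_ranges_plus_additions(flat, additions, cp):
    """Ranges + additions: two separate binary searches."""
    if in_ranges(flat, cp):
        return True
    j = bisect_left(additions, cp)
    return j < len(additions) and additions[j] == cp


def merged(flat, additions):
    """Ranges only: fold each addition in as a one-element range."""
    pairs = list(zip(flat[0::2], flat[1::2])) + [(c, c) for c in additions]
    return [n for pair in sorted(pairs) for n in pair]


ALL_RANGES = merged(RANGES, ADDITIONS)

# Rough search depth for a ranges and b additions, as in the text above.
a, b = len(RANGES) // 2, len(ADDITIONS)
two_searches = log2(a) + log2(b)   # equals log2(a * b)
one_search = log2(a + b)
```

Both formulations accept the same code points; only the number of
searches differs. A miss in the ranges + additions version pays for
both searches, which is where the log2(a) + log2(b) figure comes from.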
[1] https://github.com/cmsmcq/gingersnap/blob/main/src/class-codes-to-ranges.xq
[2] https://github.com/cmsmcq/gingersnap/blob/main/src/unicode-classes.xml

As you can see if you look at the XML, for each class specified the
query returns an XML inclusion element listing the class and giving its
members first in the form of a list of ranges (itself in the form of a
flat sequence of integers) and then in the form of a sequence of
'literal' and 'range' elements. (The inclusion element would be valid
against the then-applicable schema for ixml grammars; maybe I was
generating these for inclusion in a grammar, or maybe it just seemed a
reasonable notation.)

From this representation you could, if you wished, generate either a
binary tree of ranges or separate binary trees for ranges and
additions. (No clear representation here for subtractions.)

If this helps, you are welcome to use it.

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Friday, 19 August 2022 00:46:04 UTC