Re: ixampl goes Unicode from C. M. Sperberg-McQueen on 2022-08-19 (public-ixml@w3.org from August 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Thu, 18 Aug 2022 18:28:17 -0600
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: public-ixml@w3.org
Message-ID: <87k075cb7p.fsf@blackmesatech.com>

Steven Pemberton <steven.pemberton@cwi.nl> writes:

>> Whenever I have done anything of this kind I have simply loaded a copy
>> of some version of the Unicode Character Database and looked. But I
>> like your model of range checks plus exception checks.
> But with more than 130,000 characters in class L, I am little inclined
> to load a whole database if it can be encoded more frugally.
>
>> I suppose one could view it as an optimization problem: given a
>> particular distribution of properties, what formulation as ranges +
>> subtractions + additions will minimize
>>
>> (a) the overall size of the representation, or
>> (b) the expected cost of lookup
>
> This is indeed what I am trying to achieve, and wondered if anyone
> else had attempted it before I put the work in myself...

Hmm.  Wait a second ...

I did do some work a while back on extracting lists of ranges and
characters for different character classes.  I did not use the Unicode
character tables, relying instead on the support for regex matching
based on character classes built into XQuery engines.  I also did not
use the ranges + subtractions + additions mechanism you are working
with, but what I did could be used for a ranges + additions approach, if
you're not persuaded by my earlier mail that a ranges-only approach will
produce faster results, since log2(a) + log2(b) = log2(a * b), which is
almost certain to be more than log2(a + b).
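
To make the arithmetic concrete, here is a minimal Python sketch. It
assumes balanced binary search, so that a lookup over n entries costs
about log2(n) comparisons, and it assumes the worst case in which a
miss in the range tree forces a second search in the additions tree.
The function names and the sample counts are illustrative only; they
are not measurements of any real character class.

```python
import math


def cost_two_trees(a: int, b: int) -> float:
    """Approximate comparisons when a tree of `a` ranges is searched
    first and a separate tree of `b` additions is searched on a miss."""
    return math.log2(a) + math.log2(b)  # equal to log2(a * b)


def cost_one_tree(a: int, b: int) -> float:
    """Approximate comparisons when ranges and additions share one tree,
    each addition being stored as a one-character range."""
    return math.log2(a + b)


if __name__ == "__main__":
    a, b = 400, 250  # hypothetical counts of ranges and additions
    print(f"two trees: {cost_two_trees(a, b):.2f} comparisons")
    print(f"one tree:  {cost_one_tree(a, b):.2f} comparisons")
```

The two costs coincide only when a = b = 2; for larger counts a * b
exceeds a + b, so the single tree needs fewer comparisons.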

See [1] for an XQuery module that generates an XML document with
information about selected classes, and [2] for its output on a
selection of eight such classes (the ones I needed when I wrote it, I
guess).

[1] https://github.com/cmsmcq/gingersnap/blob/main/src/class-codes-to-ranges.xq
[2] https://github.com/cmsmcq/gingersnap/blob/main/src/unicode-classes.xml

As you can see if you look at the XML, for each class specified the
query returns an XML inclusion element listing the class and giving its
members first in the form of a list of ranges (itself in the form of a
flat sequence of integers) and then in the form of a sequence of
'literal' and 'range' elements.  (The inclusion element would be valid
against the then applicable schema for ixml grammars; maybe I was
generating these for inclusion in a grammar, or maybe it just seemed a
reasonable notation.)

From this representation you could if you wished generate either a
binary tree of ranges or separate binary trees for ranges and
additions.  (No clear representation here for subtractions.)
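
As a rough illustration of the ranges-only lookup, the following
Python sketch runs a binary search over a flat sequence of integers.
It assumes the flat sequence alternates inclusive start and end code
points, with a single character written as a range whose start equals
its end; that layout is an assumption of the sketch and should be
checked against the actual file.  The sample data is a small
hand-written list covering class L within Latin-1 only, not data taken
from the linked XML, and the names FLAT, STARTS, ENDS and in_class are
hypothetical.

```python
from bisect import bisect_right

# Hand-written sample: class L truncated to U+0000..U+00FF.
# Assumed layout: start, end, start, end, ... (inclusive, sorted).
FLAT = [
    0x41, 0x5A,  # A-Z
    0x61, 0x7A,  # a-z
    0xAA, 0xAA,  # feminine ordinal indicator
    0xB5, 0xB5,  # micro sign
    0xBA, 0xBA,  # masculine ordinal indicator
    0xC0, 0xD6,  # skips U+00D7 MULTIPLICATION SIGN
    0xD8, 0xF6,  # skips U+00F7 DIVISION SIGN
    0xF8, 0xFF,
]
STARTS = FLAT[0::2]
ENDS = FLAT[1::2]


def in_class(cp: int) -> bool:
    """Return True if code point `cp` falls inside one of the ranges."""
    # Index of the last range whose start is <= cp, or -1 if none.
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= ENDS[i]


if __name__ == "__main__":
    for ch in "A[\u00aa\u00d7\u00e9":
        print(f"U+{ord(ch):04X}", in_class(ord(ch)))
```

Searching the sorted start points of an array is equivalent to walking
a balanced binary tree of ranges, and it avoids building explicit tree
nodes.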

If this helps, you are welcome to use it.

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Friday, 19 August 2022 00:46:04 UTC