Obfuscating CJK Codepoint Requests and Reducing Set Encoding Sizes

I've got an idea that could potentially help with both obfuscating the
characters requested for ideographic scripts and reducing the encoding
cost of code point sets: what if, for those fonts, we bundled low frequency
codepoints into a series of blocks? Client requests would ask for the
blocks instead of specific code points. High frequency characters (say
roughly the first couple thousand or so) could still be requested
individually.

My reasoning is this:

   - The presence of characters that occur frequently enough won't encode
   much, if any, useful information about the contents of a specific page.
   So it's likely safe to request those individually.
   - For low frequency characters, requests for them are now obfuscated
   since all you know is that the client needed at least one of the codepoints
   in a specific block.
   - In our work on segmenting CJK fonts into Unicode ranges we found that
   our low frequency blocks are very rarely downloaded by clients, to the
   point where the cost of downloading them had a negligible impact on the
   overall number of bytes users were downloading. So even though every
   once in a while you may need to grab a whole block of codepoints just to
   get one, over the long term that extra cost is nearly negligible.
   - Unlike with our Unicode range approach, we don't incur overhead with
   small block sizes, so we could use blocks of 10-50 codepoints instead of
   blocks of 250 code points. That should keep the penalty of getting a whole
   block for one character pretty low.
   - With the frequency remapping code point set encoding approach, most of
   the cost of encoding a set is in representing code points in the low
   frequency area. By blocking up that space, the cost of encoding that part
   of a set should come down by roughly a factor of the block size (a 10-50x
   improvement if we're using blocks of 10-50 codepoints).
   - If obfuscation even for high frequency codepoints is desired then we
   could use block sizes which are inversely proportional to the frequency
   (high frequency codepoints use small blocks, low frequency codepoints use
   larger blocks).
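
To make the idea concrete, here's a rough sketch of how a client might turn
the characters it needs into a request. It assumes the client and server
share a frequency-sorted list of the font's codepoints, so each codepoint has
a rank (0 = most frequent). The cutoff of 2000, the block size of 32, and the
function names are placeholders I picked for illustration, not a proposed
format.

```python
# Illustrative values only: ranks below the cutoff are "high frequency".
HIGH_FREQ_CUTOFF = 2000
# Illustrative block size, chosen from the 10-50 range discussed above.
BLOCK_SIZE = 32


def build_request(needed_ranks, cutoff=HIGH_FREQ_CUTOFF, block_size=BLOCK_SIZE):
    """Split needed frequency ranks into individual ranks and block ids.

    High frequency ranks are requested as-is. Low frequency ranks are
    replaced by the id of the fixed-size block that contains them, so the
    request only reveals that at least one member of the block was needed.
    """
    individual = set()
    blocks = set()
    for rank in needed_ranks:
        if rank < cutoff:
            individual.add(rank)
        else:
            blocks.add((rank - cutoff) // block_size)
    return sorted(individual), sorted(blocks)


def block_members(block_id, cutoff=HIGH_FREQ_CUTOFF, block_size=BLOCK_SIZE):
    """Return the range of ranks the server would send for a block id."""
    start = cutoff + block_id * block_size
    return range(start, start + block_size)


if __name__ == "__main__":
    individual, blocks = build_request([5, 1999, 2000, 2031, 2032, 5000])
    print("individual ranks:", individual)
    print("block ids:", blocks)
```

Note that ranks 2000 and 2031 collapse into a single block id, which is where
both the obfuscation and the smaller set encoding come from.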

I'm curious what you think. Does this seem like a reasonable approach?

Received on Monday, 5 August 2019 23:00:29 UTC