Re: Obfuscating CJK Codepoint Requests and Reducing Set Encoding Sizes

I’ll let Garret speak to his idea, but I was thinking of the general frequency-based approach suggested in the last point.

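To make that concrete, here is a minimal sketch of how I read that last point. Everything here is hypothetical (the names, the growth rate, the 50-codepoint cap); it just assumes we already have a codepoint list sorted from most to least frequent:

    # Hypothetical sketch: partition a frequency-ranked codepoint list into
    # blocks whose size grows as frequency drops. With base=1 the most
    # frequent characters effectively remain individually addressable,
    # while rare characters land in blocks of up to max_block codepoints.
    def build_blocks(ranked_cps, base=1.0, growth=1.5, max_block=50):
        blocks, i, size = [], 0, base
        while i < len(ranked_cps):
            n = max(1, min(int(size), max_block))
            blocks.append(ranked_cps[i:i + n])
            i += n
            size = min(size * growth, float(max_block))
        return blocks

Client and server would of course need to agree on the ranked list and the parameters, presumably published alongside the font.
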
> On Aug 5, 2019, at 7:10 PM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
> 
> It sounds reasonable to me, though I wonder how it performs compared to the alternative of requesting a bunch of extra individual characters. Depending on how big the block is, there could be a whole lot of waste if the necessary characters are spread out over many blocks.
> 
> Are you envisioning that a block would contain adjacent code points, or would we be sorting Unicode into blocks of characters that are used frequently together, regardless of the numeric values of their code points? Would all the blocks be the same size?
> 
>> On Aug 5, 2019, at 7:02 PM, Ned Holbrook <ned@apple.com> wrote:
>> 
>> I think this sounds reasonable. [As a side note, it reminds me of the technique of decreasing the accuracy of geolocation based on population in order to protect the confidentiality of individuals.]
>> 
>>> On Aug 5, 2019, at 3:59 PM, Garret Rieger <grieger@google.com> wrote:
>>> 
>>> I've got an idea that could potentially help with both obfuscating requested characters for ideographic scripts and reducing the encoding cost of code point sets: what if, for those fonts, we bundled low-frequency codepoints into a series of blocks? Client requests would ask for blocks instead of specific code points. High-frequency characters (roughly the first couple thousand or so) could still be requested individually.
>>> 
>>> My reasoning is this:
>>> - The presence of characters that occur sufficiently frequently won't encode much, if any, useful information about the contents of a specific page, so it's likely safe to request those individually.
>>> - Requests for low-frequency characters are now obfuscated, since all a server learns is that the client needed at least one of the codepoints in a specific block.
>>> - In our work segmenting CJK fonts into unicode ranges, we found that our low-frequency blocks are very rarely downloaded by clients, to the point where the cost of downloading them had a negligible impact on the overall bytes users downloaded. So even though every once in a while you may need to grab a whole block of codepoints just to get one character, over the long term that extra cost is nearly negligible.
>>> - Unlike with our unicode range approach, we don't incur overhead with small block sizes, so we could use blocks of 10-50 codepoints instead of blocks of 250. That should keep the penalty of fetching a whole block for one character pretty low.
>>> - With the frequency-remapping code point set encoding approach, most of the cost of encoding a set is in representing code points in the low-frequency area. Blocking up that space should bring the cost of encoding sets down by roughly a factor of the block size (a 10-50x improvement with blocks of 10-50 codepoints).
>>> - If obfuscation is desired even for high-frequency codepoints, we could use block sizes inversely proportional to frequency (high-frequency codepoints get small blocks, low-frequency codepoints get larger blocks).
>>> 
>>> I'm curious what you think. Does this seem like a reasonable approach?
>> 
> 
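
For what it's worth, the request side of Garret's scheme seems straightforward. Here is a hedged sketch (again, all names hypothetical) of how a client might split the set of characters it needs into individually requested codepoints and block IDs, given a cp_to_block map derived from a shared blocking like the one above:

    # Hypothetical sketch: frequent codepoints are requested individually;
    # everything else collapses to the IDs of the blocks containing it, so
    # the server only learns "at least one codepoint in this block".
    # needed_cps and frequent_cps are sets of codepoints.
    def encode_request(needed_cps, frequent_cps, cp_to_block):
        singles = needed_cps & frequent_cps
        block_ids = {cp_to_block[cp] for cp in needed_cps - singles}
        return singles, block_ids

The set-encoding saving Garret describes falls out of the same structure: over the low-frequency region, the universe shrinks from one entry per codepoint to one per block, so a bitmap or delta-coded list over that region gets smaller by roughly the block size.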

Received on Tuesday, 6 August 2019 02:17:09 UTC