Re: Obfuscating CJK Codepoint Requests and Reducing Set Encoding Sizes

For our work on splitting CJK fonts into unicode ranges we experimented
with two different approaches (sketched below):

   - For the low frequency blocks, keep codepoints that are adjacent by
   value together in the same block;
   - Or sort all low frequency codepoints by frequency and then group them
   into blocks.
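
A minimal Python sketch of the two strategies (the frequency threshold and
block size here are illustrative placeholders, not the values from our
actual experiments):

    LOW_FREQ_THRESHOLD = 1e-6
    BLOCK_SIZE = 250

    def chunk(codepoints, size=BLOCK_SIZE):
        return [codepoints[i:i + size]
                for i in range(0, len(codepoints), size)]

    def blocks_by_value(freq):
        """Strategy 1: keep low frequency codepoints adjacent by value."""
        low = sorted(cp for cp, f in freq.items() if f < LOW_FREQ_THRESHOLD)
        return chunk(low)

    def blocks_by_frequency(freq):
        """Strategy 2: sort low frequency codepoints by frequency, then chunk."""
        low = sorted((cp for cp, f in freq.items() if f < LOW_FREQ_THRESHOLD),
                     key=lambda cp: freq[cp], reverse=True)
        return chunk(low)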

What we found was that there wasn't any major difference in the rate at
which the low frequency blocks were requested under either strategy, and
the first option only won out in overall size because its more compact
unicode ranges produced smaller CSS. We haven't yet tried a strategy where
co-occurring characters are grouped into the same blocks. I predict there's
likely some value in doing that.

For our usage in PFE, the compactness of the ranges probably isn't nearly
as important, so I could see either option working well. We have bigram
frequency data for CJK usage across the web, so I could also attempt to use
that to define a grouping based around co-occurring characters. Ultimately,
I think we'll need to test the different blocking strategies as part of the
page view analysis.
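
As a rough sketch of what a co-occurrence grouping could look like (the
bigram data shape and the greedy approach here are assumptions on my part;
a real implementation would want something smarter than this quadratic
loop):

    from collections import defaultdict

    def blocks_by_cooccurrence(bigram_counts, codepoints, block_size=50):
        """Greedily group codepoints that frequently appear together.

        bigram_counts: dict mapping (cp_a, cp_b) pairs to observed counts.
        """
        affinity = defaultdict(lambda: defaultdict(int))
        for (a, b), n in bigram_counts.items():
            affinity[a][b] += n
            affinity[b][a] += n

        unassigned = set(codepoints)
        blocks = []
        while unassigned:
            block = [unassigned.pop()]  # seed with an arbitrary codepoint
            while len(block) < block_size and unassigned:
                # Pull in the unassigned codepoint most strongly tied to
                # the codepoints already in this block.
                best = max(unassigned,
                           key=lambda cp: sum(affinity[cp][m] for m in block))
                block.append(best)
                unassigned.remove(best)
            blocks.append(block)
        return blocks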

On Mon, Aug 5, 2019 at 7:16 PM Ned Holbrook <ned@apple.com> wrote:

> I’ll let Garret speak to his idea, but I was thinking of the general
> frequency-based approach suggested in the last point.
>
> On Aug 5, 2019, at 7:10 PM, Myles C. Maxfield <mmaxfield@apple.com> wrote:
>
> It sounds reasonable to me, though I wonder how it performs compared to
> the alternative of requesting a bunch of extra individual characters.
> Depending on how big the block is, there could be a whole lot of waste if
> the necessary characters are spread out over many blocks.
>
> Are you envisioning a block would contain adjacent code points, or would
> we be sorting Unicode into blocks of characters which are used frequently
> together, regardless of the numeric values of their code points? Would all
> the blocks be the same size?
>
> On Aug 5, 2019, at 7:02 PM, Ned Holbrook <ned@apple.com> wrote:
>
> I think this sounds reasonable. [As a side note, it reminds me of the
> technique of decreasing the accuracy of geolocation with population in
> order to maximize the confidentiality of individuals.]
>
> On Aug 5, 2019, at 3:59 PM, Garret Rieger <grieger@google.com> wrote:
>
> I've got an idea that could potentially help with both obfuscating
> requested characters for ideographic scripts and reducing the encoding
> costs for code point sets: what if, for those fonts, we bundled low
> frequency codepoints into a series of blocks? Client requests would ask
> for the blocks instead of specific code points. High frequency characters
> (say, roughly the first couple thousand or so) could still be requested
> individually.
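>
> A rough sketch of the client-side mapping I have in mind (the names and
> the block table format here are hypothetical):
>
>     def codepoints_to_requests(needed, high_freq, block_of):
>         """Split needed codepoints into individual and block requests.
>
>         high_freq: set of codepoints safe to request individually.
>         block_of: maps every other codepoint to the id of its block.
>         """
>         individual = {cp for cp in needed if cp in high_freq}
>         blocks = {block_of[cp] for cp in needed if cp not in high_freq}
>         return individual, blocks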
>
> My reasoning is this:
>
>    - The presence of characters that occur sufficiently frequently won't
>    encode much, if any, useful information about the contents of a
>    specific page, so it's likely safe to request those individually.
>    - For low frequency characters, requests for them are now obfuscated
>    since all you know is that the client needed at least one of the codepoints
>    in a specific block.
>    - In our work on segmenting CJK fonts into unicode ranges we found
>    that the low frequency blocks are very rarely downloaded by clients, to
>    the point where the cost of downloading them had a negligible impact on
>    the overall bytes users were downloading. So even though every once in
>    a while you may need to grab a whole block of codepoints just to get
>    one character, over the long term that extra cost is nearly negligible.
>    - Unlike with our unicode range approach, we don't incur overhead with
>    small block sizes, so we could use blocks of 10-50 codepoints instead
>    of blocks of 250 code points. That should keep the penalty of fetching
>    a whole block for one character pretty low.
>    - With the frequency remapping code point set encoding approach, most
>    of the cost of encoding a set is in representing code points in the low
>    frequency area. By blocking up that space, the cost of encoding sets
>    should come down by a factor of the block size (a 10-50x improvement if
>    we're using blocks of 10-50 codepoints).
>    - If obfuscation even for high frequency codepoints is desired, then
>    we could use block sizes which are inversely proportional to frequency
>    (high frequency codepoints use small blocks, low frequency codepoints
>    use larger blocks; see the sketch after this list).
>
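> For that last point, one simple way to get block sizes that grow as
> frequency drops is to bucket by frequency rank. A sketch (the rank
> boundaries and sizes here are made up for illustration):
>
>     def block_size_for_rank(rank):
>         """Block size grows as the frequency rank gets worse."""
>         if rank < 2000:   # high frequency: effectively individual requests
>             return 1
>         if rank < 10000:
>             return 10
>         return 50
>
>     def assign_blocks(codepoints_by_rank):
>         """Assign block ids, walking codepoints in descending frequency."""
>         block_of, block_id, used = {}, 0, 0
>         for rank, cp in enumerate(codepoints_by_rank):
>             block_of[cp] = block_id
>             used += 1
>             if used >= block_size_for_rank(rank):
>                 block_id, used = block_id + 1, 0
>         return block_of
>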
> I'm curious what you think. Does this seem like a reasonable approach?
>

Received on Tuesday, 6 August 2019 16:58:56 UTC