Codepoint Set Compression Round 2 from Garret Rieger on 2019-08-05 (public-webfonts-wg@w3.org from August 2019)

From: Garret Rieger <grieger@google.com>
Date: Mon, 5 Aug 2019 15:35:09 -0700
To: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
Message-ID: <CAM=OCWZYShSWd-cetDGmkD8yQs7K_qvHHGvzK9+GF_18AHQKYA@mail.gmail.com>

I wasn't satisfied with the encoding efficiencies I was achieving in my
first exploration of code point set compression, particularly for CJK sets.
So I iterated on the existing strategies and tried out some new approaches.

Here's the list of new things that I tried:

- Range encoding (https://en.wikipedia.org/wiki/Range_encoding), a type
of entropy encoding.
- Encoding a list of codepoints.
- Encoding a list of codepoint ranges.
- A "hybrid" sparse bit set, which encodes the union of a list of ranges
and a sparse bit set.
- Frequency based remapping: for any of the above strategies remap the
codepoints to new values from [0, number of codepoints in the font - 1]
ordered by the frequency of occurrence of those code points.
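
To make the list and range-list strategies above concrete, here is a minimal sketch. The delta plus varint byte layout is an assumption chosen for illustration, not necessarily the wire format used in these experiments, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def to_ranges(codepoints: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse codepoints into sorted, inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for cp in sorted(set(codepoints)):
        if ranges and cp == ranges[-1][1] + 1:
            # Extends the current run of consecutive codepoints.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def encode_varint(value: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_codepoint_list(codepoints: Iterable[int]) -> bytes:
    """Assumed layout: sorted codepoints, each stored as a delta from the last."""
    out = bytearray()
    prev = 0
    for cp in sorted(set(codepoints)):
        out += encode_varint(cp - prev)
        prev = cp
    return bytes(out)


def encode_range_list(codepoints: Iterable[int]) -> bytes:
    """Assumed layout: per range, (gap from previous end, range length - 1)."""
    out = bytearray()
    prev_end = 0
    for start, end in to_ranges(codepoints):
        out += encode_varint(start - prev_end)
        out += encode_varint(end - start)
        prev_end = end
    return bytes(out)
```

With a layout like this, a range list costs a fixed number of bytes per run regardless of its length, so it does well on contiguous blocks and poorly on scattered codepoints, where the plain list is cheaper.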

I haven't finished a complete write-up like the one I did for round one
yet, but here are graphs of the encoding efficiency for CJK:
https://docs.google.com/spreadsheets/d/1ljQHcq6arC5wWv-hw7XxMQxSREgBbGV1MuUyhaCX_bU/edit?usp=sharing

Most notably, by using a combination of frequency remapping, sparse bit
sets, and brotli I was able to achieve an efficiency as low as 1 bit per
codepoint for large sets. In the previous round the best efficiency
achieved was over 4 bits per codepoint for CJK.
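
For reference, "bits per codepoint" here means the encoded size in bits divided by the number of codepoints in the set. The sketch below measures it for a set that has already been remapped to small indices. It makes two simplifying assumptions: it uses a flat bit set rather than the sparse bit set, and zlib from the Python standard library as a stand-in for brotli, so its numbers will not match the spreadsheet. The function names are hypothetical.

```python
import zlib
from typing import Iterable


def to_bitset(values: Iterable[int], universe_size: int) -> bytes:
    """Flat bit set: bit i is set when value i is in the set."""
    bits = bytearray((universe_size + 7) // 8)
    for v in values:
        bits[v >> 3] |= 1 << (v & 7)
    return bytes(bits)


def bits_per_codepoint(values: Iterable[int], universe_size: int) -> float:
    """Compressed size in bits divided by the number of codepoints encoded."""
    unique = set(values)
    # zlib is a stand-in here; the experiments in this thread used brotli.
    compressed = zlib.compress(to_bitset(unique, universe_size), 9)
    return 8 * len(compressed) / len(unique)
```

The intuition for why remapping helps: once the most frequent codepoints sit at the lowest indices, a typical request sets a dense run of bits near the start of the bit set, which a general-purpose compressor handles well.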

A note on frequency remapping: this requires us to know in advance which
code points exist in the source font. To handle that, the first request for
a font would be sent with the code point set not remapped. Part of the
response from the server would include a complete listing of the code
points in the source font. The client could then apply remapping to any
future requests.

This will of course add some extra size to the first response, but having
the client know in advance the specific code points in the source font has
value beyond compressing the sets. For example, if the browser knows which
codepoints are in the font, it doesn't need to waste requests/bytes sending
augmentation requests for codepoints that aren't actually in the font.
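
A rough sketch of the client-side remapping described above, once the font's codepoint listing has arrived in the first response. The frequency table is a hypothetical input: this message does not specify where it comes from, only that client and server must agree on the resulting order. The function names are also hypothetical.

```python
from typing import Dict, Iterable, List


def build_remapping(font_codepoints: Iterable[int],
                    frequency: Dict[int, int]) -> Dict[int, int]:
    """Map each codepoint in the font to an index in [0, n - 1].

    The most frequent codepoint gets index 0. Ties, and codepoints with no
    frequency data, are ordered by codepoint value so the result is stable.
    """
    ordered = sorted(set(font_codepoints),
                     key=lambda cp: (-frequency.get(cp, 0), cp))
    return {cp: index for index, cp in enumerate(ordered)}


def remap_request(requested: Iterable[int],
                  remapping: Dict[int, int]) -> List[int]:
    """Remap a requested set, dropping codepoints the font does not contain."""
    return sorted(remapping[cp] for cp in set(requested) if cp in remapping)
```

For example, with a font containing U+4E00 through U+4E02 and a table that ranks U+4E02 highest, a request for U+4E01, U+4E02, and U+0041 becomes the indices for the two CJK codepoints only. U+0041 is dropped on the client because the font listing shows it is absent.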

Received on Monday, 5 August 2019 22:35:49 UTC