- From: Garret Rieger <grieger@google.com>
- Date: Mon, 5 Aug 2019 15:35:09 -0700
- To: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
- Message-ID: <CAM=OCWZYShSWd-cetDGmkD8yQs7K_qvHHGvzK9+GF_18AHQKYA@mail.gmail.com>
I wasn't satisfied with the encoding efficiencies I was achieving in my first exploration of code point set compression, particularly for CJK sets, so I iterated on the existing strategies and tried out some new approaches. Here's the list of new things that I tried:

- Range encoding (https://en.wikipedia.org/wiki/Range_encoding), a type of entropy encoding.
- Encoded a list of codepoints.
- Encoded a list of codepoint ranges.
- A "hybrid" sparse bit set: encodes a union between a list of ranges and a sparse bit set.
- Frequency based remapping: for any of the above strategies, remap the codepoints to new values from [0, number of codepoints in the font - 1], ordered by the frequency of occurrence of those code points.

I don't have a complete write-up finished yet like I did for round one, but here are graphs of the encoding efficiency for CJK: https://docs.google.com/spreadsheets/d/1ljQHcq6arC5wWv-hw7XxMQxSREgBbGV1MuUyhaCX_bU/edit?usp=sharing .

Most notably, by using a combination of frequency remapping, sparse bit sets, and brotli I was able to achieve an efficiency as low as 1 bit per codepoint for large sets. In the previous round the best efficiency achieved was over 4 bits per codepoint for CJK.

A note on frequency remapping: this requires us to know in advance which code points exist in the source font. To handle that, the first request for a font would be sent with the code point set not remapped. Part of the response from the server would include a complete listing of the code points in the source font. The client could then apply remapping to any future requests. This will of course add some extra size to the first response, but having the client know in advance the specific code points in the source font has value beyond compressing the sets. For example, if the browser knows which codepoints are in the font, it doesn't need to waste requests/bytes sending augmentation requests for codepoints that aren't actually in the font.
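To make the frequency remapping idea concrete, here is a minimal Python sketch. It is not the code used for the measurements above. The function names, the tie-breaking rule, the choice to drop codepoints that are not in the font, and the sample frequency counts are all assumptions made for illustration. It shows why remapping helps: frequently used codepoints that are scattered in Unicode order end up clustered near 0, so they collapse into fewer ranges (and a denser bit set).

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_remapping(font_codepoints: Iterable[int],
                    frequency: Dict[int, int]) -> Dict[int, int]:
    """Map each codepoint in the font to a value in [0, n - 1], most frequent first.

    Assumption: ties and codepoints without frequency data are ordered by
    codepoint value so that client and server derive the same mapping.
    """
    ordered = sorted(set(font_codepoints),
                     key=lambda cp: (-frequency.get(cp, 0), cp))
    return {cp: index for index, cp in enumerate(ordered)}


def remap_set(codepoints: Iterable[int],
              remapping: Dict[int, int]) -> Set[int]:
    """Remap a requested set, dropping codepoints the font does not contain."""
    return {remapping[cp] for cp in codepoints if cp in remapping}


def to_ranges(values: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse a set of integers into sorted inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for value in sorted(set(values)):
        if ranges and value == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], value)  # extend the current range
        else:
            ranges.append((value, value))        # start a new range
    return ranges


# Hypothetical font coverage and usage counts (illustrative values only).
font = [0x4E00, 0x4E8C, 0x4E09, 0x56DB, 0x4E94]
counts = {0x4E00: 900, 0x4E8C: 500, 0x4E09: 400, 0x4E94: 300, 0x56DB: 50}

remapping = build_remapping(font, counts)
requested = {0x4E00, 0x4E09, 0x4E8C, 0x41}  # 0x41 is not in the font

print(to_ranges(requested & set(font)))         # three separate ranges
print(to_ranges(remap_set(requested, remapping)))  # one range: [(0, 2)]
```

In this toy example the three requested CJK codepoints are non-adjacent before remapping, but map to 0, 1, and 2 afterwards, so they form a single range. The codepoint 0x41 is dropped because the client already knows it is not in the font.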
Received on Monday, 5 August 2019 22:35:49 UTC