Further Improving Codepoint Set Encoding Efficiency

While writing the specification for sparse bit sets, I had a couple of
ideas for how we could further increase encoding efficiency:

   1. Vary the branch factor, which is currently fixed at 8.
   2. Extend the encoding so that intervals can be encoded efficiently via
   a zero byte.
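
To make the two ideas concrete, here is a minimal Python sketch. It is an
illustration under stated assumptions, not the wire format from the
specification: nodes are kept as a list of integer bit masks in
breadth-first order rather than packed bytes, and a node whose bits are all
zero is assumed to mean "every value in this node's range is present". The
names encode, decode, branch_factor and use_intervals are hypothetical.

```python
from bisect import bisect_left
from collections import deque


def encode(values, branch_factor=8, use_intervals=True):
    """Encode a set of non-negative ints as (depth, nodes).

    Each node is an int bit mask with branch_factor bits, listed in
    breadth-first order. Assumption: an all-zero node marks a range that
    is completely filled.
    """
    sorted_values = sorted(set(values))
    if not sorted_values:
        return 0, []

    # Smallest depth whose tree covers the largest value.
    depth = 1
    while branch_factor ** depth <= sorted_values[-1]:
        depth += 1

    def count(start, span):
        # Number of set members in [start, start + span).
        lo = bisect_left(sorted_values, start)
        hi = bisect_left(sorted_values, start + span)
        return hi - lo

    nodes = []
    queue = deque([(0, branch_factor ** depth)])
    while queue:
        start, span = queue.popleft()
        if use_intervals and count(start, span) == span:
            nodes.append(0)  # whole range present: one zero node, no children
            continue
        child_span = span // branch_factor
        mask = 0
        for i in range(branch_factor):
            child_start = start + i * child_span
            if count(child_start, child_span) > 0:
                mask |= 1 << i
                if child_span > 1:  # leaf bits need no child node
                    queue.append((child_start, child_span))
        nodes.append(mask)
    return depth, nodes


def decode(depth, nodes, branch_factor=8):
    """Inverse of encode() under the same assumptions."""
    result = set()
    if depth == 0:
        return result
    queue = deque([(0, branch_factor ** depth)])
    for mask in nodes:
        start, span = queue.popleft()
        if mask == 0:  # assumed meaning: the full interval is present
            result.update(range(start, start + span))
            continue
        child_span = span // branch_factor
        for i in range(branch_factor):
            if mask & (1 << i):
                child_start = start + i * child_span
                if child_span == 1:
                    result.add(child_start)
                else:
                    queue.append((child_start, child_span))
    return result


if __name__ == "__main__":
    # Printable ASCII as a stand-in for a small Latin codepoint set.
    latin = set(range(0x20, 0x7F))
    for bf in (2, 4, 8, 32):
        for intervals in (False, True):
            depth, nodes = encode(latin, bf, intervals)
            assert decode(depth, nodes, bf) == latin
            print(bf, intervals, len(nodes) * bf, "bits")
```

Running the script prints the encoded size in bits for each combination of
branch factor and interval setting, which is a rough way to see how the two
changes interact on a given codepoint set. The measurements that matter are
the ones from the simulations linked in this message.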

To see whether these changes would be worth adding to the specification, I
ran some simulations to test the new strategies. The results can be seen
here:
https://github.com/w3c/PFE-analysis/blob/main/results/set_encoding_branch_factor.md

In summary:

   - Varying the branch factor produced universal improvement, in some
   cases reducing encoded sizes by up to 19%.
   - Adding interval encoding produced very small gains, but still appears
   to be worthwhile for specific cases, such as Latin sets.

Therefore I propose that we include both changes in the specification for
sparse bit sets.

Received on Friday, 12 March 2021 22:40:28 UTC