- From: Garret Rieger <grieger@google.com>
- Date: Fri, 29 May 2020 17:33:54 -0700
- To: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
- Message-ID: <CAM=OCWYzsEdU_DBY5HjUOQcchoEjmkrEXesNYRLcO704XtnxcQ@mail.gmail.com>
As requested in the last working group meeting here are some details of how
we block codepoints together for splitting up a CJK font for unicode range
based serving.
External Presentation Slides:
https://www.unicodeconference.org/presentations-42/S5T3-Sheeter.pdf
This is the rough algorithm we use to generate frequency based buckets of
unicode characters used to split up CJK fonts for unicode range serving:
Inputs:
-
frequency_map: Frequency map of {unicode codepoint: frequency count}
-
sample_font: Representative font for the language (we typically use Noto
Sans JP/TC/SC/HK/KR)
-
frequency_threshold: only characters with frequency > then this
threshold will be pulled in via closure groups.
-
bucket_size: Number of codepoints in each bucket.
Outputs:
-
buckets: Subdivision of all of the codepoints in the frequency map into
‘num_buckets’ groups.
Algorithm Psuedo Code:
function find_all_sequences(font):
# Returns all codepoint sequences in the font which can trigger a
# GSUB or GPOS layout rule. We use a partially complete
# implementation which handles the simple GSUB/GPOS lookups
# but not the complex ones (ie. Chaining Context and Context based
# ones).
...
# Compute closures for each codepoint, that is the set of codepoints
# which may interact with it.
codepoint_groups = union_find()
for sequence in find_all_sequences(sample_font):
if (minimum frequency of any codepoint in sequence >
frequency_threshold):
codepoint_groups.union_all (codepoints in sequence)
for cp, frequency in frequency_map:
add all unicode canonical decompositions to codepoint_groups[cp] if
frequency > frequency_threshold
closures = dict()
for group in codepoint_groups:
for cp in group:
closures[cp] = group
buckets = list()
while frequency_map:
cps = set()
while cps.length < bucket_size:
next_cp = next_highest_frequency_codepoint(frequency_map)
reachable_cps = {next_cp} + closures[next_cp]
cps += reachable_cps
frequency_map -= reachable_cps
buckets.add(cps)
Received on Saturday, 30 May 2020 00:34:26 UTC