- From: Garret Rieger <grieger@google.com>
- Date: Fri, 29 May 2020 17:33:54 -0700
- To: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
- Message-ID: <CAM=OCWYzsEdU_DBY5HjUOQcchoEjmkrEXesNYRLcO704XtnxcQ@mail.gmail.com>
As requested in the last working group meeting, here are some details of how we block codepoints together when splitting up a CJK font for unicode-range based serving.

External Presentation Slides:
https://www.unicodeconference.org/presentations-42/S5T3-Sheeter.pdf

This is the rough algorithm we use to generate frequency-based buckets of unicode characters, which are then used to split up CJK fonts for unicode-range serving:

Inputs:
- frequency_map: Frequency map of {unicode codepoint: frequency count}.
- sample_font: Representative font for the language (we typically use Noto Sans JP/TC/SC/HK/KR).
- frequency_threshold: Only characters with frequency greater than this threshold will be pulled in via closure groups.
- bucket_size: Number of codepoints in each bucket.

Outputs:
- buckets: Subdivision of all of the codepoints in the frequency map into ‘num_buckets’ groups.

Algorithm Pseudo Code:

  function find_all_sequences(font):
    # Returns all codepoint sequences in the font which can trigger a
    # GSUB or GPOS layout rule. We use a partially complete
    # implementation which handles the simple GSUB/GPOS lookups
    # but not the complex ones (i.e. Chaining Context and Context based
    # ones).
    ...

  # Compute closures for each codepoint, that is the set of codepoints
  # which may interact with it.
  codepoint_groups = union_find()
  for sequence in find_all_sequences(sample_font):
    if (minimum frequency of any codepoint in sequence > frequency_threshold):
      codepoint_groups.union_all(codepoints in sequence)

  for cp, frequency in frequency_map:
    if frequency > frequency_threshold:
      add all unicode canonical decompositions of cp to codepoint_groups[cp]

  closures = dict()
  for group in codepoint_groups:
    for cp in group:
      closures[cp] = group

  buckets = list()
  while frequency_map:
    cps = set()
    # Stop early if the map runs out before the last bucket is full.
    while frequency_map and cps.length < bucket_size:
      next_cp = next_highest_frequency_codepoint(frequency_map)
      reachable_cps = {next_cp} + closures[next_cp]
      cps += reachable_cps
      frequency_map -= reachable_cps
    buckets.add(cps)
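For anyone who wants to experiment, here is a minimal Python sketch of the closure and bucketing steps. It is an illustration under stated assumptions, not our production code: the names UnionFind, build_closures and make_buckets are hypothetical, the layout sequences are passed in by the caller rather than extracted from a font (the find_all_sequences step is not implemented here), and "canonical decompositions" is approximated with the full NFD decomposition from Python's unicodedata module.

```python
import unicodedata
from collections import defaultdict


class UnionFind:
    """Small union-find over hashable items (hypothetical helper)."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union_all(self, items):
        items = list(items)
        if not items:
            return
        root = self.find(items[0])
        for other in items[1:]:
            self.parent[self.find(other)] = root


def build_closures(sequences, frequency_map, frequency_threshold):
    """Map each grouped codepoint to the set of codepoints it interacts with.

    `sequences` is assumed to be an iterable of codepoint tuples that can
    trigger a layout rule; producing it from a font is out of scope here.
    """
    groups = UnionFind()
    for seq in sequences:
        if seq and min(frequency_map.get(cp, 0) for cp in seq) > frequency_threshold:
            groups.union_all(seq)
    for cp, frequency in frequency_map.items():
        if frequency > frequency_threshold:
            # Assumption: NFD stands in for "all canonical decompositions".
            decomposed = [ord(c) for c in unicodedata.normalize("NFD", chr(cp))]
            groups.union_all([cp] + decomposed)
    members = defaultdict(set)
    for cp in list(groups.parent):
        members[groups.find(cp)].add(cp)
    return {cp: members[groups.find(cp)] for cp in list(groups.parent)}


def make_buckets(frequency_map, closures, bucket_size):
    """Fill buckets in descending frequency order, pulling in whole closures."""
    remaining = set(frequency_map)
    # Sort once by descending frequency; ties are broken by codepoint value.
    order = sorted(frequency_map, key=lambda cp: (-frequency_map[cp], cp))
    buckets, bucket = [], set()
    for cp in order:
        if cp not in remaining:
            continue  # Already pulled into an earlier bucket via a closure.
        reachable = {cp} | closures.get(cp, set())
        bucket |= reachable
        remaining -= reachable
        if len(bucket) >= bucket_size:
            buckets.append(bucket)
            bucket = set()
    if bucket:
        buckets.append(bucket)  # Final, possibly short, bucket.
    return buckets
```

As in the pseudocode, a bucket can overshoot bucket_size because a closure is always added as a whole, and it can contain decomposition codepoints that are not themselves keys of the frequency map.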
Received on Saturday, 30 May 2020 00:34:26 UTC