Codepoint Frequency Data is Now Available (and IFT Demo Updates) from Garret Rieger on 2025-10-29 (public-webfonts-wg@w3.org from October 2025)

From: Garret Rieger <grieger@google.com>
Date: Wed, 29 Oct 2025 16:32:11 -0600
To: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
Message-ID: <CAM=OCWaW9q9LYvMP-9ueyDrV4oSNcM30OY2_G84rXzuSf2N0bw@mail.gmail.com>

I'm happy to announce that I've just made available the Unicode codepoint
frequency data that I gathered from the web search index! I was originally
planning on releasing this on GitHub but ran into issues with size, so for
now I'm hosting it on our CDN.

Couple of things to help get started:

   - You can find a README describing the data set here:
   https://www.gstatic.com/fonts/unicode_frequency/v1/README.txt
   - The data is released under the W3C Software and Document License
   <https://www.w3.org/copyright/software-license-2023/> (same as the
   ift-encoder repo)
   - The individual data files that are available are listed here:
   https://www.gstatic.com/fonts/unicode_frequency/v1/DATA_FILE_LIST
   - Take any of the file names from that list and append it to
   https://www.gstatic.com/fonts/unicode_frequency/v1/ to get the URL.
   - For example:
   https://www.gstatic.com/fonts/unicode_frequency/v1/Language_en.riegeli
   - ift-encoder/util/freq_data_to_sorted_codepoints.cc
   <https://github.com/w3c/ift-encoder/blob/main/util/freq_data_to_sorted_codepoints.cc>:
   this utility demonstrates how to read data from these files, and can be
   used to dump out the single codepoint frequencies from a data file into
   text format.
   - Additionally ift-encoder/util/closure_glyph_keyed_segmenter_util.cc
   <https://github.com/w3c/ift-encoder/blob/main/util/closure_glyph_keyed_segmenter_util.cc>
   has support for taking these data files in for use in segmentation.
   - For programmatic access use the util::LoadFrequenciesFromRiegeli
   <https://github.com/w3c/ift-encoder/blob/main/util/load_codepoints.h#L39>
    API.
   - However, note that these do not yet support the cases where a single
   file is split into multiple parts (things of the form *.riegeli-*-of-*) as
   that was a last-minute change to the data set.
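
As a rough sketch of the URL rule above (this is not part of ift-encoder):
the helper names below are hypothetical, and the split-part check assumes
nothing more than the *.riegeli-*-of-* shape mentioned above, since the
exact part naming is not spelled out here.

```python
from fnmatch import fnmatchcase

# Base URL taken from the list above.
BASE_URL = "https://www.gstatic.com/fonts/unicode_frequency/v1/"


def data_file_url(file_name: str) -> str:
    """Append a file name from DATA_FILE_LIST to the base URL."""
    return BASE_URL + file_name.strip()


def is_split_part(file_name: str) -> bool:
    """Return True for names shaped like *.riegeli-*-of-*.

    Assumption: this glob is the only thing known about split-part names.
    """
    return fnmatchcase(file_name.strip(), "*.riegeli-*-of-*")


print(data_file_url("Language_en.riegeli"))
```

Filtering the file list with is_split_part is one way to skip the parts
that the current utilities cannot read yet.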

I'm planning some changes to the ift-encoder implementations to better work
with this data set:

   - Add support for reading and joining split files.
   - Add support for utilizing data directly from the CDN (currently you'll
   have to download these to the local filesystem before being able to use
   them).
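
Until that lands, a minimal sketch of the download step might look like the
following. The function name download_data_file and its opener parameter are
hypothetical (the parameter exists only so the network call can be swapped
out); only the base URL comes from this message.

```python
import shutil
import urllib.request
from pathlib import Path

# Base URL from the announcement; file names come from DATA_FILE_LIST.
BASE_URL = "https://www.gstatic.com/fonts/unicode_frequency/v1/"


def download_data_file(file_name, dest_dir, opener=urllib.request.urlopen):
    """Fetch one data file into dest_dir and return the local path.

    `opener` is any callable that takes a URL and returns a readable,
    context-managed binary stream (urllib.request.urlopen by default).
    """
    dest = Path(dest_dir) / file_name
    dest.parent.mkdir(parents=True, exist_ok=True)
    with opener(BASE_URL + file_name) as response, open(dest, "wb") as out:
        # Stream to disk in chunks instead of reading the whole file at once.
        shutil.copyfileobj(response, out)
    return dest
```

For example, download_data_file("Language_en.riegeli", "freq_data") would
save the file under freq_data/, and that local path can then be handed to
the utilities listed above.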

Lastly, I've updated the IFT demo (https://garretrieger.github.io/ift-demo/)
with new IFT fonts that utilized this frequency data for segmentation
generation. As part of that update I added Japanese IFT fonts and a couple
of Japanese text samples to show them off.

Let me know if there are any questions about the data set or if you have any
issues using it.

Received on Wednesday, 29 October 2025 22:32:34 UTC