- From: Garret Rieger <grieger@google.com>
- Date: Wed, 29 Oct 2025 16:32:11 -0600
- To: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
- Message-ID: <CAM=OCWaW9q9LYvMP-9ueyDrV4oSNcM30OY2_G84rXzuSf2N0bw@mail.gmail.com>
I'm happy to announce that I've just made available the Unicode codepoint
frequency data that I gathered from the web search index! I was originally
planning on releasing this on GitHub but ran into issues with size, so for
now I'm hosting it on our CDN.
A couple of things to help you get started:
- You can find a README describing the data set here:
https://www.gstatic.com/fonts/unicode_frequency/v1/README.txt
- The data is released under the W3C Software and Document License
<https://www.w3.org/copyright/software-license-2023/> (same as the
ift-encoder repo)
- The individual data files that are available are listed here:
https://www.gstatic.com/fonts/unicode_frequency/v1/DATA_FILE_LIST
- Take any of the file names from that list and append it to
https://www.gstatic.com/fonts/unicode_frequency/v1/ to get the URL.
- For example:
https://www.gstatic.com/fonts/unicode_frequency/v1/Language_en.riegeli
- ift-encoder/util/freq_data_to_sorted_codepoints.cc
<https://github.com/w3c/ift-encoder/blob/main/util/freq_data_to_sorted_codepoints.cc>:
this utility demonstrates how to read data from these files, and can be
used to dump out the single codepoint frequencies from a data file into
text format.
- Additionally, ift-encoder/util/closure_glyph_keyed_segmenter_util.cc
<https://github.com/w3c/ift-encoder/blob/main/util/closure_glyph_keyed_segmenter_util.cc>
has support for taking these data files in for use in segmentation.
- For programmatic access use the util::LoadFrequenciesFromRiegeli
<https://github.com/w3c/ift-encoder/blob/main/util/load_codepoints.h#L39>
API.
- However, note that these do not yet support the cases where a single
file is split into multiple parts (things of the form *.riegeli-*-of-*) as
that was a last-minute change to the data set.
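The URL rule described above (file name appended to the v1/ base URL) can be sketched in a few lines of Python. This is only an illustration, not part of ift-encoder: the helper names data_file_urls and is_split_part are hypothetical, and the sketch assumes DATA_FILE_LIST contains one file name per line, which the README should confirm.

```python
from fnmatch import fnmatchcase

# Base URL taken from the announcement above.
BASE_URL = "https://www.gstatic.com/fonts/unicode_frequency/v1/"


def data_file_urls(file_list_text: str, base_url: str = BASE_URL) -> list[str]:
    """Build full URLs from the text of DATA_FILE_LIST.

    Assumption: the list holds one file name per line; blank lines are skipped.
    """
    names = (line.strip() for line in file_list_text.splitlines())
    return [base_url + name for name in names if name]


def is_split_part(file_name: str) -> bool:
    """Return True for names matching the split-file pattern *.riegeli-*-of-*.

    The current ift-encoder utilities do not read these parts yet, so a
    caller may want to filter them out before loading.
    """
    return fnmatchcase(file_name, "*.riegeli-*-of-*")
```

For example, passing the single name Language_en.riegeli yields the example URL given above.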
I'm planning some changes to the ift-encoder implementations to better work
with this data set:
- Add support for reading and joining split files.
- Add support for utilizing data directly from the CDN (currently you'll
have to download these to the local filesystem before being able to use
them).
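Until direct CDN access is supported, the download step can be scripted. The following is a minimal sketch using only the Python standard library; download_data_file is a hypothetical helper name and the destination directory is whatever the caller chooses.

```python
import shutil
import urllib.request
from pathlib import Path


def download_data_file(url: str, dest_dir: str) -> Path:
    """Stream one data file to dest_dir and return the local path.

    The local file name is the last path component of the URL.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / url.rsplit("/", 1)[-1]
    # Copy in chunks so large files are not held in memory all at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example call (not run here), using the example URL from this message:
# download_data_file(
#     "https://www.gstatic.com/fonts/unicode_frequency/v1/Language_en.riegeli",
#     "freq_data",
# )
```

The returned local path can then be handed to the ift-encoder utilities, which currently expect files on the local filesystem.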
Lastly, I've updated the IFT demo (https://garretrieger.github.io/ift-demo/)
with new IFT fonts that utilized this frequency data for segmentation
generation. As part of that update I added Japanese IFT fonts and a couple
of Japanese text samples to show them off.
Let me know if there are any questions about the data set or if you have any
issues using it.
Received on Wednesday, 29 October 2025 22:32:34 UTC