- From: Garret Rieger <grieger@google.com>
- Date: Mon, 3 Nov 2025 11:01:54 -0700
- To: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
- Message-ID: <CAM=OCWaYGjEJiNOWWGBy+tWDwZU7x10KdVA-K4Z9ZmjtcjRStw@mail.gmail.com>
A couple of updates on this:
- A copy of the frequency data files is now checked into
GitHub: https://github.com/w3c/ift-encoder-data
- I've made some updates to ift-encoder to work with the data release:
   - Added support for loading sharded files (PR
   <https://github.com/w3c/ift-encoder/pull/164>)
   - Added support for loading automatically from the ift-encoder-data
   repository (PR <https://github.com/w3c/ift-encoder/pull/165>)
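
For anyone handling the split files in their own scripts rather than through ift-encoder, here is a minimal sketch of grouping shard names back into logical files. It is not what the PRs above implement. It assumes shard names follow the *.riegeli-*-of-* pattern with decimal index and total fields and zero-based indices; the "Example" file names and the group_shards helper are hypothetical.

```python
import re
from collections import defaultdict

# Assumed shard naming: "<base>.riegeli-<index>-of-<total>", decimal fields.
SHARD_RE = re.compile(r"^(?P<base>.+\.riegeli)-(?P<index>\d+)-of-(?P<total>\d+)$")


def group_shards(file_names):
    """Map each logical data file to its ordered list of physical file names.

    Unsharded names map to themselves. Sharded names are grouped under
    their base name and ordered by shard index (assumed zero-based).
    """
    groups = {}
    shards = defaultdict(dict)  # base name -> {index: (total, file name)}
    for name in file_names:
        match = SHARD_RE.match(name)
        if match is None:
            groups[name] = [name]
        else:
            index = int(match.group("index"))
            total = int(match.group("total"))
            shards[match.group("base")][index] = (total, name)

    for base, parts in shards.items():
        totals = {total for total, _ in parts.values()}
        if len(totals) != 1:
            raise ValueError(f"{base}: shards disagree on the total count")
        total = totals.pop()
        if sorted(parts) != list(range(total)):
            raise ValueError(
                f"{base}: expected shards 0..{total - 1}, found {sorted(parts)}"
            )
        groups[base] = [parts[i][1] for i in range(total)]
    return groups


if __name__ == "__main__":
    names = [
        "Language_en.riegeli",
        "Example.riegeli-00001-of-00002",  # hypothetical shard names
        "Example.riegeli-00000-of-00002",
    ]
    for base, parts in group_shards(names).items():
        print(base, parts)
```

Raising on a missing or inconsistent shard is a deliberate choice in this sketch, since silently joining an incomplete set would skew the frequencies.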
On Wed, Oct 29, 2025 at 4:32 PM Garret Rieger <grieger@google.com> wrote:
> I'm happy to announce that I've just made available the Unicode codepoint
> frequency data that I gathered from the web search index! I was originally
> planning on releasing this on GitHub but ran into issues with size, so for
> now I'm hosting it on our CDN.
>
> A couple of things to help get started:
>
> - You can find a README describing the data set here:
> https://www.gstatic.com/fonts/unicode_frequency/v1/README.txt
> - The data is released under the W3C Software and Document License
> <https://www.w3.org/copyright/software-license-2023/> (same as the
> ift-encoder repo)
> - The individual data files that are available are listed
> here: https://www.gstatic.com/fonts/unicode_frequency/v1/DATA_FILE_LIST
>    - Take any of the file names from that list and append it to
>    https://www.gstatic.com/fonts/unicode_frequency/v1/ to get the URL.
>    - For example:
>    https://www.gstatic.com/fonts/unicode_frequency/v1/Language_en.riegeli
> - ift-encoder/util/freq_data_to_sorted_codepoints.cc
> <https://github.com/w3c/ift-encoder/blob/main/util/freq_data_to_sorted_codepoints.cc>:
> this utility demonstrates how to read data from these files, and can be
> used to dump out the single codepoint frequencies from a data file into
> text format.
> - Additionally ift-encoder/util/closure_glyph_keyed_segmenter_util.cc
> <https://github.com/w3c/ift-encoder/blob/main/util/closure_glyph_keyed_segmenter_util.cc> has
> support for taking these data files in for use in segmentation.
> - For programmatic access use the util::LoadFrequenciesFromRiegeli
> <https://github.com/w3c/ift-encoder/blob/main/util/load_codepoints.h#L39>
> API.
> - However, note that these do not yet support the cases where a single
> file is split into multiple parts (things of the form *.riegeli-*-of-*) as
> that was a last-minute change to the data set.
>
> I'm planning some changes to the ift-encoder implementations to better
> work with this data set:
>
> - Add support for reading and joining split files.
> - Add support for utilizing data directly from the CDN (currently
> you'll have to download these to the local filesystem before being able to
> use them).
>
> Lastly, I've updated the IFT demo (
> https://garretrieger.github.io/ift-demo/) with new IFT fonts that
> utilized this frequency data for segmentation generation. As part of that
> update I added Japanese IFT fonts and a couple of Japanese text samples to
> show them off.
>
> Let me know if there are any questions about the data set or if you have any
> issues using it.
>
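
For anyone scripting downloads, the URL scheme from the announcement above (take a name from DATA_FILE_LIST and append it to the v1/ base URL) can be sketched as follows. This assumes DATA_FILE_LIST is plain text with one file name per line, which should be checked against the README; the data_file_urls helper is hypothetical.

```python
BASE_URL = "https://www.gstatic.com/fonts/unicode_frequency/v1/"


def data_file_urls(file_list_text):
    """Turn the text of DATA_FILE_LIST into full download URLs.

    Assumes one file name per line; blank lines and surrounding
    whitespace are ignored.
    """
    urls = []
    for line in file_list_text.splitlines():
        name = line.strip()
        if name:
            urls.append(BASE_URL + name)
    return urls


if __name__ == "__main__":
    # Inline sample so the sketch runs without network access.
    sample = "Language_en.riegeli\n"
    for url in data_file_urls(sample):
        print(url)
```

In practice the input text would come from fetching BASE_URL + "DATA_FILE_LIST", for instance with urllib.request.urlopen from the standard library.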
Received on Monday, 3 November 2025 18:02:18 UTC