Re: Codepoint Frequency Data is Now Available (and IFT Demo Updates)

A couple of updates on this:

   - I've now gotten a copy of the frequency data files checked into
   GitHub: https://github.com/w3c/ift-encoder-data
   - I've made some updates to ift-encoder to work with the data release:
      - Added support for loading sharded files (PR
      <https://github.com/w3c/ift-encoder/pull/164>)
      - Added support for loading data automatically from the
      ift-encoder-data repository (PR <https://github.com/w3c/ift-encoder/pull/165>)
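
For anyone scripting against the CDN copy instead, here is a minimal Python
sketch of the URL rule and the split-file naming described in the quoted
announcement below. It is not how ift-encoder does it. It assumes that
DATA_FILE_LIST holds one file name per line and that split files end in a
numeric "-<index>-of-<total>" suffix; both are guesses based on the
"*.riegeli-*-of-*" description, and the helper names are made up for
illustration.

```python
import re
from collections import defaultdict

# Base URL taken from the announcement quoted below.
BASE_URL = "https://www.gstatic.com/fonts/unicode_frequency/v1/"

# Assumed shard naming: "<name>.riegeli-<index>-of-<total>", numeric parts.
SHARD_RE = re.compile(r"^(?P<base>.+\.riegeli)-(?P<index>\d+)-of-(?P<total>\d+)$")


def data_file_urls(file_list_text):
    """Map each non-empty line of the file list to its download URL."""
    names = [line.strip() for line in file_list_text.splitlines() if line.strip()]
    return {name: BASE_URL + name for name in names}


def group_shards(names):
    """Group file names by logical data file, with shards in index order."""
    groups = defaultdict(list)
    for name in names:
        match = SHARD_RE.match(name)
        if match:
            groups[match.group("base")].append((int(match.group("index")), name))
        else:
            # Unsplit file: it forms a group of its own.
            groups[name].append((0, name))
    return {base: [n for _, n in sorted(parts)] for base, parts in groups.items()}
```

Passing the text of DATA_FILE_LIST to data_file_urls() gives the download
URLs, and group_shards() on the same names shows which parts need to be read
together as one logical data file.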


On Wed, Oct 29, 2025 at 4:32 PM Garret Rieger <grieger@google.com> wrote:

> I'm happy to announce that I've just made available the Unicode codepoint
> frequency data that I gathered from the web search index! I was originally
> planning on releasing this on GitHub but ran into issues with size, so for
> now I'm hosting it on our CDN.
>
> A couple of things to help get started:
>
>    - You can find a README describing the data set here:
>    https://www.gstatic.com/fonts/unicode_frequency/v1/README.txt
>    - The data is released under the W3C Software and Document License
>    <https://www.w3.org/copyright/software-license-2023/> (same as the
>    ift-encoder repo)
>    - The individual data files that are available are listed
>    here: https://www.gstatic.com/fonts/unicode_frequency/v1/DATA_FILE_LIST
>    - Take any of the file names from that list and append it to
>    https://www.gstatic.com/fonts/unicode_frequency/v1/ to get the URL.
>    - For example:
>    https://www.gstatic.com/fonts/unicode_frequency/v1/Language_en.riegeli
>    - ift-encoder/util/freq_data_to_sorted_codepoints.cc
>    <https://github.com/w3c/ift-encoder/blob/main/util/freq_data_to_sorted_codepoints.cc>:
>    this utility demonstrates how to read data from these files, and can be
>    used to dump out the single codepoint frequencies from a data file into
>    text format.
>    - Additionally ift-encoder/util/closure_glyph_keyed_segmenter_util.cc
>    <https://github.com/w3c/ift-encoder/blob/main/util/closure_glyph_keyed_segmenter_util.cc> has
>    support for taking these data files in for use in segmentation.
>    - For programmatic access use the util::LoadFrequenciesFromRiegeli
>    <https://github.com/w3c/ift-encoder/blob/main/util/load_codepoints.h#L39>
>     API.
>    - However, note that these do not yet support the cases where a single
>    file is split into multiple parts (things of the form *.riegeli-*-of-*) as
>    that was a last-minute change to the data set.
>
> I'm planning some changes to the ift-encoder implementations to better
> work with this data set:
>
>    - Add support for reading and joining split files.
>    - Add support for utilizing data directly from the CDN (currently
>    you'll have to download these to the local filesystem before being able to
>    use them).
>
> Lastly, I've updated the IFT demo (
> https://garretrieger.github.io/ift-demo/) with new IFT fonts that
> use this frequency data to generate their segmentations. As part of that
> update I added Japanese IFT fonts and a couple of Japanese text samples to
> show them off.
>
> Let me know if there are any questions about the data set or if you have any
> issues using it.
>

Received on Monday, 3 November 2025 18:02:18 UTC