- From: Garret Rieger <grieger@google.com>
- Date: Wed, 10 Dec 2025 11:12:09 -0700
- To: Skef Iterum <siterum@adobe.com>
- Cc: "public-webfonts-wg@w3.org" <public-webfonts-wg@w3.org>
- Message-ID: <CAM=OCWb8=-a2bBN+APZK8=gTuiuYPX66FsT8kXFf+SLtbE0YqQ@mail.gmail.com>
I put together a script to run the closure analysis across a collection of
fonts and report the number of fallback glyphs (and bytes for fallback
glyphs) per font (PR <https://github.com/w3c/ift-encoder/pull/169>). Next,
I ran that analysis across the open source Google Fonts collection
<https://github.com/google/fonts/>. Here's the resulting data from that run:
https://docs.google.com/spreadsheets/d/1W1lAJq2kdqn7Z6kvAyZXNToWMmjtTpxyyLIS7ORCiXM/edit?usp=sharing
To explain what this is showing:
- The chart gives a histogram (one data point per font file) of the
percentage of each font's outline data (glyf, CFF, or CFF2, after Brotli
compression) that consists of fallback glyphs.
- Fallback glyphs
<https://docs.google.com/presentation/d/1J849YLVaHm_HINeL3FhdSn3Ff-F_JzM_Cg0ZE_vUsvQ/edit?slide=id.g3a10c1ffb92_0_304#slide=id.g3a10c1ffb92_0_304>
are any glyphs in the font for which the closure analysis could not find
an activation condition.
- Here the closure analysis starts with each codepoint and non-default
feature <https://w3c.github.io/IFT/Overview.html#feature-tag-list> in
the font as an input segment; no patch merging or further processing is
performed beyond condition detection via closure analysis
<https://docs.google.com/presentation/d/1J849YLVaHm_HINeL3FhdSn3Ff-F_JzM_Cg0ZE_vUsvQ/edit?slide=id.g3a10c1ffb92_0_143#slide=id.g3a10c1ffb92_0_143>.
- The raw data can be found in the second tab in the sheet.
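To make the metric concrete, here is a minimal sketch of the per-font
computation described above. The actual analysis lives in the PR linked
earlier; the function name and data shapes below are invented for
illustration, assuming you already have each glyph's compressed outline
byte size and the set of fallback glyph IDs:

```python
# Hypothetical sketch (not the ift-encoder code): given the byte size of
# each glyph's outline data after Brotli compression and the set of glyph
# IDs the closure analysis found no activation condition for, compute the
# percentage of outline bytes attributable to fallback glyphs.

def fallback_byte_percentage(glyph_sizes: dict[int, int],
                             fallback_gids: set[int]) -> float:
    total = sum(glyph_sizes.values())
    fallback = sum(size for gid, size in glyph_sizes.items()
                   if gid in fallback_gids)
    return 100.0 * fallback / total if total else 0.0

# Toy example: glyphs 2 and 3 are fallbacks, 30 of 100 outline bytes.
sizes = {0: 10, 1: 60, 2: 20, 3: 10}
print(fallback_byte_percentage(sizes, {2, 3}))  # 30.0
```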
High-level summary of the results:
- At the median, 6.9% of outline data is fallback glyphs.
- At the 75th percentile, 13% of outline data is fallback glyphs.
- At the 95th percentile, 32% of outline data is fallback glyphs.
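For reference, summary statistics like those above can be reproduced from
the raw per-font percentages (second tab of the sheet) with the Python
standard library; this is just a sketch, not the actual reporting code:

```python
# Sketch: derive median / 75th / 95th percentile cut points from a list
# of per-font fallback percentages using the stdlib.
from statistics import quantiles

def summarize(percentages: list[float]) -> dict[str, float]:
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    q = quantiles(sorted(percentages), n=100)
    return {"median": q[49], "p75": q[74], "p95": q[94]}
```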
Overall it's looking like the closure analysis is working reasonably well
for most fonts. However, a small handful of fonts have an extremely high
percentage of fallback glyphs; these cover a variety of languages,
including some CJK ones. I haven't dug into them any further, but I
suspect they will provide some interesting insights into areas where the
current closure analysis approach struggles.
For next steps, there are two main areas of focus:
1. Dig into the outlier fonts surfaced in this analysis and understand
what's going on with them.
2. I'm currently working on a more advanced composite condition detector,
which should hopefully be able to eliminate fallback glyphs in most cases.
It still needs further development, but once it's a bit further along I'd
like to re-run this analysis with it enabled to see the impact on fallback
glyph percentages.
On Mon, Dec 1, 2025 at 5:49 PM Garret Rieger <grieger@google.com> wrote:
>
>
> On Tue, Nov 11, 2025 at 7:59 PM Skef Iterum <siterum@adobe.com> wrote:
>
>> At the TPAC WFWG meeting we discussed one difficult aspect of the
>> glyph-keyed patch encoding problem: glyphs with more “complex” substitution
>> patterns that the encoder punts on and just includes in the initial font. I
>> mentioned that this was a significant problem for certain fonts when I was
>> working on the IFTB prototype, which raised the question of whether I
>> remembered what any of those fonts were.
>>
>>
>>
>> Short answer: I don’t.
>>
>>
>>
>> However, I was looking through some old slides and they do mention a
>> pattern, which was that big aalt and nalt features tended to cause
>> problems. I also recall issues with vertical layout in Japanese. So these
>> are the things I recommend looking at first.
>>
>
> Sounds good, I can look through the open source Google Fonts collection
> and see if I can find some fonts that use aalt, nalt, or vertical layout
> and might make good test cases.
>
>
>>
>>
>> More generally, I think it would not take too long to build a simple
>> evaluator using the tree built with my draft depend branch of HarfBuzz:
>> https://github.com/skef/harfbuzz/tree/depend . Maybe that’s not much
>> different from just encoding each of the relevant fonts and seeing how
>> large the list of punts is, but evaluation code could be more specific
>> about what is causing the problems without having to go in and look by
>> hand.
>>
>>
>>
>> I’m also probably in a position where I can run any or all of the fonts
>> in Adobe’s library through an encoder, and while I obviously wouldn’t be
>> able to share the actual fonts freely, I could characterize the issue and
>> in some cases check with the foundry about providing limited access for
>> research purposes.
>>
>
> This sounds good. My very rough plan is to use the existing analysis: set
> up a small script that runs just the closure analysis portion of the
> segmenter on a collection of fonts and reports back
> the number of fallback glyphs relative to the total number of glyphs for
> each font. This should pretty quickly identify any fonts that are currently
> handled poorly. Once we have some examples in hand we can do more in-depth
> analysis (including potentially using the dependency branch) of those to
> see where issues are coming up. I was going to initially run that against
> the Google Fonts collection, and if you're able to also run it against the
> Adobe collection that would be very helpful.
>
> During the last few days of the conference before I left for vacation I
> implemented an early prototype of a detector that can find for each
> fallback glyph the list of segments that make up the glyph's composite
> condition. For example, for a glyph with a composite condition of
> (A AND B) OR (C AND D), it can find the set {A, B, C, D}. With this in
> hand we have two options:
>
> - Assign the glyph a superset condition which is a disjunction across
> the segments in the set (in the previous example this would be A OR B
> OR C OR D). This condition will always match at least when the true
> condition would (over-matching is functionally fine, just less efficient
> than the true condition).
> - Use the set to reduce the scope of a more complex analysis that
> finds the true condition.
>
> It's still early stages, but in the few fonts I tested it on it was able
> to successfully classify all fallback glyphs, so that's a pretty
> promising start. It still needs some work, and I'll share more details
> soon once I have time to put together a write-up on the approach.
>
>
>>
>>
>> Skef
>>
>
Received on Wednesday, 10 December 2025 18:12:31 UTC