Re: Glyph Closure Scaling

Thanks Myles, this is very informative. For the arabic fonts it makes
intuitive sense to me that they'd deviate due to them having lots of inter
character layout rules. However, seeing a few LGC fonts end up deviating
surprises me. I think it would be interesting to dig into the internals of
a couple of those LGC fonts to try and understand why the closure on those
pulls in so many extra glyphs.

Regarding slowness of using fontTools, that's a well known issue which is
why we're developing the harfbuzz subsetter. We haven't fully implemented
GSUB/GPOS subsetting in harfbuzz yet, but so far looks like the new
implementation should far exceed the performance of font tools in the range
of 300-1000x faster.

On Sun, Aug 4, 2019 at 10:23 PM Myles C. Maxfield <mmaxfield@apple.com>
wrote:

> Hi, all!
>
> I did some investigation into how the size of the glyph closure scales
> with the number of input glyphs. Here are the results:
>
>
> Each font on the system is a curve on that graph. The X axis is how many
> glyphs were requested to be present in the font subset, as a percentage of
> the total number of glyphs in the file. The Y axis is how many glyphs ended
> up in the glyph closure, as a percentage of the total number of glyphs in
> the file.
>
> A straight line from the bottom left to the top right indicates that the
> number of input glyphs equals the number of glyphs in the glyph closure.
> This means that either A) There are no shaping rules, or B) The shaping
> rules for a particular glyph only rely on a small number of other glyphs.
> The straighter the line, the more “embarassingly parallel” the font is to
> split up, and the more a codepoint-based solution would work. The higher
> above y=x a particular font’s curve lies, the less effective a
> codepoint-based solution becomes, because no matter how few characters you
> think you need, you actually need a huge percent of the font file to show
> them.
>
> When I gathered this data, I iterated through every font on the system,
> and for each font, created ’n’ subsetted fonts, where ’n’ is the number of
> glyphs in that font. Each subsetted font holds glyphs 0-n from the original
> font, plus whichever additional glyphs the glyph closure requires. The
> choice of picking glyphs 0-n is an intentional one; in general, fonts tend
> to keep letters from the same alphabet in adjacent glyphs, and most fonts
> that support non-alphabetic languages don’t have many shaping rules.
>
> There are 217 fonts on that graph, so you can see that the vast majority
> of fonts are very close to the y=x line, indicating that a codepoint-based
> solution is often good enough. However, a fair amount of fonts deviate
> dramatically from this line. I wanted to characterize this, so I wanted to
> calculate a “curvy-ness” for each line on the graph, which would indicate
> how close it is to y=x. After staring at the Wikipedia page for Curvature
> for a while and not immediately understanding enough of it, it occurred to
> me that, because the curves are all strictly increasing and are all bound
> by the same domain and range, simply taking the integral of each curve
> would give me a rough estimate for what I want. The y=x curves will have an
> integral of around 0.5, and the curves that deviate dramatically from that
> will have an integral approaching 1.0. So, here’s the integrals (each font
> is a point on this chart):
>
>
> And here is some information from the fonts which deviate dramatically
> from y=x:
>
> *SIMPO.TTF*
> *Simplified Arabic*
> *Arabic*
> 55.430566795
> *SIMPBDO.TTF*
> *Simplified Arabic*
> *Arabic*
> 55.455180571
> *BAHNSCHRIFT.TTF*
> *Bahnscrift*
> *Cyrillic; Greek; Latin*
> 55.515716155
> *SIMPFXO.TTF*
> *Simplified Arabic Fixed*
> *Arabic*
> 56.010571152
> *TRADO.TTF*
> *Traditional Arabic*
> *Arabic*
> 56.738334771
> *TRADBDO.TTF*
> *Traditional Arabic*
> *Arabic*
> 56.978956981
> *MAJALLA.TTF*
> *Sakkal Majalla*
> *Arabic*
> 57.645557254
> *MAJALLAB.TTF*
> *Sakkal Majalla*
> *Arabic*
> 57.688329142
> *MUSEOSANSFORDELL-BOLD.TTF*
> *Museo Sans For Dell*
>
> 57.835886141
> *MUSEOSANSFORDELL-REGULAR.TTF*
> *Museo Sans For Dell*
>
> 57.835886141
> *ANDLSO.TTF*
> *Andalus*
> *Arabic*
> 58.310219424
> *ROBOTO-REGULAR.TTF*
> *Roboto*
>
> 59.5144
> *MSUIGHUR.TTF*
> *Microsoft Uighur*
> *Arabic; ug-Arab*
> 60.165190808
> *MSUIGHUB.TTF*
> *Microsoft Uighur*
> *Arabic; ug-Arab*
> 60.193214853
> *PHAGSPAB.TTF*
> *Microsoft PhagsPa*
> *Phags-pa*
> 62.174107469
> *PHAGSPA.TTF*
> *Microsoft PhagsPa*
> *Phags-pa*
> 62.244791699
> *MMRTEXT.TTF*
> *Myanmar Text*
> *Myanmar*
> 62.375873702
> *MMRTEXTB.TTF*
> *Myanmar Text*
> *Myanmar*
> 62.375873702
> *ALDHABI.TTF*
> *Aldhabi*
> *Arabic; Arabic (Nastaliq variant)*
> 65.334787002
> *ARABTYPE.TTF*
> *Arabic Typesetting*
> *Arabic*
> 67.119839053
> *URDTYPEB.TTF*
> *Urdu Typesetting*
> *Arabic; Arabic (Nastaliq variant)*
> 68.552915869
> *URDTYPE.TTF*
> *Urdu Typesetting*
> *Arabic; Arabic (Nastaliq variant)*
> 68.574514189
> *SEGOESCB.TTF*
> *Segoe Script*
> *Cyrillic; Greek; Latin*
> 69.708548333
> *SEGOESC.TTF*
> *Segoe Script*
> *Cyrillic; Greek; Latin*
> 69.9902052035
> *HIMALAYA.TTF*
> *Microsoft Himalaya*
> *Tibetan*
> 77.638392788
>
> One last thing: while gathering this data, I discovered that creating a
> subset for a font can be extremely slow. For example, creating a subset
> with half of the glyphs of PingFang takes 40 seconds on my laptop. From the
> explanation above, I was creating one subset for every glyph of every font
> on the system, which means that gathering this data took quite a long time.
> After realizing how long this would take, I decided to run this on the most
> powerful computer I have access to, which happens to be my gaming machine
> that runs Windows. Therefore, the corpus this was run on is all the fonts
> that are preinstalled on my Windows machine. I’m in the process of
> gathering the same kind of data from the Google Fonts corpus and from the
> set of fonts preinstalled on macOS.
>
> Also, the smart server approach to streamable fonts creates two subsets
> and then computes a binary diff, which could take a seriously long time if
> it’s built on top of fonttools. I hope we can make sure that this
> additional cost is included in the model we build.
>
> Thanks,
> Myles
>
> P.S. If the size of the font is small, it sort of doesn’t matter how well
> it scales. I’ll soon send a scatter plot of the integrals on one axis
> against file sizes on another axis, so we can see if these difficult fonts
> actually matter.
>

Received on Monday, 5 August 2019 16:41:40 UTC