Glyph Closure Scaling

Hi, all!

I did some investigation into how the size of the glyph closure scales with the number of input glyphs. Here are the results:



Each font on the system is a curve on that graph. The X axis is how many glyphs were requested to be present in the font subset, as a percentage of the total number of glyphs in the file. The Y axis is how many glyphs ended up in the glyph closure, as a percentage of the total number of glyphs in the file.

A straight line from the bottom left to the top right indicates that the number of input glyphs equals the number of glyphs in the glyph closure. This means that either A) there are no shaping rules, or B) the shaping rules for a particular glyph only rely on a small number of other glyphs. The straighter the line, the more “embarrassingly parallel” the font is to split up, and the better a codepoint-based solution would work. The higher above y=x a particular font’s curve lies, the less effective a codepoint-based solution becomes, because no matter how few characters you think you need, you actually need a large percentage of the font file to show them.

When I gathered this data, I iterated through every font on the system, and for each font, created ’n’ subsetted fonts, where ’n’ is the number of glyphs in that font. The k-th subsetted font requests glyphs 0 through k from the original font, plus whichever additional glyphs the glyph closure requires. Requesting a contiguous run of glyphs starting at glyph 0 is an intentional choice; in general, fonts tend to keep letters from the same alphabet in adjacent glyphs, and most fonts that support non-alphabetic languages don’t have many shaping rules.
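
For anyone who wants to play with the idea without running a real subsetter, here is a minimal sketch of the procedure. It is not the script I used: the rule model (a set of glyphs that must all be present, plus the glyph they can produce) and the names `glyph_closure`, `closure_curve`, and `toy_rules` are assumptions made up for illustration, and a real font's shaping tables are far richer than this.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical rule model (an assumption, not a real font format):
# if every glyph in `inputs` is present, shaping may produce `output`.
Rule = Tuple[FrozenSet[int], int]


def glyph_closure(seed: Set[int], rules: List[Rule]) -> Set[int]:
    """Return the seed glyphs plus every glyph reachable through the rules."""
    closure = set(seed)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for inputs, output in rules:
            if output not in closure and inputs <= closure:
                closure.add(output)
                changed = True
    return closure


def closure_curve(num_glyphs: int, rules: List[Rule]) -> List[Tuple[float, float]]:
    """One (requested %, closure %) point per prefix of the glyph IDs."""
    points = []
    for k in range(1, num_glyphs + 1):
        seed = set(range(k))  # the first k glyph IDs
        closed = glyph_closure(seed, rules)
        points.append((100.0 * k / num_glyphs, 100.0 * len(closed) / num_glyphs))
    return points


# Toy 6-glyph font: glyphs 0 and 1 form ligature 5; glyph 2 has alternate 4.
toy_rules: List[Rule] = [(frozenset({0, 1}), 5), (frozenset({2}), 4)]
curve = closure_curve(6, toy_rules)
for requested, closed in curve:
    print(f"{requested:6.1f}% requested -> {closed:6.1f}% in closure")
```

In this toy font the curve climbs above y=x as soon as the ligature and the alternate get pulled in, which is the shape the difficult fonts show on the graph.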

There are 217 fonts on that graph, so you can see that the vast majority of fonts are very close to the y=x line, indicating that a codepoint-based solution is often good enough. However, a fair number of fonts deviate dramatically from this line. I wanted to characterize this, so I wanted to calculate a “curviness” for each line on the graph, which would indicate how close it is to y=x. After staring at the Wikipedia page for Curvature for a while and not immediately understanding enough of it, it occurred to me that, because the curves are all strictly increasing and are all bounded by the same domain and range, simply taking the integral of each curve would give me a rough estimate of what I want. The y=x curves will have an integral of around 0.5, and the curves that deviate dramatically from that will have an integral approaching 1.0. So, here are the integrals (each font is a point on this chart):
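
To make the integral idea concrete, here is a rough sketch using the trapezoidal rule. The function name `curve_integral` and the sample curves are assumptions for illustration, and the trapezoidal rule is just one reasonable way to integrate sampled points, not necessarily what I used. Both axes are taken as fractions in [0, 1] rather than percentages.

```python
from typing import List, Tuple


def curve_integral(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal-rule area under a sampled curve with x and y in [0, 1]."""
    pts = sorted(points)
    if not pts:
        return 0.0
    if pts[0][0] > 0.0:
        # Assumption: zero requested glyphs means an empty closure.
        pts.insert(0, (0.0, 0.0))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A font that sits exactly on y=x.
diagonal = [(i / 10, i / 10) for i in range(11)]
# A made-up font whose closure is the whole file after 1% of the glyphs.
saturated = [(0.0, 0.0), (0.01, 1.0), (1.0, 1.0)]

print(curve_integral(diagonal))   # about 0.5
print(curve_integral(saturated))  # about 0.995
```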



And here is some information from the fonts which deviate dramatically from y=x:

File                         | Font name               | Scripts                           | Integral
SIMPO.TTF                    | Simplified Arabic       | Arabic                            | 55.430566795
SIMPBDO.TTF                  | Simplified Arabic       | Arabic                            | 55.455180571
BAHNSCHRIFT.TTF              | Bahnschrift             | Cyrillic; Greek; Latin            | 55.515716155
SIMPFXO.TTF                  | Simplified Arabic Fixed | Arabic                            | 56.010571152
TRADO.TTF                    | Traditional Arabic      | Arabic                            | 56.738334771
TRADBDO.TTF                  | Traditional Arabic      | Arabic                            | 56.978956981
MAJALLA.TTF                  | Sakkal Majalla          | Arabic                            | 57.645557254
MAJALLAB.TTF                 | Sakkal Majalla          | Arabic                            | 57.688329142
MUSEOSANSFORDELL-BOLD.TTF    | Museo Sans For Dell     |                                   | 57.835886141
MUSEOSANSFORDELL-REGULAR.TTF | Museo Sans For Dell     |                                   | 57.835886141
ANDLSO.TTF                   | Andalus                 | Arabic                            | 58.310219424
ROBOTO-REGULAR.TTF           | Roboto                  |                                   | 59.5144
MSUIGHUR.TTF                 | Microsoft Uighur        | Arabic; ug-Arab                   | 60.165190808
MSUIGHUB.TTF                 | Microsoft Uighur        | Arabic; ug-Arab                   | 60.193214853
PHAGSPAB.TTF                 | Microsoft PhagsPa       | Phags-pa                          | 62.174107469
PHAGSPA.TTF                  | Microsoft PhagsPa       | Phags-pa                          | 62.244791699
MMRTEXT.TTF                  | Myanmar Text            | Myanmar                           | 62.375873702
MMRTEXTB.TTF                 | Myanmar Text            | Myanmar                           | 62.375873702
ALDHABI.TTF                  | Aldhabi                 | Arabic; Arabic (Nastaliq variant) | 65.334787002
ARABTYPE.TTF                 | Arabic Typesetting      | Arabic                            | 67.119839053
URDTYPEB.TTF                 | Urdu Typesetting        | Arabic; Arabic (Nastaliq variant) | 68.552915869
URDTYPE.TTF                  | Urdu Typesetting        | Arabic; Arabic (Nastaliq variant) | 68.574514189
SEGOESCB.TTF                 | Segoe Script            | Cyrillic; Greek; Latin            | 69.708548333
SEGOESC.TTF                  | Segoe Script            | Cyrillic; Greek; Latin            | 69.9902052035
HIMALAYA.TTF                 | Microsoft Himalaya      | Tibetan                           | 77.638392788

One last thing: while gathering this data, I discovered that creating a subset for a font can be extremely slow. For example, creating a subset with half of the glyphs of PingFang takes 40 seconds on my laptop. As explained above, I was creating one subset for every glyph of every font on the system, which means that gathering this data took quite a long time. After realizing how long this would take, I decided to run this on the most powerful computer I have access to, which happens to be my gaming machine that runs Windows. Therefore, the corpus this was run on is all the fonts that are preinstalled on my Windows machine. I’m in the process of gathering the same kind of data from the Google Fonts corpus and from the set of fonts preinstalled on macOS.

Also, the smart server approach to streamable fonts creates two subsets and then computes a binary diff, which could take a very long time if it’s built on top of fonttools. I hope we can make sure that this additional cost is included in the model we build.

Thanks,
Myles

P.S. If the size of the font is small, it sort of doesn’t matter how well it scales. I’ll soon send a scatter plot of the integrals on one axis against file sizes on another axis, so we can see if these difficult fonts actually matter.

Received on Monday, 5 August 2019 05:23:48 UTC