Re: Glyph Closure Scaling

> On Aug 8, 2019, at 2:58 PM, Garret Rieger <grieger@google.com> wrote:
> 
> Thanks for running this analysis over the Google Fonts collection. A few thoughts:
> Makes sense to me that it's mostly Arabic/Indic that appears to have poor scaling.
> You did your analysis by looking at glyph counts, I wonder if the results would change much if it instead counted total glyph bytes in the closure.
The choice of counting glyphs rather than bytes was intentional (an early version of the analysis measured file sizes, but I changed it to count glyphs instead). The goal of this experiment is to determine how tightly connected certain scripts / fonts are, which is a distinct concept from the distribution of visual complexity / control-point counts across a font's outlines. I’m trying to measure attributes of script shaping, not of outlines.

That being said, I intuitively wouldn’t expect the data to change much if we measure bytes. But I’m unlikely to repeat this analysis since it kept my gaming machine on at 100% CPU load with the fans screaming for a couple days straight. I used this script <https://github.com/litherum/SubsetScaling/blob/master/SubsetScaling/SubsetScaling.py> to run the analysis, and anyone else should feel free to modify it to measure any other metric.
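For anyone who doesn’t want to read the full script, here’s the shape of the measurement in miniature. The substitution graph below is a made-up toy, not data from any real font; the real analysis walks actual GSUB closures via fontTools:

```python
# Sketch of the closure-scaling measurement: for each prefix of glyph IDs
# 0..n-1, compute the transitive closure over a (toy, hypothetical)
# GSUB-style substitution graph and record how large the closure grows.

def glyph_closure(seed, substitutions):
    """Transitive closure: every glyph reachable from `seed` via substitution."""
    closure = set(seed)
    frontier = list(seed)
    while frontier:
        gid = frontier.pop()
        for target in substitutions.get(gid, ()):
            if target not in closure:
                closure.add(target)
                frontier.append(target)
    return closure

# Toy font: 8 glyphs. Glyph 1 substitutes to 5 (say, a ligature form),
# 5 chains onward to 6 and 7, and so on. Densely connected scripts look
# like this graph, writ large.
substitutions = {1: [5], 2: [5, 6], 5: [6, 7], 3: [4]}

num_glyphs = 8
for n in range(1, num_glyphs + 1):
    closure = glyph_closure(range(n), substitutions)
    print(f"subset glyphs 0..{n - 1}: closure size {len(closure)}")
```

A poorly-scaling font is one where these closure sizes jump toward the whole glyph set early.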
> For the closure did you use the default layout feature selection, or --layout_features=*?
I didn’t touch the set of layout features.
> Something I think that's missing from this analysis is a comparison to what would happen if we were instead requesting just the glyphs we needed that came from the result of shaping. That analysis could look something like this:
> Get sample text that uses codepoints from a font.
> Compute the glyph closure on those codepoints.
> Compute the exact set of glyphs needed for that specific sequence of text (should be smaller than the full closure).
Right, this analysis didn’t consider sequences of glyphs; it only subset glyphs 0–n. I wanted to see how the glyph sets scale through the whole font, which would have meant picking many representative strings of various lengths, ending with a string that covers every glyph in the file. That seemed too difficult to find or generate, so I opted for a simpler approach that should still answer the question.
> Compare the difference between those two. That delta represents the unnecessary glyph data that would be sent if we requested codepoints vs glyph ids.
> I suspect for things like Arabic and Indic, that using either method you'll end up with a majority of the font for any medium to large size piece of text.
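To make that proposed comparison concrete, here’s a toy sketch. The cmap and single ligature table below are hypothetical stand-ins for real shaping (which would go through HarfBuzz or similar):

```python
# Toy comparison of "glyph closure of the codepoints" vs. "exact glyphs the
# shaper used for one specific string". The cmap and ligature rules below
# are hypothetical stand-ins for a real font's tables.

cmap = {"f": 1, "i": 2, "l": 3}          # codepoint -> glyph id
ligatures = {(1, 2): 4, (1, 3): 5}       # (f,i) -> fi lig, (f,l) -> fl lig

def closure_of_codepoints(text):
    # The closure must include every glyph any substitution *could* produce.
    glyphs = {cmap[c] for c in text}
    for (a, b), lig in ligatures.items():
        if a in glyphs and b in glyphs:
            glyphs.add(lig)
    return glyphs

def shaped_glyphs(text):
    # Shape this exact sequence: apply ligatures to adjacent pairs only.
    ids, out, i = [cmap[c] for c in text], set(), 0
    while i < len(ids):
        pair = tuple(ids[i:i + 2])
        if pair in ligatures:
            out.add(ligatures[pair]); i += 2
        else:
            out.add(ids[i]); i += 1
    return out

text = "fil"
closure, exact = closure_of_codepoints(text), shaped_glyphs(text)
print("closure:", sorted(closure))           # includes both ligatures
print("exact:  ", sorted(exact))             # only the one actually used
print("delta:  ", sorted(closure - exact))   # over-sent if requesting by codepoint
```

The delta is exactly the "unnecessary glyph data" in Garret’s framing: glyphs the closure must conservatively include but that this particular text never renders.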
> For Indic even if within the individual scripts we don't get much benefit from PFE it's still valuable to use. We have several families which we have to serve each script as a separate family (for example: https://fonts.google.com/?query=baloo <https://fonts.google.com/?query=baloo>).
For legal reasons?
> Ideally we'd like to serve this as a single font which contains all the scripts and is progressively enriched
I believe this opens up a whole new can of worms. Which font family does the CSS reference? How does the page author know which fonts the server will actually end up replying with? How does that work for attribution and page-load pricing of fonts? And if the reason you can’t serve a single font today is legal, it’s unclear whether serving something that is technically different files, yet behaves to users as a single font, would actually address those reasons.
> so that users need only download the data for the script(s) they need. Usual techniques like unicode-range won't work, since there are shared codepoints between the scripts and you can get broken rendering when scripts are mixed (unicode-range selects the shared codepoints from the wrong file).
> 
> On Wed, Aug 7, 2019 at 12:25 PM Myles C. Maxfield <mmaxfield@apple.com <mailto:mmaxfield@apple.com>> wrote:
> 
> 
>> On Aug 7, 2019, at 12:03 PM, Myles C. Maxfield <mmaxfield@apple.com <mailto:mmaxfield@apple.com>> wrote:
>> 
>> Here’s the data from Google Fonts:
>> 
>> <Screen Shot 2019-08-06 at 10.24.13 PM.png>
>> 
>> <Screen Shot 2019-08-06 at 10.03.43 PM.png>
>> 
>> <Screen Shot 2019-08-06 at 10.05.23 PM.png>
> 
> The print is kind of small, but the X axis is pretty interesting.
> 
> Here’s the same chart, but zoomed in so that the right edge of the graph is at 1 megabyte (which would still be a sizable webfont).
> 
> <Screen Shot 2019-08-07 at 12.23.15 PM.png>
> 
> Things look significantly worse.
> 
>> 
>> Looks a bit more difficult than the Windows fonts.
>> 
>> NTR-Regular.ttf                  Telugu              78.736248482
>> Lohit-Bengali.ttf                Bengali             78.831622569
>> TenaliRamakrishna-Regular.ttf    Telugu, Latin       81.910529391
>> Peddana-Regular.ttf              Telugu, Latin       82.017367
>> Ramaraja-Regular.ttf             Telugu, Latin       82.060643793
>> Ponnala-Regular.ttf              Telugu              82.925248156
>> Sitara-Regular.ttf               Devanagari, Latin   83.176943573
>> Sitara-Bold.ttf                  Devanagari, Latin   83.176943573
>> Sitara-BoldItalic.ttf            Devanagari, Latin   83.186576710
>> Sitara-Italic.ttf                Devanagari, Latin   83.186576710
>> Amiri-Italic.ttf                 Arabic              83.235126624
>> Amiri-BoldItalic.ttf             Arabic              83.304386938
>> Amiri-Regular.ttf                Arabic              83.363365799
>> Amiri-Bold.ttf                   Arabic              83.39951432
>> SreeKrushnadevaraya-Regular.ttf  Telugu              85.420147454
>> Suranna-Regular.ttf              Telugu              85.4847986935
>> Taprom.ttf                       Khmer               85.498475514
>> Angkor-Regular.ttf               Khmer               85.498475514
>> Timmana-Regular.ttf              Telugu              85.7927372355
>> Chathura-ExtraBold.ttf           Telugu, Latin       86.099921648
>> Chathura-Regular.ttf             Telugu, Latin       86.099921648
>> Chathura-Bold.ttf                Telugu, Latin       86.099921648
>> Chathura-Thin.ttf                Telugu, Latin       86.099921648
>> Chathura-Light.ttf               Telugu, Latin       86.099921648
>> Bokor-Regular.ttf                Khmer               86.153956081
>> Moul.ttf                         Khmer               86.153956081
>> Siemreap.ttf                     Khmer               86.153956081
>> Dangrek.ttf                      Khmer               86.153956081
>> Metal.ttf                        Khmer               86.153956081
>> Moulpali.ttf                     Khmer               86.153956081
>> Content-Bold.ttf                 Khmer               86.153956081
>> Content-Regular.ttf              Khmer               86.153956081
>> Freehand.ttf                     Khmer               86.153956081
>> Koulen.ttf                       Khmer               86.153956081
>> Preahvihear.ttf                  Khmer               86.272334086
>> Bayon-Regular.ttf                Khmer               86.272334086
>> Chenla.ttf                       Khmer               86.272334086
>> OdorMeanChey.ttf                 Khmer               86.272334086
>> Mallanna-Regular.ttf             Telugu              86.441653183
>> Mandali-Regular.ttf              Telugu              86.442915173
>> Dhurjati-Regular.ttf             Telugu              86.442915173
>> Ramabhadra-Regular.ttf           Telugu              86.443724933
>> 
>>> On Aug 6, 2019, at 11:58 AM, Levantovsky, Vladimir <Vladimir.Levantovsky@monotype.com <mailto:Vladimir.Levantovsky@monotype.com>> wrote:
>>> 
>>> For a glyphID-based model - the first request could simply be the "whole font file" with glyph data zeroed out (which compresses to almost nothing). The subsequent request would patch that with the glyphs that are actually in use.
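A quick stdlib sketch of why that first request is so cheap: a long run of zeros compresses to almost nothing under DEFLATE (and under Brotli, which WOFF2 uses). The byte blobs below are synthetic stand-ins, not a real font:

```python
import random
import zlib

# Synthetic stand-in for a font: ~10 KB of "metadata" tables plus ~90 KB of
# glyph outline data (random bytes model already-dense outline data).
random.seed(0)
metadata = bytes(random.randrange(256) for _ in range(10_000))
glyph_data = bytes(random.randrange(256) for _ in range(90_000))

full_font = metadata + glyph_data
zeroed_font = metadata + bytes(len(glyph_data))  # glyph data zeroed out

# The zeroed file compresses to roughly the size of the metadata alone;
# the 90 KB of zeros contributes almost nothing.
print("full font compressed:  ", len(zlib.compress(full_font)))
print("zeroed font compressed:", len(zlib.compress(zeroed_font)))
```

The subsequent patch request then only carries the glyph records that are actually in use.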
>>> 
>>> -----Original Message-----
>>> From: mmaxfield@apple.com <mailto:mmaxfield@apple.com> <mmaxfield@apple.com <mailto:mmaxfield@apple.com>> 
>>> Sent: Tuesday, August 6, 2019 12:31 PM
>>> To: Jonathan Kew <jfkthame@gmail.com <mailto:jfkthame@gmail.com>>
>>> Cc: public-webfonts-wg@w3.org <mailto:public-webfonts-wg@w3.org>
>>> Subject: Re: Glyph Closure Scaling
>>> 
>>> 
>>> 
>>>> On Aug 6, 2019, at 2:34 AM, Jonathan Kew <jfkthame@gmail.com <mailto:jfkthame@gmail.com>> wrote:
>>>> 
>>>> On 05/08/2019 22:03, Myles C. Maxfield wrote:
>>>>> I was envisioning the range request model would send an early request for everything in the font other than the outlines. Percentage-wise, this works great for big fonts.
>>>> 
>>>> That's still two separate requests, isn't it? The client needs to make one request to get the font header (which it can assume fits within a predetermined reasonable max size); that will tell it how much it needs to request in order to get everything up to the outlines.
>>> 
>>> I was envisioning the early request wouldn’t be a range request. Instead, it would be a regular request for the whole file, and the browser would parse the bytes as they arrive, and close the connection (or stop requesting or whatever) when the glyph data is reached inside the file. This only makes sense if the glyph data is all at the end of the file.
>>> 
>>> This idea is based loosely on how <video> streaming works, so I should investigate how they solve this particular problem. That being said, I don’t think this approach has to work in any one particular way. We can (and even should!) try a bunch of different related strategies and see which one works the best in practice. 
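As a sketch of how the browser could know where to stop: the sfnt table directory sits at the very start of the file, so the offset of the glyf table can be parsed out of the first packet, and the stream abandoned once that offset is reached. This is a minimal hypothetical parser run against a synthetic two-table directory, not real browser code:

```python
import struct

def glyf_offset(header: bytes) -> int:
    """Parse an sfnt table directory and return the byte offset of 'glyf'.

    A streaming client can parse this from the first bytes to arrive and
    stop reading the response body once this offset is reached.
    """
    num_tables, = struct.unpack(">H", header[4:6])
    for i in range(num_tables):
        rec = header[12 + 16 * i : 28 + 16 * i]
        tag, _checksum, offset, _length = struct.unpack(">4sLLL", rec)
        if tag == b"glyf":
            return offset
    raise ValueError("no glyf table (CFF-flavored font?)")

# Synthetic directory: sfnt version, 2 tables, then two 16-byte records
# placing 'head' at offset 44 and 'glyf' at offset 100.
header = struct.pack(">LHHHH", 0x00010000, 2, 0, 0, 0)
header += struct.pack(">4sLLL", b"head", 0, 44, 54)
header += struct.pack(">4sLLL", b"glyf", 0, 100, 12_345)

print("stop downloading at byte", glyf_offset(header))
```

As noted above, this only pays off if the glyph data really is last in the file; otherwise stopping at the glyf offset would also discard tables that come after it.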
>>> 
>>>> 
>>>> So there are two complete round-trips to the server *before* it can begin to shape text and determine what glyph ranges it needs to request.
>>>> 
>>>> On a sufficiently low-latency connection that might be fine, but I'm concerned that it could amount to many milliseconds in plenty of real-world cases.
>>>> 
>>>> JK
>>>> 
>>> 
>>> 
>> 
> 

Received on Thursday, 8 August 2019 22:54:46 UTC