Re: Glyph Closure Scaling

The problem Google Fonts ran into with Indic is that there are characters,
most notably the danda, that exist in multiple scripts. As Garret noted, we
didn't want to force users to download all Indic scripts to use one because
the fonts are fairly large.

If we split Indic scripts with unicode-range then we get something like
this:

<style>
  @font-face {
    font-family: "Indic";
    src: url("bengali.woff2");                /* hypothetical URL */
    unicode-range: U+0964-0965, U+0980-09FF;  /* illustrative: danda + Bengali */
  }
  @font-face {
    font-family: "Indic";
    src: url("telugu.woff2");                 /* hypothetical URL */
    unicode-range: U+0964-0965, U+0C00-0C7F;  /* illustrative: danda + Telugu */
  }
</style>
<body>
  [Bengali text that uses the danda ... which will flip to the Telugu font
  for the danda (and any other shared character)]
</body>

In the sketch above the Telugu danda is always used because the Telugu face
is declared last and is therefore the highest-priority face that supports the
character. Rendering Bengali text will use the Telugu danda and break any
layout features involving it.

Related, there was once a proposal about script-specific danda (
http://unicode.org/~emuller/iwg/p1/utcdoc.html).



On Thu, Aug 8, 2019 at 4:43 PM Garret Rieger <grieger@google.com> wrote:

> Re: the Indic family. I think you misunderstood the use case; it's not
> split for legal reasons. It was originally authored as a single family
> (Baloo) that covered several Indic scripts. We split it up for serving into
> one family per script (family name is Baloo + script name) because
> rendering can be broken if it's served as a set of script subsets via
> unicode-range, and the combined single font was too large to serve as one
> piece. This is not ideal for developers using our service since they need
> to figure out which script they need and select the right family for their
> use case. With PFE we can serve it in its original form as a single family
> (named Baloo as originally intended) and have PFE fetch the right part of
> the family based on which script is being used on a page.
>
> On Thu, Aug 8, 2019 at 3:54 PM Myles C. Maxfield <mmaxfield@apple.com>
> wrote:
>
>>
>>
>> On Aug 8, 2019, at 2:58 PM, Garret Rieger <grieger@google.com> wrote:
>>
>> Thanks for running this analysis over the Google Fonts collection. A few
>> thoughts:
>>
>>    - Makes sense to me that it's mostly Arabic/Indic that appears to have
>>    poor scaling.
>>    - You did your analysis by looking at glyph counts; I wonder if the
>>    results would change much if it instead counted total glyph bytes in the
>>    closure.
>>
>> The choice of counting glyphs rather than bytes was intentional (indeed,
>> in an early version of the analysis I was measuring file sizes, but then
>> changed it to count glyphs instead). The goal of this experiment is to
>> determine how tightly connected certain scripts / fonts are, which is a
>> distinct concept from how the visual complexity / number of control points
>> in the outlines is distributed across a font. I’m trying to measure
>> attributes of script shaping, not of the outlines.
>>
>> That being said, I intuitively wouldn’t expect the data to change much if
>> we measure bytes. But I’m unlikely to repeat this analysis since it kept my
>> gaming machine on at 100% CPU load with the fans screaming for a couple
>> days straight. I used this script
>> <https://github.com/litherum/SubsetScaling/blob/master/SubsetScaling/SubsetScaling.py> to
>> run the analysis, and anyone else should feel free to modify it to measure
>> any other metric.
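>>
>> (For anyone who wants to reproduce a small piece of this without the full
>> script, a minimal sketch with fontTools that reports both the glyph count
>> and the byte size of a closure might look like the following; the font
>> path and codepoint range are hypothetical.)
>>
>> from io import BytesIO
>> from fontTools.subset import Subsetter, Options
>> from fontTools.ttLib import TTFont
>>
>> def closure_size(font_path, codepoints):
>>     # Subset to the given codepoints; the subsetter computes the glyph
>>     # closure (including layout-driven substitutions) internally.
>>     font = TTFont(font_path)
>>     subsetter = Subsetter(options=Options())
>>     subsetter.populate(unicodes=codepoints)
>>     subsetter.subset(font)
>>     buf = BytesIO()
>>     font.save(buf)
>>     return len(font.getGlyphOrder()), len(buf.getvalue())
>>
>> # Hypothetical example: the Telugu block against one of the fonts below.
>> glyphs, size = closure_size("NTR-Regular.ttf", range(0x0C00, 0x0C80))
>> print(glyphs, "glyphs,", size, "bytes")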
>>
>>
>>    - For the closure did you use the default layout feature selection,
>>    or --layout_features=*?
>>
>> I didn’t touch the set of layout features.
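>>
>> (If anyone wants to rerun it with every feature included, the fontTools
>> equivalent of --layout_features=* would be something like the following,
>> applied to the Options object passed to the subsetter:)
>>
>> from fontTools.subset import Options
>>
>> options = Options()
>> options.layout_features = ["*"]  # keep every GSUB/GPOS feature in the closure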
>>
>>
>>    - Something I think is missing from this analysis is a comparison to
>>    what would happen if we instead requested just the glyphs we needed as
>>    a result of shaping. That analysis could look something like this:
>>       - Get sample text that uses codepoints from a font.
>>       - Compute the glyph closure on those codepoints.
>>       - Compute the exact set of glyphs needed for that specific
>>       sequence of text (should be smaller than the full closure).
>>
>> Right, this analysis didn’t consider sequences of glyphs; it only
>> subsetted glyphs 0-n. I wanted to see how the glyph sets scale through the
>> font, which means I would have had to pick many representative strings of
>> various lengths, ending up with a string that covers every glyph in the
>> file. This seemed too difficult to find/generate, so I opted for a simpler
>> approach which should still be able to answer the question.
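>>
>> (If someone does want to run the shaping-based comparison Garret describes,
>> a rough sketch using uharfbuzz for shaping and fontTools for the closure
>> could look like the following; the font path and sample string are just
>> hypothetical placeholders.)
>>
>> import uharfbuzz as hb
>> from fontTools.subset import Subsetter
>> from fontTools.ttLib import TTFont
>>
>> FONT_PATH = "Lohit-Bengali.ttf"   # hypothetical
>> SAMPLE = "আমি বাংলায় গান গাই।"     # hypothetical sample text (ends with a danda)
>>
>> # Glyph ids actually used when shaping the sample text.
>> face = hb.Face(hb.Blob.from_file_path(FONT_PATH))
>> buf = hb.Buffer()
>> buf.add_str(SAMPLE)
>> buf.guess_segment_properties()
>> hb.shape(hb.Font(face), buf)
>> shaped_gids = {info.codepoint for info in buf.glyph_infos}
>>
>> # Glyph closure of the codepoints in the sample text.
>> font = TTFont(FONT_PATH)
>> subsetter = Subsetter()
>> subsetter.populate(unicodes={ord(c) for c in SAMPLE})
>> subsetter.subset(font)
>> closure = set(font.getGlyphOrder())
>>
>> print(len(shaped_gids), "glyphs shaped vs.", len(closure), "in the closure")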
>>
>>
>>       - Compare the difference between those two. That delta represents
>>       the unnecessary glyph data that would be sent if we requested
>>       codepoints vs. glyph ids.
>>       - I suspect that for things like Arabic and Indic, either method
>>       will end up with a majority of the font for any medium-to-large
>>       piece of text.
>>    - For Indic, even if we don't get much benefit from PFE within the
>>    individual scripts, it's still valuable to use. We have several families
>>    for which we have to serve each script as a separate family (for example:
>>    https://fonts.google.com/?query=baloo).
>>
>> For legal reasons?
>>
>>
>>    - Ideally we'd like to serve this as a single font which contains all
>>    the scripts and is progressively enriched
>>
>> I believe this opens up a whole new can of worms. What font family does
>> the CSS use? How does the page author know which fonts the server will
>> actually end up replying with? How does that work for attribution & page
>> load pricing of fonts? If the reason you can’t serve a single font is a
>> legal one, it’s unclear whether serving something that is technically
>> different fonts, yet behaves to users as if it were a single font, would
>> actually alleviate those concerns.
>>
>>
>>    - so that users need only download the data for the script(s) they
>>    need. Usual techniques like unicode-range won't work since there are
>>    shared codepoints between the scripts, and you can get broken rendering
>>    if scripts are mixed (unicode-range selects the shared codepoint from the
>>    wrong file).
>>
>>
>> On Wed, Aug 7, 2019 at 12:25 PM Myles C. Maxfield <mmaxfield@apple.com>
>> wrote:
>>
>>>
>>>
>>> On Aug 7, 2019, at 12:03 PM, Myles C. Maxfield <mmaxfield@apple.com>
>>> wrote:
>>>
>>> Here’s the data from Google Fonts:
>>>
>>> <Screen Shot 2019-08-06 at 10.24.13 PM.png>
>>>
>>> <Screen Shot 2019-08-06 at 10.03.43 PM.png>
>>>
>>> <Screen Shot 2019-08-06 at 10.05.23 PM.png>
>>>
>>>
>>> The print is kind of small, but the X axis is pretty interesting.
>>>
>>> Here’s the same chart, but zoomed in so that the right edge of the graph
>>> is at 1 megabyte (which would still be a sizable webfont).
>>>
>>> <Screen Shot 2019-08-07 at 12.23.15 PM.png>
>>>
>>> Things look significantly worse.
>>>
>>>
>>> Looks a bit more difficult than the Windows fonts.
>>>
>>> NTR-Regular.ttf (Telugu): 78.736248482
>>> Lohit-Bengali.ttf (Bengali): 78.831622569
>>> TenaliRamakrishna-Regular.ttf (Telugu, Latin): 81.910529391
>>> Peddana-Regular.ttf (Telugu, Latin): 82.017367
>>> Ramaraja-Regular.ttf (Telugu, Latin): 82.060643793
>>> Ponnala-Regular.ttf (Telugu): 82.925248156
>>> Sitara-Regular.ttf (Devanagari, Latin): 83.176943573
>>> Sitara-Bold.ttf (Devanagari, Latin): 83.176943573
>>> Sitara-BoldItalic.ttf (Devanagari, Latin): 83.186576710
>>> Sitara-Italic.ttf (Devanagari, Latin): 83.186576710
>>> Amiri-Italic.ttf (Arabic): 83.235126624
>>> Amiri-BoldItalic.ttf (Arabic): 83.304386938
>>> Amiri-Regular.ttf (Arabic): 83.363365799
>>> Amiri-Bold.ttf (Arabic): 83.39951432
>>> SreeKrushnadevaraya-Regular.ttf (Telugu): 85.420147454
>>> Suranna-Regular.ttf (Telugu): 85.4847986935
>>> Taprom.ttf (Khmer): 85.498475514
>>> Angkor-Regular.ttf (Khmer): 85.498475514
>>> Timmana-Regular.ttf (Telugu): 85.7927372355
>>> Chathura-ExtraBold.ttf (Telugu, Latin): 86.099921648
>>> Chathura-Regular.ttf (Telugu, Latin): 86.099921648
>>> Chathura-Bold.ttf (Telugu, Latin): 86.099921648
>>> Chathura-Thin.ttf (Telugu, Latin): 86.099921648
>>> Chathura-Light.ttf (Telugu, Latin): 86.099921648
>>> Bokor-Regular.ttf (Khmer): 86.153956081
>>> Moul.ttf (Khmer): 86.153956081
>>> Siemreap.ttf (Khmer): 86.153956081
>>> Dangrek.ttf (Khmer): 86.153956081
>>> Metal.ttf (Khmer): 86.153956081
>>> Moulpali.ttf (Khmer): 86.153956081
>>> Content-Bold.ttf (Khmer): 86.153956081
>>> Content-Regular.ttf (Khmer): 86.153956081
>>> Freehand.ttf (Khmer): 86.153956081
>>> Siemreap.ttf (Khmer): 86.153956081
>>> Koulen.ttf (Khmer): 86.153956081
>>> Preahvihear.ttf (Khmer): 86.272334086
>>> Bayon-Regular.ttf (Khmer): 86.272334086
>>> Chenla.ttf (Khmer): 86.272334086
>>> OdorMeanChey.ttf (Khmer): 86.272334086
>>> Mallanna-Regular.ttf (Telugu): 86.441653183
>>> Mandali-Regular.ttf (Telugu): 86.442915173
>>> Dhurjati-Regular.ttf (Telugu): 86.442915173
>>> Ramabhadra-Regular.ttf (Telugu): 86.443724933
>>>
>>> On Aug 6, 2019, at 11:58 AM, Levantovsky, Vladimir <
>>> Vladimir.Levantovsky@monotype.com> wrote:
>>>
>>> For a glyphID-based model, the first request could simply be the "whole
>>> font file" with the glyph data zeroed out (which compresses to almost
>>> nothing). The subsequent request would patch that with the glyphs that are
>>> actually in use.
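>>>
>>> (A rough sketch of what that first payload could look like with fontTools;
>>> the font path is hypothetical, brotli is used as a stand-in for WOFF2
>>> compression, and this only handles glyf-flavored fonts.)
>>>
>>> import io
>>> import brotli  # assumed installed; the same codec WOFF2 uses
>>> from fontTools.ttLib import TTFont
>>> from fontTools.ttLib.tables._g_l_y_f import Glyph
>>>
>>> font = TTFont("Baloo-Regular.ttf")  # hypothetical path
>>> glyf = font["glyf"]
>>> for name in font.getGlyphOrder():
>>>     glyf.glyphs[name] = Glyph()     # empty outline; loca is rebuilt on save
>>>
>>> buf = io.BytesIO()
>>> font.save(buf)
>>> print("zeroed-outline font compresses to",
>>>       len(brotli.compress(buf.getvalue())), "bytes")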
>>>
>>> -----Original Message-----
>>> From: mmaxfield@apple.com <mmaxfield@apple.com>
>>> Sent: Tuesday, August 6, 2019 12:31 PM
>>> To: Jonathan Kew <jfkthame@gmail.com>
>>> Cc: public-webfonts-wg@w3.org
>>> Subject: Re: Glyph Closure Scaling
>>>
>>>
>>>
>>> On Aug 6, 2019, at 2:34 AM, Jonathan Kew <jfkthame@gmail.com> wrote:
>>>
>>> On 05/08/2019 22:03, Myles C. Maxfield wrote:
>>>
>>> I was envisioning the range request model would send an early request
>>> for everything in the font other than the outlines. Percentage-wise, this
>>> works great for big fonts.
>>>
>>>
>>> That's still two separate requests, isn't it? The client needs to make
>>> one request to get the font header (which it can assume fits within a
>>> predetermined reasonable max size); that will tell it how much it needs to
>>> request in order to get everything up to the outlines.
>>>
>>>
>>> I was envisioning the early request wouldn’t be a range request.
>>> Instead, it would be a regular request for the whole file, and the browser
>>> would parse the bytes as they arrive, and close the connection (or stop
>>> requesting or whatever) when the glyph data is reached inside the file.
>>> This only makes sense if the glyph data is all at the end of the file.
>>>
>>> This idea is based loosely on how <video> streaming works, so I should
>>> investigate how they solve this particular problem. That being said, I
>>> don’t think this approach has to work in any one particular way. We can
>>> (and even should!) try a bunch of different related strategies and see
>>> which one works the best in practice.
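>>>
>>> (A sketch of that early request with a streaming HTTP client, assuming the
>>> outlines live in a glyf table at the end of the file; the URL is
>>> hypothetical.)
>>>
>>> import struct
>>> import requests
>>>
>>> URL = "https://fonts.example.com/Baloo-Regular.ttf"  # hypothetical
>>> resp = requests.get(URL, stream=True)
>>> data, glyf_offset = b"", None
>>> for chunk in resp.iter_content(chunk_size=8192):
>>>     data += chunk
>>>     if glyf_offset is None and len(data) >= 12:
>>>         num_tables = struct.unpack(">H", data[4:6])[0]
>>>         if len(data) >= 12 + 16 * num_tables:
>>>             # Scan the sfnt table directory for the glyf record.
>>>             for i in range(num_tables):
>>>                 tag, _, offset, _ = struct.unpack(
>>>                     ">4sIII", data[12 + 16 * i : 12 + 16 * (i + 1)])
>>>                 if tag == b"glyf":
>>>                     glyf_offset = offset
>>>     if glyf_offset is not None and len(data) >= glyf_offset:
>>>         resp.close()  # everything before the outlines has arrived; stop
>>>         break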
>>>
>>>
>>> So there are two complete round-trips to the server *before* it can
>>> begin to shape text and determine what glyph ranges it needs to request.
>>>
>>> On a sufficiently low-latency connection that might be fine, but I'm
>>> concerned that it could amount to many milliseconds in plenty of real-world
>>> cases.
>>>
>>> JK

Received on Monday, 19 August 2019 17:01:21 UTC