Streaming font thoughts

Hi Vlad,

Here are some thoughts, from memory, based on my experience streaming CJK
fonts; YMMV.

*Preprocessing*
My preprocessing pulls each font apart, producing a font base (AKA skeleton)
that glyph data can be added back into. To simplify OpenType support I leave
the OpenType layout tables in the base. They do not use much space for Noto
Sans CJK but might for other fonts.
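
Roughly, the split produces something like the shapes below (the names and
fields are illustrative, not my exact schema):

    // Illustrative shape of the preprocessed data, not my exact schema.
    interface FontBase {
      // The skeleton: headers plus the OpenType layout tables, with the
      // per-glyph outline data stripped out.
      bytes: Uint8Array;
    }

    interface GlyphPatch {
      glyphId: number;
      bytes: Uint8Array;  // outline data to add back into the base
    }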

*Download size*
The Chrome network panel reports that the font bases for 7 weights of Noto
Sans Simplified Chinese (IIRC about 3.5MB per weight as a WOFF2, ~14MB total)
take a little under 80KB compressed in total. I do not have any thoughts on
how to make this much smaller. My Javascript library also takes a little
under 80KB compressed. My rough studies say this could shrink to 30-40KB if I
spent the time to remove the Closure library. The glyph data for a given page
depends on the number of unique character/weight pairs, but 80KB for a cold
start is not an unreasonable size. The font base data and glyph data are
stored in the supersetted font
in IndexedDB for reuse. On subsequent (warm) page loads the new character
data needed is often as little as 5-20KB.

*Memory*
Keeping the font data in Javascript memory is very expensive, even in a
Uint8Array (IIRC 20MB of font data took 50MB of Javascript memory).
Because of this I do not keep an in-memory copy. For every injection I
fetch the latest version of the font from IndexedDB, inject the new glyph
data, fix up tables, store the supersetted data back, and then pass that
data to the C++ code. To support multiple pages/tabs doing 'concurrent'
updates I lock the DB during the fetch/inject/fix-tables/store operation.
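
A minimal sketch of that cycle, assuming the lock is simply a single
readwrite transaction covering the whole get/inject/put (IndexedDB serializes
overlapping readwrite transactions across tabs); spliceGlyphData stands in
for the real inject-and-fix-tables code:

    // Stand-in for the real inject-glyphs-and-fix-tables code.
    declare function spliceGlyphData(font: ArrayBuffer,
                                     patches: ArrayBuffer[]): ArrayBuffer;

    // Fetch the latest font, inject, and store it back inside one readwrite
    // transaction so concurrent tabs are serialized. Assumes an object store
    // named 'fonts' with out-of-line keys.
    function injectAndStore(db: IDBDatabase, fontKey: string,
                            patches: ArrayBuffer[]): Promise<ArrayBuffer> {
      return new Promise((resolve, reject) => {
        const tx = db.transaction('fonts', 'readwrite');
        const store = tx.objectStore('fonts');
        const get = store.get(fontKey);
        get.onsuccess = () => {
          const updated = spliceGlyphData(get.result as ArrayBuffer, patches);
          store.put(updated, fontKey);
          tx.oncomplete = () => resolve(updated);  // then hand off to C++
        };
        tx.onerror = () => reject(tx.error);
      });
    }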

*Character vs glyph*
Javascript can only see the character changes in the DOM. For many fonts
most of the characters map 1-to-1 with the glyphs. However, there are fonts
with more complex mappings. My current system precalculates the glyph
closures, and when a character is requested the server returns all related
glyphs (including accents, variants, etc.). For Noto Sans CJK this is not an
issue, but some fonts have many glyphs per character. If the browser reported
the needed glyph IDs, the download size could be much smaller for some fonts;
e.g., Arabic, Malayalam (IIRC), etc.
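
For illustration, the precalculated closure boils down to a map from
codepoint to all the glyphs that codepoint can pull in (loadClosureTable here
is a hypothetical stand-in for data built offline during preprocessing):

    // Hypothetical: closure table built offline during preprocessing.
    declare function loadClosureTable(): Map<number, number[]>;

    const closure = loadClosureTable();

    // Expand requested codepoints to all glyphs they can reach
    // (variants, accents, ligature components, ...).
    function glyphsForCodepoints(codepoints: number[]): Set<number> {
      const glyphs = new Set<number>();
      for (const cp of codepoints) {
        for (const gid of closure.get(cp) || []) {
          glyphs.add(gid);
        }
      }
      return glyphs;
    }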

*Processing*
The code to watch the DOM for character changes is fairly small and does
not use that much time/CPU.
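
Sketched with a MutationObserver (simplified; the real code also has to track
character/weight pairs and handle the text already in the DOM at load time):

    const seen = new Set<number>();

    function collect(text: string | null): void {
      for (const ch of text || '') {
        seen.add(ch.codePointAt(0)!);
      }
    }

    // Watch for text added or changed anywhere in the document.
    const observer = new MutationObserver((mutations) => {
      for (const m of mutations) {
        if (m.type === 'characterData') {
          collect((m.target as CharacterData).data);
        }
        m.addedNodes.forEach((node) => collect(node.textContent));
      }
      // ...then request any codepoints not already in the stored font.
    });

    observer.observe(document.documentElement, {
      childList: true, characterData: true, subtree: true,
    });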

Injecting new glyph (and other) data into the existing font data using an
array of DataViews is fast.
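
Roughly why that is fast (an illustrative sketch, not my exact code):
unchanged regions stay as views over the existing buffer and new glyph bytes
are spliced in as additional views, so nothing is copied until the segments
are flattened for storage or transfer:

    // Insert new glyph bytes as an extra segment; existing segments are not
    // copied, they keep pointing at the old buffer.
    function spliceSegment(segments: DataView[], index: number,
                           glyphBytes: Uint8Array): DataView[] {
      const view = new DataView(glyphBytes.buffer, glyphBytes.byteOffset,
                                glyphBytes.byteLength);
      return [...segments.slice(0, index), view, ...segments.slice(index)];
    }

    // Only at the end is everything copied once into a contiguous buffer.
    function flatten(segments: DataView[]): Uint8Array {
      const out = new Uint8Array(
          segments.reduce((n, s) => n + s.byteLength, 0));
      let offset = 0;
      for (const s of segments) {
        out.set(new Uint8Array(s.buffer, s.byteOffset, s.byteLength), offset);
        offset += s.byteLength;
      }
      return out;
    }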

Passing (transferring) the font data (an array of DataViews) from
Javascript to the C++ code takes significant time (dozens to hundreds of
milliseconds). A hundred or two milliseconds may not seem like a long time by
itself, but when there are 7 weights it really adds up. Because of this I
aggressively compact the client copy of the font as much as possible to
shorten the transfer time. Getting the text on the page not to blink while
the transfer is happening takes a bit of 'trickery'.
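
For illustration, one form the trickery can take with the FontFace API is to
keep the old face installed until the supersetted replacement has finished
loading:

    // Illustrative: swap in the new supersetted face only after it has
    // loaded, so the old face keeps rendering in the meantime.
    async function swapFace(family: string, oldFace: FontFace | null,
                            supersetted: ArrayBuffer): Promise<FontFace> {
      const next = new FontFace(family, supersetted);
      await next.load();            // the expensive JS -> C++ hand-off
      document.fonts.add(next);
      if (oldFace) {
        document.fonts.delete(oldFace);
      }
      return next;
    }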

*Storage*
Because I use IndexedDB the data is stored per origin. For example:
https://www.google.com is different from http://www.google.com, which is
different from https://mail.google.com, etc. This is a theoretical problem.
Practically, the size efficiency is so good that there can be many streamed
duplicates before reaching any of the subset sizes that I have seen.

*Random thoughts on what my system could benefit from*
Obviously doing everything on the C++ side would be the optimal solution.

Just keeping the font data on the C++ side would solve the transfer time.
Perhaps the data could sit behind an opaque Javascript handle that acts like
an ArrayBuffer/DataView. There would need to be some storage mechanism/API.
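
Purely hypothetical, but the handle could look something like:

    // Purely hypothetical: the bytes live on the C++ side, Javascript only
    // holds a handle and can patch/persist without copying the font out.
    interface OpaqueFontData {
      readonly byteLength: number;
      // Overwrite a region without copying the font into Javascript memory.
      write(offset: number, data: BufferSource): void;
      // Persist without round-tripping through an ArrayBuffer.
      store(key: string): Promise<void>;
    }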

Keeping the data on the C++ side would (hopefully) solve the transfer-time
and memory-size problems, and I would not have to compact the supersetted
font. The uncompacted font would have all the space needed for the missing
glyph data. This would make injecting glyph data a simple overwrite operation
(the cmap is slightly trickier but doable).

To allow multiple pages/tabs to 'concurrently' update the data it would be
critical to have a lock to serialize the data updates. Perhaps there could
be a copy operation that 'locks' the font until it is updated.

Regardless of where the data is, a routine to validate the data would be
really helpful to detect problems.

I think a reference implementation of the Javascript side would help the
industry adopt and validate this.

A different approach would be for the browser to store the font internally
in a more regular format (like indexed tables). The Javascript could
supply the core info. When the browser needed more data it could call back
to the Javascript. The Javascript could then take responsibility for
supplying requested data without requiring a new HTTP protocol.
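
Again purely hypothetical, but the shape could be as simple as:

    // Hypothetical: the page registers a source; the browser asks for
    // glyphs it is missing and the page fetches them however it likes.
    interface StreamedFontSource {
      family: string;
      baseTables: ArrayBuffer;  // the core info supplied up front
      requestGlyphs(glyphIds: number[]): Promise<ArrayBuffer>;
    }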

That's all I can think of off the top of my head at the moment.

Brian

Received on Friday, 17 November 2017 00:51:04 UTC