Re: TrueType Collections

I look forward to hashing this out some more in the call this afternoon. As
I made clear last week, my personal inclination is toward keeping things
simple, but I'm totally open to having the conversation.

I've been thinking about the technical issues some more, especially after
some very good suggestions by Vlad in the call last week. To make the
discussion a bit more concrete, here is a strawman proposal for what
TTC/OTC collections might look like when combined with the WOFF2 ideas.

First, we'd need a flag early in the header to indicate that the file is a
collection. The traditional sfnt technique would signal this in the first 4
bytes (the 'ttcf' tag), but it wouldn't have to be that way.
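
As a very rough illustration (and not part of the proposal itself), here's a
minimal sketch of that check in Python. It assumes collections would reuse
the sfnt convention of a 'ttcf' tag, carried in the flavor field, and that
the header keeps the current draft's layout of a 4-byte signature followed
by a 4-byte flavor; both are assumptions on my part, not settled details.

    import struct

    def is_collection(header: bytes) -> bool:
        # Assumed WOFF2-style header layout: 4-byte signature, then flavor.
        signature, flavor = struct.unpack(">4s4s", header[:8])
        if signature != b"wOF2":
            raise ValueError("not a WOFF2 file")
        # Assumption: a collection is flagged by the sfnt 'ttcf' tag.
        return flavor == b"ttcf"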

If it's not a collection, we make no changes to the existing wire format.

If it is a collection, then the structure of the file is:

WOFF header
combined table directory
collection tables (see below)
brotli-compressed data
(optional) metadata

This differs from the single-font case by inserting the "collection
tables", and by turning the table directory into a "combined" table directory.
The latter is essentially the same thing, but allows table tags to be
repeated. It also requires that glyf and loca tables be paired. In other
words, after stripping all other table types, the tags must match the
regular expression (glyf, loca)*. Keep in mind that a transformed loca is
largely a placeholder, indicating how much space is required in the
reconstructed font, but consuming no bytes of the data stream.
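
Just to pin the pairing rule down, here's a small sketch of the check a
compressor (or sanitizer) could run over the tags of the combined table
directory. The function name is illustrative only, and tags are assumed to
be 4-byte strings:

    def glyf_loca_paired(tags):
        # After stripping all other table types, the remaining tags must
        # match the regular expression (glyf, loca)*.
        filtered = [t for t in tags if t in (b"glyf", b"loca")]
        if len(filtered) % 2 != 0:
            return False
        return all(filtered[i] == b"glyf" and filtered[i + 1] == b"loca"
                   for i in range(0, len(filtered), 2))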

The collection tables have the following format:

collectionTable:
255USHORT nFonts
fontTable[nFonts]

fontTable:
255USHORT nTables
255USHORT index[nTables]

Each index refers to a table in the combined table directory, simply
counting up from 0. Thus, in the simple case of a single font with 10
tables, the collection table would be the sequence: 1 10 0 1 2 3 4 5 6 7 8 9.
A typical collection with two fonts, sharing all tables but one, would be:
2 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 10 4 5 6 7 8 9.
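
To make the encoding concrete, here's an illustrative parser for the
collection tables. The read_255ushort helper follows my reading of the
variable-length 255USHORT encoding in the current draft (small values as a
single byte, 253 as a word code, 254 and 255 as one-extra-byte codes);
everything else is specific to this strawman, so treat it as a sketch
rather than anything normative.

    def read_255ushort(data, pos):
        # Variable-length UInt16, per my reading of the current draft.
        code = data[pos]
        pos += 1
        if code == 253:    # word code: 16-bit big-endian value follows
            return (data[pos] << 8) | data[pos + 1], pos + 2
        if code == 255:    # one extra byte, offset by 253
            return data[pos] + 253, pos + 1
        if code == 254:    # one extra byte, offset by 506
            return data[pos] + 506, pos + 1
        return code, pos   # small values encode as themselves

    def read_collection_tables(data, pos=0):
        # collectionTable: nFonts, then one fontTable per font, each a
        # table count followed by that many indices into the combined
        # table directory.
        n_fonts, pos = read_255ushort(data, pos)
        fonts = []
        for _ in range(n_fonts):
            n_tables, pos = read_255ushort(data, pos)
            indices = []
            for _ in range(n_tables):
                index, pos = read_255ushort(data, pos)
                indices.append(index)
            fonts.append(indices)
        return fonts, pos

Since every value in the two-font example above is small enough to encode
as a single byte, feeding that byte sequence to read_collection_tables
yields [[0, 1, ..., 9], [0, 1, 2, 10, 4, 5, 6, 7, 8, 9]].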

This would allow stream-based processing, would not open the door to any
additional table overlaps, and would reuse much of the mechanism (and
security review) of the existing format.

I'm presenting this proposal independently of whether I think it's a good
idea, but I thought it might be helpful to have something concrete (with
some engineering details) so we can talk about the tradeoffs in more detail.

Looking forward to the call; it should be an interesting discussion,

Raph



On Tue, Feb 25, 2014 at 2:02 PM, Christopher Slye <cslye@adobe.com> wrote:

> Having now spoken to Ken Lunde and David Lemon, I will offer some thoughts
> here on behalf of Adobe:
>
> Adobe would like to see TTC/OTC have "equal citizenship" and support
> wherever fonts are used. Although we haven't yet seen a large real-world
> presence of such fonts in use, we think the increased availability of tools
> (such as those recently developed and released by Adobe) will bolster
> awareness, development and use. It's a format Adobe intends to support and
> build upon for numerous practical reasons, so we think support on the web,
> as a matter of principle, is the right thing to do.
>
> Having said that, we can't disagree with the prevailing WG opinion, which
> is that TTC/OTC, today, is an awkward format for web delivery, and that the
> engineering effort to add support seems out of balance with the benefits.
> The question which remains for us is whether it's better to invest in
> engineering effort now while we have the chance. Is the consensus opinion
> in the WG that deferring support today is not passing up the best chance to
> support TTC/OTC in WOFF? In other words, what are the chances of getting
> TTC/OTC support on the web in one, two, or three years?
>
> -Christopher
>
>
> On Feb 10, 2014, at 10:37 AM, Raph Levien <raph@google.com> wrote:
>
> > Hi WG folks,
> >
> >    One of the remaining technical questions to consider is whether to
> add support for functionality equivalent to TrueType collections to the
> format. In reviewing the pros and cons, I think there's a pretty strong
> case for _not_ including TTCs, and I think it would be useful to set down
> my thoughts.
> >
> >    First, on the pro side, I wanted to analyze the use cases. The main
> engineering question is how much file size saving is possible from serving
> multiple fonts with some shared tables. I've heard two use cases that are
> compelling (it is of course possible I'm missing more). The first is
> multiple styles of a complex script font, with all styles sharing a GSUB
> table, and the second is Han unification in CJK.
> >
> >    A font family has to be carefully designed for all styles to have the
> same GSUB. In particular, glyph numbering has to be consistent across the
> styles (of course, this means that cmap can be shared as well). I believe
> that in general it doesn't make sense for multiple styles to share GPOS,
> as, in high quality designs, mark positioning will be adjusted for the
> weights. I looked at a bunch of complex script fonts and found that only in
> Noto Sans Devanagari was the relative size of the GSUB table significant
> (it is about 34k out of 125k). However, in the existing design, the regular
> and bold weights are not glyph-compatible - the font would need to be
> reengineered to take advantage of such an optimization. In other Indic
> scripts I looked at, the GSUB size is less (Noto Sans Kannada is 5k out of
> 78k), and in other complex scripts _much_ less (Droid Sans Naskh is 2k out
> of 89k, and Thai is 294 bytes out of 21k).
> >
> >    The other use case is packaging CJK fonts specialized to different
> locales (simplified Chinese, traditional Chinese, and Japanese) in the same
> font file. Two observations here: in Web use, it is unusual to require
> multiple CJK appearances for the same font in the same web page. Exceptions
> do exist, for example dictionaries. Second, the OpenType variant mechanism
> is a more modern approach to the same problem. In addition, using OpenType
> variants is much easier for compatibility - if a browser doesn't support
> them, you still see completely valid CJK.
> >
> >    So my conclusion is that there are valid use cases but that they are
> not compelling - in practice, you'd only see significant savings for a tiny
> fraction of web pages.
> >
> >    On the "con" side I was concerned about spec complexity and security
> implications. A more minor concern was format compatibility (we have
> prototype implementations). It would be nice to not break compatibility,
> but that said, if there were a real advantage to changing the format, it
> would be worthwhile.
> >
> >    The existing draft basically treats the compressed data as a stream.
> In a minimal memory footprint environment, it would allow for decompressing
> a font file in a stream-based, incremental fashion, for the most part. The
> exception is filling in the checksum values, which would require going back
> and modifying the header after all tables are processed. However, for many
> applications the checksums can be considered optional.
> >
> >    (One point that I observed while digging into this, not directly
> relevant to the TTC question but perhaps interesting, is that to enable
> minimal memory footprint streaming, we'd have to enforce that the loca
> table follows glyf. This seems reasonable enough to me that I believe I
> will add it as a requirement for compressors in the spec.)
> >
> >    A compressed collection, by contrast, wouldn't be represented by a
> sequence of tables, each with a size (as in the present format). Rather, the most
> natural representation would be (offset, length) pair references to the
> uncompressed block. A straightforward implementation would just decompress
> the entire block, then extract tables using these references. Of course,
> most actual files would reduce to the streamable case, but having a
> separate code path to analyze that and use more efficient processing sounds
> (a) more complex, and (b) risky in terms of opening more potential security
> problems. Already, OpenType Sanitizer does extensive checking to validate
> that tables don't overlap, etc. Such checking is not necessary in the
> stream case (though of course sizes still need to be validated).
> >
> >    Thus, my conclusion is that the costs in terms of complexity and
> potential security risk are nontrivial. Therefore, I believe we should not try
> to standardize a method for font file collections with shared tables as
> part of WOFF.
> >
> >    Very happy to hear discussion, especially if I've missed something.
> >
> > Raph
> >
>
>
>

Received on Wednesday, 26 February 2014 19:34:00 UTC