Re: digrams [Re: One font embedding idea] from Andrew C. Bulhak on 1996-08-13 (www-font@w3.org from July to September 1996)

From: Andrew C. Bulhak <acb@cs.monash.edu.au>
Date: Tue, 13 Aug 1996 15:17:28 +1000 (EST)
To: lee@sq.com
Cc: www-font@w3.org, hoefler@typography.com
Message-Id: <199608130517.PAA24959@silas.cc.monash.edu.au>

[lee@sq.com]
> 
> > Also, I don't think it's mathematically possible to dissect even the most 
> > rudimentary English into enough digraphs to no longer require the 
> > presence of discrete glyphs.
> 
> It is possible in principle, but in practice unworkable:
> 
> Suppose there are no more than 300 characters.  Or, say we are red-necked
> American Ascii-users :-) and have only 96 visible characters.
> 
> There are 96 times 96 combinations, giving 9216 glyphs.

Most of those will never be used and can be omitted.  Indeed, compiling
the font dynamically for each document would make the embedded fonts
noninterchangeable and thus more secure.
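
As a rough illustration of how few of the 96 x 96 pairs a given text
touches, here is a minimal Python sketch.  It counts overlapping pairs of
adjacent characters (spaces included), which is only one possible way of
counting; the function name and the sample string are made up for the
example.

```python
from collections import Counter

def digram_counts(text):
    """Count every pair of adjacent characters (overlapping) in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

sample = "a boy or 3 boys"   # hypothetical sample document
counts = digram_counts(sample)
possible = 96 * 96           # every ordered pair of 96 visible characters

print(len(counts), "distinct digrams out of", possible, "possible")
```

Only the pairs that actually turn up would need a glyph in the compiled
font.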

> Note that `w' (for example) occurs 96 times as the 1st in a pair, and
> 96 times as the 2nd, so is transmitted 192 times instead of once in the
> very worst case.

The outline data can be compressed using something like Lempel-Ziv 
encoding, which (coupled with the right representation; unencrypted 
Type 1 bytecode would do) will save a lot of space in such redundant data.
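
A small sketch of the compression argument, under loud assumptions: the
"outlines" are 200 random bytes per character rather than real Type 1
charstrings, the character set is cut down to 8, and Python's zlib
(DEFLATE, an LZ77 variant) stands in for "something like Lempel-Ziv".

```python
import random
import zlib

rng = random.Random(0)   # fixed seed so the run is repeatable
alphabet = "abcdefgh"    # assumption: a tiny 8-character set, not all 96

# Assumption: 200 random bytes stand in for each character's outline data.
outline = {c: bytes(rng.randrange(256) for _ in range(200)) for c in alphabet}

# One glyph per ordered pair; each glyph just repeats the two outlines.
raw = b"".join(outline[a] + outline[b] for a in alphabet for b in alphabet)

packed = zlib.compress(raw, 9)
print(len(raw), "bytes raw,", len(packed), "bytes compressed")
```

Each outline is stored once and later copies become back-references.
Note that DEFLATE only looks back 32 KB, so a full-size font would want a
compressor with a larger window for the same effect.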

> Now, it turns out that the top 30 to 40 digrams are by far the most common
> in English.  I just did a count on a small (50 MBytes) database of magazine
> articles that has a vocabulary of only about 60,000 words, and found that
> there were approximately 1,281 digraphs used.  Different material has
> different results -- with a single web page, for example http://www.sq.com/,
> there were 93 digraphs.  That's still a lot more to transmit than the
> actual document.
> 
> Of course, this assumes that one of the characters in the digraph is `space'.
> If you discount that, then you can't represent the word "boy", because it
> needs either (bo)y or b(oy), so you have to have a single glyph.  If you
> allow trigraphs as well as digraphs, you can do it, but "a boy or 3 boys"
> involves sending (a) as a glyph, (boy) as a logotype, and (bo)(ys) as
> two digraphs -- clearly inefficient.  

Granted;  this is less efficient as a means of representing documents
than a conventional embedded font.  However, it is more efficient than 
rendering the document into its component curves and line segments.
Also, it doesn't have the problems associated with bitmaps.
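
The "a boy or 3 boys" example quoted above can be sketched in Python as
follows.  The splitting rule here (digraphs from the left, one trigraph
at the end of an odd-length word, a single glyph for a one-letter word)
is an assumption chosen to reproduce that example, not a proposed
encoding, and split_word is a made-up name.

```python
def split_word(word):
    """Split a word into digraphs, ending in a trigraph if the length is odd."""
    n = len(word)
    if n < 2:
        return [word] if word else []
    cut = n if n % 2 == 0 else n - 3      # where the final trigraph starts
    chunks = [word[i:i + 2] for i in range(0, cut, 2)]
    if n % 2:
        chunks.append(word[cut:])         # trailing trigraph
    return chunks

text = "a boy or 3 boys"
pieces = [split_word(w) for w in text.split()]
units = {chunk for word in pieces for chunk in word}

print(pieces)
print(len(units), "distinct units to transmit")
```

The set of units is what would have to be sent as outlines; "boy" and
"bo" + "ys" share nothing, which is the inefficiency being pointed out.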

> As J.H. pointed out, a document
> like
> 	a b c d e f
> and so on requires the whole normal character set anyway.

But that is a degenerate case.  In any case, trigraphs may be used here.

> There is some merit in using logotypes for "the" and "and", because they
> are so common, but it'd have to be a fairly large document before you saved
> very much space that way.

This idea won't save space over traditional encoding;  that's not its
purpose.  But it will make it harder and more laborious to rip out fonts
whilst preserving the scalability of the document.

-- 
  http://www.zikzak.net/~acb/       "`HAVE A NICE DAY' died for your sins."
           <acb@dev.null.org>                                  -- Mumbles

Received on Tuesday, 13 August 1996 01:23:03 UTC