digrams [Re: One font embedding idea] from lee@sq.com on 1996-08-13 (www-font@w3.org from July to September 1996)

From: <lee@sq.com>
Date: Mon, 12 Aug 96 21:38:52 EDT
To: www-font@w3.org, hoefler@typography.com
Message-Id: <9608130138.AA16608@sqrex.sq.com>

> Also, I don't think it's mathematically possible to dissect even the most 
> rudimentary English into enough digraphs to no longer require the 
> presence of discrete glyphs.

It is possible in principle, but in practice unworkable:

Suppose there are no more than 300 characters.  Or, say we are red-necked
American Ascii-users :-) and have only 96 visible characters.

There are 96 times 96 combinations, giving 9408 glyphs.
Note that `w' (for example) occurs 96 times as the 1st in a pair, and
96 times as the 2nd, so is transmitted 192 times instead of once in the
very worse case.

Now, it turns out that the top 30 to 40 digrams are by far the most common
in English.  I just did a count on a small (50 MBytes) database of magazine
articles that has a vocabulary of only about 60,000 words, and found that
there were approximately 1,281 digraphs used.  Different material has
different results -- with a single web page, for example http://www.sq.com/,
there were 93 digraphs.  That's still a lot more to transmit than the
actual document.

Of course, this assumes that one of the characters in the digraph is `space'.
If you discount that, then you can't represent the word "boy", because it
needs either (bo)y or b(oy), so you have to have a single glyph.  If you
allow trigraphs as well as digraphs, you can do it, but "a boy or 3 boys"
involves sending (a) as a glyph, (boy) as a logotype, and (bo)(ys) as
two digraphs -- clearly inefficient.  As J.H. pointed out, a document
like
	a b c d e f
and so on requires the whole normal character set anyway.

There is some merit in using logotypes for "the" and "and", because they
are so common, bt it'd have to be a fairly large document before you saved
very much space that way.


Note also that what I was suggesting with the encodings was a way of making
the font moderately secure on its way to the printer -- the actual
web page would still be in ISO 8859-1 (Latin 1) as now.  Strictly speaking
it'd be possible to write software to reverse-engineer given the plain text
document and the `encrypted' version.  But it would also be possible to
write a `malicious' web browser that saved unencrypted fonts.  All you
can hope to do is be secure against most people, not all people.

The FUSE `loss of faith' font (I think that's the one) is an example of
what you can do like this...

Lee

Received on Monday, 12 August 1996 21:39:38 UTC