Re: Hyphenation (was Re: A suggested tag)

[is www-style really appropriate for this??]

> On Thu, 17 Apr 1997, Vincent QUINT wrote:
> A full dictionary for each language would be too much expensive.
> Some time ago (in 1983) F. M. Liang proposed a very efficient
> method for compressing hyphenation dictionaries while making them
> much easier to search. This method is used in TeX and it produces
> quite good results with very small dictionaries. This is also the
> method used in Amaya.
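
For anyone who has not met Liang's method: each pattern is a short letter
string with digits between the letters, every pattern that matches
somewhere in the word contributes its digits, the highest digit at each
position wins, and an odd result permits a break.  The following is only
a rough sketch of that idea, not TeX's or Amaya's code.  The function
names are mine, the handful of patterns is illustrative rather than a
real pattern file, and a real implementation would store the patterns in
a trie instead of probing every substring.

```python
import re

# A few patterns in the style of a TeX pattern file (illustrative only).
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at",
            "1tio", "2io", "o2n"]

def parse_pattern(pat):
    """Split 'hy3ph' into its letters 'hyph' and gap scores [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pat)
    scores = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            scores[pos] = int(ch)   # digit applies to the gap before letter pos
        else:
            pos += 1
    return letters, scores

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of word, split at every permitted break."""
    table = dict(parse_pattern(p) for p in patterns)
    text = "." + word.lower() + "."          # dots mark the word boundaries
    levels = [0] * (len(text) + 1)           # levels[j]: gap before text[j]
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            scores = table.get(text[start:end])
            if scores:
                for k, s in enumerate(scores):
                    levels[start + k] = max(levels[start + k], s)
    pieces, last = [], 0
    # Keep at least left_min letters before and right_min after a break.
    for i in range(left_min, len(word) - right_min + 1):
        if levels[i + 1] % 2 == 1:           # odd score: break allowed
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces

print("-".join(hyphenate("hyphenation", PATTERNS)))  # hy-phen-ation
```

Note that the even digits matter as much as the odd ones: they are how
exceptions suppress a break that a more general pattern would allow.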

There is a common misconception about TeX's hyphenation which a reader
of the above may incorrectly infer.

TeX's hyphenation is very very good because it very rarely hyphenates words.
It goes to a _lot_ of trouble to avoid hyphenation.

When it does hyphenate words, it actually does so very very badly.
More precisely, the hyphenation is usually acceptable, but too often it is not.
We did detailed comparisons at one point when we were working on our
SoftQuad troff product years ago.

Jo Ossanna's troff algorithm was "right" more often, as I recall, but
looked worse because troff hyphenated more words, so there was more
opportunity for bad hyphenations to be seen!

I'm sorry I don't have the exact figures.  TeX may well have been improved,
and we went with a commercial hyphenation dictionary in the end.

In English, some words hyphenate differently depending on how they are
used -- e.g. record can be a verb, to re-cord, or a noun, a rec-ord.
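
A spelling-only lookup cannot get this right, so the hyphenator needs
some hint about the sense.  Here is a minimal sketch, assuming the
caller can supply a part of speech; the table and the function name are
made up for illustration.

```python
# Hypothetical exception table keyed by (spelling, part of speech).
HOMOGRAPHS = {
    ("record", "noun"): "rec-ord",
    ("record", "verb"): "re-cord",
}

def hyphenate_homograph(word, part_of_speech=None):
    """Return the hyphenated form, or the unbroken word if the sense is unknown."""
    return HOMOGRAPHS.get((word, part_of_speech), word)

print(hyphenate_homograph("record", "verb"))  # re-cord
print(hyphenate_homograph("record"))          # record (left unbroken)
```

Leaving the word unbroken when the sense is unknown is a deliberate
choice in this sketch: a missing hyphen is less visible than a wrong one.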

In German, words can change spelling and get longer when hyphenated
(in traditional orthography, for example, Zucker breaks as Zuk-ker).
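
The consequence for an implementation is that a break point cannot be
just an offset into the word; it has to say what text ends the first
line and what text starts the second.  A rough sketch, with a record
layout that is my own assumption and two examples from traditional
German orthography:

```python
from typing import NamedTuple

class Break(NamedTuple):
    start: int    # index of the first character replaced by the break
    end: int      # index one past the last replaced character
    before: str   # text that ends the first line (hyphen included)
    after: str    # text that starts the next line

def apply_break(word, brk):
    """Return the (first line, next line) fragments for one break."""
    return word[:brk.start] + brk.before, brk.after + word[brk.end:]

ZUCKER = Break(2, 4, "k-", "k")      # "ck" becomes "k-" / "k"
SCHIFFAHRT = Break(6, 6, "-", "f")   # a third "f" reappears after the break

print(apply_break("Zucker", ZUCKER))          # ('Zuk-', 'ker')
print(apply_break("Schiffahrt", SCHIFFAHRT))  # ('Schiff-', 'fahrt')
```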

It is necessary to allow authors to override hyphenation, and also to
allow supplemental hyphenation dictionaries for specialist vocabularies --
for example, Proximity has (or used to have) a medical hyphenation
dictionary, and I expect so do InSo/HoughtonMifflin.
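
One way to picture this is as layers searched in order: the author's
overrides first, then any specialist dictionary, then the general one.
The sketch below uses Python's collections.ChainMap for the layering.
The dictionary contents are illustrative entries of my own, not taken
from any vendor's product.

```python
from collections import ChainMap

# Illustrative entries only; "-" marks a permitted break.
general = {"hyphenation": "hy-phen-ation", "therapist": "ther-a-pist"}
medical = {"electrocardiogram": "elec-tro-car-dio-gram"}
author_overrides = {"therapist": "therapist"}   # author forbids any break

# Earlier maps win, so overrides shadow the dictionaries behind them.
layers = ChainMap(author_overrides, medical, general)

def hyphenation_for(word):
    """Return the hyphenated form from the first layer that knows the word."""
    return layers.get(word.lower())

print(hyphenation_for("therapist"))          # therapist
print(hyphenation_for("electrocardiogram"))  # elec-tro-car-dio-gram
print(hyphenation_for("unknownword"))        # None
```

A word that no layer knows comes back as None here, at which point a
real system would fall back to an algorithmic hyphenator or not break
the word at all.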

Even very basic hyphenation, however, can lead to a great improvement in
browser appearance.  My own (unreleased, please don't ask) XView-based
browser does hyphenation.  Of course, most browsers still can't justify
lines (pad them out with spaces so the margins align) and when they start,
we'll need to specify whether l e t t e r    s p a c i n g   is allowed,
and how much, and what is the flush zone, and how to treat the last line
of a paragraph, and so forth.
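
To make the justification part concrete, here is a minimal sketch for a
fixed-width display, assuming extra space goes only between words (no
letter spacing) and that the last line of a paragraph is set flush left.
Both are just one possible policy, and the function name is my own.

```python
def justify_line(words, width, last_line=False):
    """Pad the gaps between words so the line is exactly width columns."""
    if last_line or len(words) == 1:
        return " ".join(words)              # last line: flush left, ragged right
    gaps = len(words) - 1
    spaces = width - sum(len(w) for w in words)
    if spaces < gaps:
        raise ValueError("words do not fit in the requested width")
    base, extra = divmod(spaces, gaps)      # leftmost gaps take the remainder
    out = []
    for i, word in enumerate(words[:-1]):
        out.append(word + " " * (base + (1 if i < extra else 0)))
    out.append(words[-1])
    return "".join(out)

print(repr(justify_line(["pad", "them", "out"], 14)))  # 'pad  them  out'
```

Without hyphenation the gaps on a narrow column get very wide very
quickly, which is exactly why the two features have to be specified
together.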

Where automatic hyphenation is used in a browser, it is important that
(1) the automatic hyphen can be distinguished, where necessary, from a
    hyphenated or compound word, e.g. as in Lady fforbes-Hamilton, who
    would be rightly affronted if addressed as fforbesHamilton because
    when you copied the name down you thought the hyphen had been inserted
    by the browser.  This is sometimes done in print (e.g. in some
    reference works) using a single hyphen in one case and a double
    hyphen in the other.

(2) the copy-to-clipboard function (or Primary Selection on X) must
    (obviously) copy without the inserted hyphens.

(3) you must be able to inhibit hyphenation for a word or phrase -- even
    if the phrase is broken across lines at word spaces

(4) you should be able to specify characters such as  or - or / such that
    a line break may occur after them, and whether or not a hyphen is
    then inserted

(5) you must be able to give optional hyphenation points
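
Points (1) and (2) mostly come down to the renderer remembering which
hyphens it inserted itself.  A rough sketch of that bookkeeping follows;
the record layout is my own assumption, and "=" merely stands in for the
double hyphen used in print.  Points (3) to (5) are not covered here.

```python
from typing import NamedTuple

class Line(NamedTuple):
    text: str          # characters that really are in the document
    glue: str          # what the line break replaced: " " or ""
    auto_hyphen: bool  # True if the renderer broke a word at the line end

def render(lines, auto_mark="="):
    """Display text: inserted hyphens get a distinct mark."""
    return "\n".join(l.text + (auto_mark if l.auto_hyphen else "")
                     for l in lines)

def copy_text(lines):
    """Clipboard text: document characters only, no inserted hyphens."""
    return "".join(l.text + l.glue for l in lines).rstrip()

lines = [
    Line("addressed to Lady fforbes-", "", False),  # real hyphen, kept
    Line("Hamilton, who was af", "", True),         # automatic hyphen
    Line("fronted by the", " ", False),             # break at a word space
    Line("mistake", "", False),
]
print(render(lines))
print(copy_text(lines))
```

The copied text reads "addressed to Lady fforbes-Hamilton, who was
affronted by the mistake": the real hyphen survives and the automatic
one is gone.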

This is quite a lot.  Furthermore, interchange of hyphenation dictionaries
between browsers or other HTML agents seems essential.  Perhaps the
dictionaries could have a canonical XML representation but be shipped
as a compressed trie, for example, giving two levels of interchange.
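
As a sketch of what the two levels might look like: the element and
attribute names below are invented for illustration (no such format
exists that I know of), and the trie is a plain nested dictionary with
the compression step left out.  Words that contain a real hyphen are
not handled.

```python
import xml.etree.ElementTree as ET

def to_xml(entries, lang):
    """Serialize words like 'hy-phen-ation' to a (hypothetical) XML form."""
    root = ET.Element("hyphenation-dictionary", {"lang": lang})
    for hyphenated in sorted(entries):
        ET.SubElement(root, "word").text = hyphenated
    return ET.tostring(root, encoding="unicode")

def xml_to_trie(xml_text):
    """Build a letter trie; terminal nodes hold break offsets under key None."""
    trie = {}
    for element in ET.fromstring(xml_text).iter("word"):
        parts = element.text.split("-")
        node = trie
        for ch in "".join(parts):
            node = node.setdefault(ch, {})
        offsets, pos = [], 0
        for part in parts[:-1]:
            pos += len(part)
            offsets.append(pos)     # a break is allowed after pos letters
        node[None] = offsets
    return trie

def lookup(trie, word):
    """Return the list of break offsets, or None if the word is unknown."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(None)

xml_text = to_xml(["rec-ord", "hy-phen-ation"], "en")
trie = xml_to_trie(xml_text)
print(lookup(trie, "hyphenation"))  # [2, 6]
print(lookup(trie, "hyphen"))       # None (a prefix, not an entry)
```

The XML form is the one people and tools would exchange and edit; the
trie is what a browser would actually load.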

I think there are a lot of issues, but that they are not insoluble.