- From: <lee@sq.com>
- Date: Thu, 17 Apr 97 13:35:42 EDT
- To: www-html@w3.org, www-style@w3.org
[is www-style really appropriate for this??]

> On Thu, 17 Apr 1997, Vincent QUINT wrote:
> A full dictionary for each language would be too much expensive.
> Some time ago (in 1983) F. M. Liang proposed a very efficient
> method for compressing hyphenation dictionaries while making them
> much easier to search. This method is used in TeX and it produces
> quite good results with very small dictionaries. This is also the
> method used in Amaya.

There is a common misconception about TeX's hyphenation that a reader of the above might infer.

TeX's hyphenation looks very, very good because it very rarely hyphenates words. It goes to a _lot_ of trouble to avoid hyphenation. When it does hyphenate words, it actually does so very, very badly. More precisely, the hyphenation is usually acceptable, but often not.

We did detailed comparisons at one point when we were working on our SoftQuad troff product years ago. Joe Ossanna's troff algorithm was "right" more often, as I recall, but looked worse because troff hyphenated more words, so there was more opportunity for bad hyphenations to be seen! I'm sorry I don't have the exact figures. TeX may well have been improved since, and we went with a commercial hyphenation dictionary in the end.

In English, some words hyphenate differently depending on how they are used -- e.g. record can be a verb, to re-cord, or a noun, a rec-ord. In German, words can change spelling and get longer when hyphenated.

It is necessary to allow authors to override hyphenation, and also to allow supplemental hyphenation dictionaries for specialist vocabularies -- for example, Proximity has (or used to have) a medical hyphenation dictionary, and I expect so do InSo/HoughtonMifflin.

Even very basic hyphenation, however, can lead to a great improvement in browser appearance. My own (unreleased, please don't ask) XView-based browser does hyphenation.
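To make the pattern approach quoted above a little more concrete, here is a minimal sketch of Liang-style pattern matching with an exception list that overrides the patterns, which is one way an author or a specialist dictionary could be given the last word. Everything named here (Hyphenator, TOY_PATTERNS, the left_min/right_min limits) is an assumption made for illustration, and the handful of patterns is only a toy set of the kind found in TeX's English pattern file, not a usable dictionary.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into ('hyph', [0, 0, 3, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight of the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


class Hyphenator:
    """Toy Liang-style hyphenator; names and defaults are illustrative."""

    def __init__(self, patterns, exceptions=(), left_min=2, right_min=3):
        self.patterns = dict(parse_pattern(p) for p in patterns)
        self.left_min = left_min
        self.right_min = right_min
        # Exceptions carry explicit hyphens, e.g. "rec-ord", and are
        # stored as a list of break offsets keyed by the bare word.
        self.exceptions = {}
        for entry in exceptions:
            points, offset = [], 0
            for piece in entry.split("-")[:-1]:
                offset += len(piece)
                points.append(offset)
            self.exceptions[entry.replace("-", "").lower()] = points

    def break_points(self, word):
        key = word.lower()
        if key in self.exceptions:  # the override wins over the patterns
            return self.exceptions[key]
        padded = "." + key + "."    # dots mark the word boundaries
        scores = [0] * (len(padded) + 1)
        for start in range(len(padded)):
            for end in range(start + 1, len(padded) + 1):
                weights = self.patterns.get(padded[start:end])
                if weights is not None:
                    for k, w in enumerate(weights):
                        scores[start + k] = max(scores[start + k], w)
        # An odd score allows a break, an even score forbids it.
        return [i for i in range(self.left_min, len(key) - self.right_min + 1)
                if scores[i + 1] % 2 == 1]

    def hyphenate(self, word):
        pieces, last = [], 0
        for point in self.break_points(word):
            pieces.append(word[last:point])
            last = point
        pieces.append(word[last:])
        return pieces


# Toy pattern set, just enough to handle one word.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io"]

hyph = Hyphenator(TOY_PATTERNS, exceptions=["rec-ord"])
print(hyph.hyphenate("hyphenation"))  # ['hy', 'phen', 'ation']
print(hyph.hyphenate("record"))       # ['rec', 'ord']
```

Note that an exception list keyed on spelling alone can hold only one answer for "record", so it cannot tell the noun from the verb; that case still needs an author-supplied hint in the document.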
Of course, most browsers still can't justify lines (pad them out with spaces so the margins align), and when they start, we'll need to specify whether l e t t e r s p a c i n g is allowed, and how much, and what the flush zone is, and how to treat the last line of a paragraph, and so forth.

Where automatic hyphenation is used in a browser, the following are important:

(1) the automatic hyphen can be distinguished, where necessary, from the hyphen in a hyphenated or compound word, e.g. as in Lady fforbes-Hamilton, who would be rightly affronted if addressed as fforbesHamilton because when you copied the name down you thought the hyphen had been inserted by the browser. This is sometimes done in print (e.g. in some reference works) using a single hyphen in one case and a double hyphen in the other.

(2) the copy-to-clipboard function (or Primary Selection on X) must (obviously) copy without the inserted hyphens.

(3) you must be able to inhibit hyphenation for a word or phrase -- even if the phrase is broken across lines at word spaces.

(4) you should be able to specify characters such as or - or / such that a line break may occur after them and a hyphen be inserted or not be inserted.

(5) you must be able to give optional hyphenation points.

This is quite a lot.

Furthermore, interchange of hyphenation dictionaries between browsers or other HTML agents seems essential. Perhaps the dictionaries could have a canonical XML representation but be shipped as a compressed trie, for example, giving two levels of interchange.

I think there are a lot of issues, but that they are not insoluble.

Lee
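As a rough sketch of points (1) and (2) above: if the renderer records why each line ended, it can draw inserted hyphens on screen and still leave them out of copied text. The names below (Break, Line, displayed, copied) and the auto_mark parameter are hypothetical, chosen only for this illustration; no existing browser API is implied.

```python
from dataclasses import dataclass
from enum import Enum


class Break(Enum):
    SPACE = "space"        # line ended at a word space
    AUTO = "auto"          # renderer split a word and drew a hyphen
    EXPLICIT = "explicit"  # line ended after a hyphen already in the source


@dataclass
class Line:
    text: str                 # characters taken from the document itself
    end: Break = Break.SPACE  # why the line was broken here


def displayed(lines, auto_mark="-"):
    """Text as drawn: auto_mark is shown only at automatic breaks."""
    return "\n".join(
        line.text + (auto_mark if line.end is Break.AUTO else "")
        for line in lines
    )


def copied(lines):
    """Text as copied: automatic breaks are rejoined, real hyphens kept."""
    parts = []
    for line in lines:
        parts.append(line.text)
        if line.end is Break.SPACE:
            parts.append(" ")
    return "".join(parts).rstrip()


lines = [
    Line("Lady fforbes-", Break.EXPLICIT),
    Line("Hamilton discussed hy", Break.AUTO),
    Line("phenation"),
]
print(displayed(lines))
print(copied(lines))  # Lady fforbes-Hamilton discussed hyphenation
```

Passing auto_mark="=" to displayed() would give the print convention mentioned in (1), where the inserted hyphen is drawn differently from a real one.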
Received on Thursday, 17 April 1997 13:35:47 UTC