- From: Øistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Tue, 09 Jan 2007 00:02:54 +0100
Hyphenation does not seem to have been discussed on this list so far, and I think it should be.

General discussion:
[1] http://www.w3.org/International/O-HTML-hyphenation.html

Old proposal:
[2] http://www.nada.kth.se/i18n/html/hyph.html

Babel (LaTeX i18n package) documentation:
[3] ftp://tug.ctan.org/pub/tex-archive/macros/latex/required/babel/user.pdf

Unicode Technical Report #14 -- Line Breaking Properties:
[4] http://www.unicode.org/reports/tr14/tr14-6.html

In summary, hyphenation is a hard problem: breaking points cannot in general be established algorithmically; hyphenation dictionaries are not always available and typically do not contain long/rare/complex words (the ones that really need to be hyphenated); furthermore, distinct words may be spelt identically, but still need to be hyphenated differently; and several languages require spelling changes when words are hyphenated ([3] mentions Dutch, German (alte Rechtschreibung), Spanish, Norwegian, Swedish and Hungarian).

The controversy surrounding the meaning of &shy; (U+00AD) is probably over, although Opera currently seems not to render this character in accordance with Unicode (IE7 and Safari seem to do the right thing; Firefox does not hyphenate at all). [4] contains the following passage:

> SHY is rendered invisibly and has no width, except at a line break. The
> rendering of the soft hyphen depends on the script. For the Latin script
> it is rendered as a hyphen, however, some languages require a change
> in spelling surrounding an optional hyphen, if it occurs at a line break.
> For example in Swedish the word “tuggummi” changes to “tugg-gummi”
> when hyphenated.

It is not clear to me how this last point is supposed to be implemented in practice, however. (It is certainly n o t the case that `gg' should be hyphenated `gg-g' in a l l Swedish words.)
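To make the Swedish case concrete, here is a minimal sketch of a break point that carries its own spelling, loosely modelled on TeX's \discretionary{pre}{post}{nobreak}. The names `Discretionary` and `render` are made up for this illustration and do not come from [2], [3] or [4]; the sketch only shows why a single invisible character cannot express the change from `gg' to `gg-g'.

```python
from dataclasses import dataclass


# Hypothetical model of one optional break point, in the spirit of
# TeX's \discretionary{pre}{post}{nobreak}.
@dataclass(frozen=True)
class Discretionary:
    pre: str      # text that ends the first line if the break is taken
    post: str     # text that starts the next line if the break is taken
    nobreak: str  # text shown if the break is not taken


def render(parts, break_at=None):
    """Return the list of lines produced by `parts`.

    `break_at` is the 0-based index of the Discretionary at which the
    line is broken; None means the word is not broken at all.
    """
    lines = [""]
    seen = 0  # number of Discretionary items encountered so far
    for part in parts:
        if isinstance(part, str):
            lines[-1] += part
            continue
        if seen == break_at:
            lines[-1] += part.pre
            lines.append(part.post)
        else:
            lines[-1] += part.nobreak
        seen += 1
    return lines


# A plain soft hyphen is the special case with no spelling change.
PLAIN = Discretionary(pre="-", post="", nobreak="")

# The Swedish example quoted from [4]: `gg' becomes `gg-g' at a break.
tuggummi = ["tu", Discretionary(pre="gg-", post="g", nobreak="gg"), "ummi"]

print(render(tuggummi))               # ['tuggummi']
print(render(tuggummi, break_at=0))   # ['tugg-', 'gummi']
print(render(["hyphen", PLAIN, "ation"], break_at=0))  # ['hyphen-', 'ation']
```

The point of the sketch is that the spelling change is attached to one particular break point in one particular word, so nothing here assumes that every `gg' in Swedish behaves the same way.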
The proposal [2] suggests the addition of a new <hyph> element, modelled after TeX's \discretionary command (with a possibly superfluous addition), that makes it possible to specify which characters to render before and after a line break if the word is broken.

Currently, hyphenation and justification are scarce on the Web, and the average blogger hardly misses these features. If, however, writing books in HTML (as mentioned on this list) is to become commonplace, these issues must be dealt with somehow, and explicit markup seems to be unavoidable at least in some cases.

I hope this can lead to a fruitful discussion.

--
Øistein E. Andersen
Received on Monday, 8 January 2007 15:02:54 UTC