- From: Øistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Thu, 11 Jan 2007 14:00:16 +0100
Thanks for all the interesting comments so far.

On 9 Jan 2007, at 10:23AM, Anne van Kesteren wrote:

> "[...] simple cases could also be handled with a `soft hyphen' (&shy;), if browsers
> would only support it." which is of course not an excuse to go around and introduce
> a new element!

Indeed. Browser support has improved since that document was written, though. Today, all major browsers except Firefox support the soft hyphen, and the purpose of a new element would be to enable more complex cases to be handled properly, not to replace the soft hyphen.

On 9 Jan 2007, at 1:3PM, Leons Petrazickis wrote:

> Hyphenation is a presentational problem. [...] We should
> avoid embedding presentational hyphenation tags in the actual text.

Yes, if possible. The verb record is supposed to be hyphenated re-cord, whilst the correct hyphenation of the noun is rec-ord. For this reason, TeX never hyphenates record (unless the author writes rec\-ord or re\-cord). This problem may be more common in other languages, but expecting authors to hard-code hyphenation points in particular words is probably futile.

> I would suggest that the first priority is getting a naive hyphenator into browsers.

This would probably have to be language-specific, though. See comments on Prince below.

> [To handle special cases,] I would suggest a hyphenation dictionary in the
> <head> of the document.

Not a bad idea. The problem is that the words requiring special attention will depend on the particular `naïve' algorithm implemented, i.e., the browser...

On 9 Jan 2007, at 1:15PM, Alexey Feldgendler wrote:

> In some typographical traditions, non-full-justified text is sometimes
> hyphenated.

In the mechanical-typewriter era, a typist would certainly choose to hyphenate when the bell sounded in the middle of a long word.

On 9 Jan 2007, at 1:37PM, Håkon Wium Lie wrote:

> Prince6 (www.princexml.com) supports these properties:
>
> hyphenate: none | auto
> hyphenate-dictionary: none | url(...)
> hyphenate-before: <int>
> hyphenate-after: <int>
> hyphenate-lines: none | <int>

From http://www.princexml.com/howcome/2006/p6/p6demo2.html:

> Prince can read the hyphenation format pioneered by TeX and reused by many
> other applications. OpenOffice hosts a number of hyphenation dictionaries that
> are reusable in Prince6.

This is a great step forward. I hope something along these lines will find its way into desktop browsers as well.

It should be noted, though, that (unless I have misunderstood something) the `hyphenation dictionaries' are really patterns from which hyphenation points are computed. The particular method used in TeX was discovered by Frank M. Liang about 25 years ago and implemented in TeX soon thereafter. According to the TeXbook, the original US-English patterns find about 90% of the hyphenation points given in a dictionary or about 95% of the permissible hyphenation points in a typical text (where common words are more frequent) without making any mistakes.

This is, however, only one part of TeX's hyphenation system. The next level is a hyphenation exception dictionary, a list of fully hyphenated words that would not otherwise be hyphenated correctly. (Plain) TeX contains a list of fourteen words including `present' (which cannot be hyphenated without knowing whether it is a noun or a verb, so TeX does not try) and `ta-ble' (a common word that would otherwise not be hyphenated at all), and the author can add words at any time using the \hyphenation command.

In addition to this, hyphenation can be indicated locally. This is needed in order to hyphenate words like rec-ord/re-cord and is the only level that deals with spelling changes. If The New Yorker were using TeX and wanted preëmptive to hyphenate as pre-emptive, this rule could not be incorporated into either the patterns or the exception dictionary.
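To make the three levels concrete, here is a rough Python sketch of how they might fit together: locally marked break points win, then the exception list, then Liang-style patterns. The pattern list is a tiny illustrative set written in Liang's notation, not an excerpt from a real pattern file, and the function names and the minimum-fragment defaults are my own assumptions rather than anything TeX or Prince exposes.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Illustrative patterns only (assumption: NOT a real pattern file).
# Digits sit between letters; odd values allow a break, even values forbid it.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
# Exceptions written the way \hyphenation takes them; no hyphen means "never break".
TOY_EXCEPTIONS = "present ta-ble"


def parse_pattern(pattern):
    """Split e.g. 'hen5at' into ('henat', [0, 0, 0, 5, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    points = [0] * (len(letters) + 1)
    position = 0
    for ch in pattern:
        if ch.isdigit():
            points[position] = int(ch)
        else:
            position += 1
    return letters, points


def parse_exceptions(spec):
    """Map each unhyphenated word to its list of fragments."""
    return {w.replace("-", ""): w.split("-") for w in spec.split()}


PATTERNS = [parse_pattern(p) for p in TOY_PATTERNS]
EXCEPTIONS = parse_exceptions(TOY_EXCEPTIONS)


def hyphenate(word, patterns=PATTERNS, exceptions=EXCEPTIONS, left_min=2, right_min=3):
    """Return the fragments of word, split at permitted break points."""
    # Level 3: break points marked locally by the author take precedence.
    if SOFT_HYPHEN in word:
        return word.split(SOFT_HYPHEN)
    # Level 2: the exception dictionary (fragments come back lowercased).
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    # Level 1: patterns. Dots mark the word boundaries; scores[i] is the
    # value of the gap just before dotted[i].
    dotted = "." + key + "."
    scores = [0] * (len(dotted) + 1)
    for letters, points in patterns:
        start = dotted.find(letters)
        while start != -1:
            for offset, value in enumerate(points):
                scores[start + offset] = max(scores[start + offset], value)
            start = dotted.find(letters, start + 1)
    fragments, last = [], 0
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:  # gap before word[k]
            fragments.append(word[last:k])
            last = k
    fragments.append(word[last:])
    return fragments


print(hyphenate("hyphenation"))  # pattern level
print(hyphenate("table"))        # exception level
print(hyphenate("rec\u00adord")) # local level
```

Note that nothing in this structure can express a spelling change: every level only chooses positions inside the word as written.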
From an i18n perspective, the patterns and (at the very least) the exception dictionary ought to allow not only insertion of hyphens, but also spelling changes to be specified. The examples given so far in this thread may not be convincing, but if it is true that l·l should in general hyphenate as l-l in Catalan, this certainly is an important problem for that language, and there are probably many similar issues in other languages that we just do not know about. It seems that Prince currently uses TeX patterns, but no exception dictionary, and allows local encoding of hyphenation points (&shy;), but not spelling changes.

There are a few additional caveats. For instance, it is not entirely obvious what should be considered to be a `word' or which characters should be allowed in a `word' (given that only `words' can be hyphenated using this kind of algorithm). TeX uses `category codes' to define letters, and Unicode's character classes give a good approximation, but they cannot be redefined to deal with specific issues. In Italian, for instance, dell'opera should be hyphenated dell'o- pera, but opera should not be hyphenated o-pera. (The particular example may be wrong, but the principle is correct.) Unless the apostrophe is considered to be a `letter' (a constituent of a `word'), correct patterns do not help, as `dell'opera' will not be considered as one unit during hyphenation-point look-up.

Another example worth mentioning is that Polish and a few other languages apparently require a hyphenated word like xxx-yyy to be hyphenated xxx- -yyy (with an extra hyphen carried over). A truly flexible system would make it possible to specify, e.g., which non-letters to treat as part of words and which to give special treatment. (As we all know, TeX hyphenates xxx-yyy as xxx- yyy; in addition, the hyphen prohibits xxx and yyy from being hyphenated, which may or may not be suitable depending on, e.g., column width.)

How does Prince deal with these issues?
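The two caveats above can be illustrated with a small Python sketch. Both functions and their parameters (extra_letters, repeat_hyphen) are hypothetical names of mine, meant only to show what a configurable system would need to expose; they do not describe how TeX or Prince actually work.

```python
def candidate_words(text, extra_letters=""):
    """Yield maximal runs of letters, plus any characters in extra_letters.

    Only these runs would be handed to the pattern look-up, so the choice
    of extra_letters decides what counts as one `word'.
    """
    run = []
    for ch in text:
        if ch.isalpha() or ch in extra_letters:
            run.append(ch)
        elif run:
            yield "".join(run)
            run = []
    if run:
        yield "".join(run)


def break_at_explicit_hyphen(word, repeat_hyphen=False):
    """Split at the first explicit hyphen into (end of line, start of next line).

    With repeat_hyphen=True the hyphen is carried over to the next line,
    as described for Polish. Returns None if the word has no hyphen.
    """
    head, sep, tail = word.partition("-")
    if not sep:
        return None
    return head + "-", ("-" if repeat_hyphen else "") + tail


# Apostrophe as a non-letter: two separate units reach the hyphenator.
print(list(candidate_words("dell'opera")))
# Apostrophe as a word constituent: patterns can see the whole unit.
print(list(candidate_words("dell'opera", extra_letters="'")))
print(break_at_explicit_hyphen("xxx-yyy"))
print(break_at_explicit_hyphen("xxx-yyy", repeat_hyphen=True))
```

The point is that both choices are per-language data, so they belong in the same external format as the patterns rather than in browser code.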
On 9 Jan 2007, at 6:22PM, Henri Sivonen wrote:

> * Prince seems to be doing exactly the right thing: control overall hyphenation
> with CSS, honor soft hyphens and support TeX-compatible language-specific
> dictionaries.

Yes, although certain details could be improved slightly to work better with foreign languages.

> * The Swedish and Dutch examples given in this thread seem to be addressable
> with language-specific dictionaries.

They would be, given a suitable pattern and/or dictionary format.

> the interaction of the diaeresis with hyphenation may even be a generalizable rule
> that could be hard-coded in Dutch-aware hyphenating browsers.

Hard-coding such details, as opposed to defining a proper format that allows such things to be specified easily, strikes me as a bad idea.

> it looks like the Swedish rule is generalizable so that a hyphenator wouldn't even
> need a list of all possible compound words but a dictionary of simple words that
> can be part of a compound would suffice.

Well, yes, but (after verification) `tuggummi' is really composed of the verb `tugga' (with a final -a) and the noun `gummi'. The -a ending disappears in the compound, and `tugggummi' turns into `tuggummi' because triple consonants are not allowed. Hard-coding this level of detail into browsers is probably not ideal. Moreover, a German (alte Rechtschreibung, i.e. pre-reform spelling) word like `Bettuch' should be hyphenated `Bet-tuch' or `Bett-tuch' depending on the intended meaning. (I do not argue that such very particular cases require a new HTML element to be added immediately.)

> * Not having a language-specific dictionary available in a browser doesn't make
> things worse than the status quo, so it isn't that big a deal.

Hyphenating using `generic' (US English) rules would actually be worse than abstention, but this is probably not what you mean.

> * Hand-coders wouldn't bother to type hyphenation data for everything every time.
> * It is unlikely that authoring tools would opt to dump their hyphenation data
> in documents

I suppose so. An external format would be preferable.

> * All the languages cited as requiring spelling changes are written using the Latin script.

This may well be due to my and others' cultural bias.

On 11 Jan 2007, at 12:50AM, Sander Tekelenburg wrote:

> FWIW, my feeling is that it would be best if there'd be a defined format for
> hyphenation rules, and browsers would accept such description files as a
> plug-in.

On 11 Jan 2007, at 1:19AM, Håkon Wium Lie wrote:

> This format exists. It was pioneered by TeX and is now widely used by
> other applications.

You seem to be referring to TeX's hyphenation patterns, which are only one (important) part of TeX's hyphenation system. The missing parts need to be defined somehow, and a certain generalisation would be welcome, as discussed above.

-- 
Øistein E. Andersen
Received on Thursday, 11 January 2007 05:00:16 UTC