Re: [CSS3 Text] 4.2. Hyphenation from Jukka K. Korpela on 2007-08-09 (www-style@w3.org from August 2007)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Thu, 9 Aug 2007 17:36:27 +0300 (EEST)
To: www-style@w3.org
Message-ID: <Pine.SOC.4.64.0708091714050.8708@mustatilhi.cs.tut.fi>

On Thu, 9 Aug 2007, Niklas Åkerlund wrote:

> I figure automatic Hyphenation is a hard thing to implement.

It is, but it has been implemented in many programs for several languages. 
There are problems of varying difficulty, depending on language and on the 
goals, i.e. the desired quality of hyphenation.

> basically need to check each word that's about to wrap if it can be
> broken up. And if so, keep the piece(s) that fits before wrapping.

Well, yes.

> To do that, you'd need a dictionary/list(s) of all words in the
> specific language(s) used that can be broken up and where the break(s)
> should be.

No, this depends on the language, and you cannot cover all the words in a 
language, since a language has a potentially infinite number of words.

For some languages, hyphenation can be performed mostly algorithmically
(like "break before the last consonant in a consonant cluster inside a 
word") though special cases may require special treatment.
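
To make the quoted rule concrete, here is a minimal sketch in Python. It 
assumes a Finnish-like vowel set and purely alphabetic input; the names 
VOWELS, hyphenation_points and hyphenate are made up for this illustration, 
and real hyphenators need the special-case handling mentioned above.

```python
# Assumption: a Finnish-like vowel inventory; every other letter is a consonant.
VOWELS = set("aeiouyäö")

def hyphenation_points(word):
    """Return indices where a break may go: before the last consonant of
    each consonant cluster that has a vowel on both sides."""
    w = word.lower()
    points = []
    for i in range(1, len(w) - 1):
        # w[i] is a consonant directly followed by a vowel,
        # so it is the last consonant of its cluster.
        if w[i] not in VOWELS and w[i + 1] in VOWELS:
            # Require a vowel earlier in the word, so that a
            # word-initial cluster is never split off.
            if any(ch in VOWELS for ch in w[:i]):
                points.append(i)
    return points

def hyphenate(word, mark="-"):
    """Show every break point found by hyphenation_points()."""
    pieces = []
    prev = 0
    for p in hyphenation_points(word):
        pieces.append(word[prev:p])
        prev = p
    pieces.append(word[prev:])
    return mark.join(pieces)
```

With this sketch, hyphenate("kissa") gives "kis-sa" and hyphenate("talvi") 
gives "tal-vi", while "strand" is left unbroken because its only 
vowel-adjacent cluster is word-initial.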

On the other hand, for some languages, reasonable results can be achieved 
using a relatively short list of common long words. We need not hyphenate 
everything; just breaking some long words might be OK, if words are 
generally short.
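
The short-list approach could look roughly like the following sketch. The 
table LONG_WORDS and the function break_word are hypothetical, and the three 
entries are only illustrative; a real list would be language-specific and 
longer.

```python
# Hypothetical hand-made list: lowercase word -> form with permitted breaks.
LONG_WORDS = {
    "hyphenation": "hy-phen-ation",
    "implementation": "im-ple-men-ta-tion",
    "typographic": "ty-po-graph-ic",
}

def break_word(word, room):
    """Split `word` into (head, tail) so that head, including its hyphen,
    fits in `room` characters. Return None if the word is not listed or
    no listed break fits; the caller then wraps the whole word."""
    pattern = LONG_WORDS.get(word.lower())
    if pattern is None:
        return None
    best = None
    pos = 0
    # Walk the listed break points left to right, keeping the last that fits.
    for part in pattern.split("-")[:-1]:
        pos += len(part)
        if pos + 1 <= room:  # +1 for the visible hyphen
            best = pos
    if best is None:
        return None
    return word[:best] + "-", word[best:]
```

For example, break_word("implementation", 6) returns ("imple-", "mentation"), 
and any word missing from the list simply is not broken.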

If hyphenation should consider _all_ the possible hyphenation points, then 
things get difficult. Things get even more difficult if typographic 
quality should be considered too, e.g. the principle of avoiding 
hyphenation that leaves just a few letters on the last line of a 
paragraph. Moreover, different possible hyphenation points have different 
acceptability; e.g., a compound word should primarily be hyphenated at the 
component boundary. There's little one can say about such issues in CSS 
even in principle, though imaginably there might be a property that 
indicates the desired quality of hyphenation. (Asking for best quality is 
not always best, since quality may have a high cost in terms of processing 
time, especially if it implies that the browser needs to download extra 
software over the network.)
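
As a rough sketch of what "different acceptability" might mean in practice, 
assume each candidate break carries a penalty, with 0 for a compound 
boundary and larger values for less desirable points. The function 
choose_break, the penalty scale and the min_tail limit are all assumptions 
made for this example; min_tail is a simplification that only looks at the 
fragment carried to the next line, not at paragraph endings.

```python
def choose_break(word, candidates, room, min_tail=3):
    """Pick a break position for `word`.

    candidates: list of (position, penalty) pairs; lower penalty is better.
    room:       characters left on the line, hyphen included.
    min_tail:   reject breaks that carry fewer letters than this over.
    Returns the chosen position, or None if no candidate is usable.
    """
    usable = [
        (penalty, -pos)
        for pos, penalty in candidates
        if pos + 1 <= room and len(word) - pos >= min_tail
    ]
    if not usable:
        return None
    # Lowest penalty wins; among equals, the later break fills the line better.
    penalty, neg_pos = min(usable)
    return -neg_pos
```

With candidates [(5, 0), (2, 1)] for "waterfall", a line with 9 characters 
of room gets the break at the component boundary (position 5), while a line 
with only 4 characters of room falls back to position 2.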

> I'd rather suggest that browsers contains built in dictionaries,

That's not feasible. Browsers should be able to invoke language-specific 
hyphenation software, and they could even use plugins loaded from the net.

> Manual(or server side implemented) would be alot easier to implement
> for browsers. Text would simply contain hyphenations from the start.

That's possible, of course. You can do that now if you like and you don't 
worry about browsers (mainly Firefox) that don't implement SOFT HYPHEN
yet. It's been possible for a long time. How many authors have used this 
possibility? Not too many. And it has practical problems at present, since 
e.g. Google does not seem to handle the soft hyphen properly: Google 
effectively treats it as a word separator.
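
For the server-side approach, a minimal sketch might insert U+00AD SOFT 
HYPHEN at known break offsets before the text is sent. The names 
add_soft_hyphens and strip_soft_hyphens and the shape of the `patterns` 
table are assumptions for illustration; strip_soft_hyphens shows the kind of 
normalization an indexer would need in order not to see the pieces as 
separate words.

```python
import re

SHY = "\u00ad"  # U+00AD SOFT HYPHEN

def add_soft_hyphens(text, patterns):
    """Insert soft hyphens into words listed in `patterns`.

    patterns: dict mapping a lowercase word to a list of break offsets,
              e.g. {"hyphenation": [2, 6]} (hypothetical data).
    """
    def repl(match):
        word = match.group(0)
        offsets = patterns.get(word.lower())
        if not offsets:
            return word
        out, prev = [], 0
        for off in offsets:
            out.append(word[prev:off])
            prev = off
        out.append(word[prev:])
        return SHY.join(out)

    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    return re.sub(r"[^\W\d_]+", repl, text)

def strip_soft_hyphens(text):
    """Remove soft hyphens, e.g. before indexing or comparing words."""
    return text.replace(SHY, "")
```

Words that are absent from the table pass through unchanged, so the author 
decides how much of the text gets marked up.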

> However, to keep the specs consistent, a hyphenation character like
> u+200B should only show in preformated text. Just like the newline
> character.

The U+200B character, ZERO-WIDTH SPACE (ZWSP), has nothing to do with 
hyphenation. It allows a line break _without_ adding any kind of a hyphen 
or other indicator at the end of a line. In theory, it is suitable for 
making non-word strings like URLs or some formulas breakable.
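
As an illustration of that use, the following sketch inserts U+200B after 
the delimiters of a URL so that a line may break there without any hyphen 
appearing. The function name breakable_url and the choice of delimiters 
(/, ?, & and =) are assumptions, and www.example.com is just a placeholder.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

def breakable_url(url):
    """Allow a line break after each run of '/', '?', '&' or '='.

    The lookahead requires a following non-delimiter character, so nothing
    is inserted inside a run such as '//' or after a trailing slash.
    """
    return re.sub(r"([/?&=]+)(?=[^/?&=])", r"\1" + ZWSP, url)
```

For "http://www.example.com/a/b" this yields break opportunities after 
"http://", after "www.example.com/" and after "a/". The result is meant for 
display only; the inserted characters must not end up in the link target.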

> In html, the BR tag breaks lines. In accordance to this, Mozilla and
> IE implements the non-standard WBR to provide break points inside
> words.

The <wbr> has nothing to do with hyphenation. It's like U+200B except that 
it mostly works and does no harm when it doesn't.
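
The same idea with markup instead of a character could be sketched as 
follows. The function url_with_wbr is hypothetical; it assumes Python 3.7 or 
later (for splitting on an empty match) and escapes the URL pieces so that 
they are safe as HTML text.

```python
import html
import re

def url_with_wbr(url):
    """Return HTML text for `url` with <wbr> after each run of slashes."""
    # Split at the zero-width position after a '/' and before a non-'/'.
    pieces = re.split(r"(?<=/)(?=[^/])", url)
    return "<wbr>".join(html.escape(p) for p in pieces)
```

A browser that does not know <wbr> just ignores the tag, which is the "does 
no harm" part.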

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Thursday, 9 August 2007 14:38:30 UTC