[whatwg] Hyphenation

On Jan 9, 2007, at 01:02, Øistein E. Andersen wrote:

> In summary, hyphenation is a hard problem: breaking points cannot in
> general be established algorithmically; hyphenation dictionaries are
> not always available and typically do not contain long/rare/complex
> words (the ones that really need to be hyphenated); furthermore,
> distinct words may be spelt identically, but still need to be
> hyphenated differently; and several languages require spelling
> changes when words are hyphenated ([3] mentions Dutch, German (alte
> Rechtschreibung), Spanish, Norwegian, Swedish and Hungarian).

My initial thoughts:

  * Prince seems to be doing exactly the right thing: control overall  
hyphenation with CSS, honor soft hyphens and support TeX-compatible  
language-specific dictionaries.

  * The Swedish and Dutch examples given in this thread seem to be  
addressable with language-specific dictionaries.

  * Not knowing Dutch, the example makes me guess that the diaeresis  
in Dutch has the same meaning as in French (indicating that adjacent  
vowels don't form a diphthong). If this is the case, the interaction  
of the diaeresis with hyphenation may even be a generalizable rule  
that could be hard-coded in Dutch-aware hyphenating browsers. Is it a  
generalizable rule?

  * Knowing a bit of Swedish, I really have a hard time taking seriously  
the notion of Swedish requiring new markup to be introduced to HTML.  
The sky won't fall if a browser doesn't know how to hyphenate Swedish  
chewing gum in the absence of a hyphenation dictionary. (Besides, it  
looks like the Swedish rule is generalizable so that a hyphenator  
wouldn't even need a list of all possible compound words but a  
dictionary of simple words that can be part of a compound would  
suffice.)

  * Not having a language-specific dictionary available in a browser  
doesn't make things worse than the status quo, so it isn't that big a  
deal.

  * Hand-coders wouldn't bother to type hyphenation data for  
everything every time. (TeX users run the typesetting step themselves  
whereas HTML is rendered elsewhere. TeX users only tend to  
micromanage the words that they see didn't typeset nicely.)

  * It is unlikely that authoring tools would opt to dump their  
hyphenation data into documents, even if their data were already in  
whatever format was required.

  * All the languages cited as requiring spelling changes are written  
using the Latin script. The Latin script has a long cultural  
tradition of adapting to writing technology: from chiseled marble to  
quills to movable type to typewriters to computer displays.  
Therefore, I don't find it unreasonable to suggest adapting to the  
limitations of the medium here.
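
To make the Swedish point above more concrete, here is a rough Python  
sketch of a hyphenator that needs only a list of simple words. It  
assumes the chewing gum example is "tuggummi" ("tugg" + "gummi"), where  
the written compound drops one of three identical consonants and the  
hyphenated form restores it ("tugg-gummi"). The names SIMPLE_WORDS,  
split_compound and hyphenate_compound are made up for illustration, and  
the word list is a toy; this is not how Prince or any browser does it.

```python
# Toy word list (an assumption for illustration, not real dictionary data).
SIMPLE_WORDS = {"tugg", "gummi", "sommar", "hus"}
SWEDISH_VOWELS = set("aeiouyåäö")


def split_compound(word, simple_words=SIMPLE_WORDS):
    """Return (head, tail) if word is a compound of two known simple words.

    Also handles the case where the written compound dropped one of three
    identical consonants at the boundary. Returns None if no split is found.
    """
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if head not in simple_words:
            continue
        # Plain compound: both parts are known as written.
        if tail in simple_words:
            return head, tail
        # Head ends in a doubled consonant: the tail may have lost the
        # same consonant when the compound was written as one word.
        last = head[-1]
        if head[-2] == last and last not in SWEDISH_VOWELS:
            restored = last + tail
            if restored in simple_words:
                return head, restored
    return None


def hyphenate_compound(word, simple_words=SIMPLE_WORDS):
    """Return the word hyphenated at the compound boundary, or unchanged."""
    parts = split_compound(word, simple_words)
    if parts is None:
        return word  # No dictionary hit: leave the word unbroken.
    return parts[0] + "-" + parts[1]
```

With the toy list, hyphenate_compound("tuggummi") restores the dropped  
consonant, a plain compound such as "sommarhus" splits as written, and  
an unknown word is simply left unbroken, which is no worse than the  
status quo.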

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

Received on Tuesday, 9 January 2007 10:22:28 UTC