- From: Liam R E Quin <liam@w3.org>
- Date: Sat, 23 Mar 2013 00:22:18 -0400
- To: www-style list <www-style@w3.org>
It's a mistake to define hyphenation as in CSS Text Level 3 without saying how hyphenated words behave. Luckily some of the necessary text has already been written for CSS Text Level 4.

Some things that need to be clarified:

. hyphenation is a property of rendering, not of the DOM (disregarding shadow DOMs for a moment): a search for "barefoot" must work even if the word has been hyphenated as bare- foot, and if the text is reflowed, e.g. because of a change in viewport size, the word may be hyphenated differently or not at all in subsequent renderings.

. soft hyphen characters must not affect search: they are to be ignored in both search strings and document text.

. ASCII hyphen ("-") can be used as a break character, as can the soft hyphen. Breaking at a soft hyphen (U+00AD) must insert "-" from the current font. Note: level 4 proposes a custom hyphenation character. I think a selector approach might be better, as then colour and/or an image could be used, e.g. a picture of a curved arrow in code listings, offset from the text.

. A renderer is never required to hyphenate, even if a single "word" is longer than the available space. Existing overflow strategies can be used.

. A user agent or renderer MAY add a preference to allow users to enable hyphenation by default for any text in their language, or any text not specifically marked for language; there should also be an option to disable hyphenation altogether.

The next step (level 4) should include a hyphenation exception mechanism. A way to use TeX pattern files may also be useful, but today hyphen.js, hyphenate.js etc. can add soft hyphens at every break point for many languages (not German!), and can work around incompatible browser behaviour with respect to soft hyphens, searching, reflowing text etc.

What I'm trying to do here is (1) push for higher quality, and (2) push for higher interoperability.
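
The search requirement in the list above (soft hyphens ignored in both the search string and the document text) can be illustrated with a minimal Python sketch. This is not taken from any specification or browser; the function names are hypothetical, and it only assumes that U+00AD is the soft hyphen and that a plain substring match is wanted.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen; an ordinary ASCII "-" is left alone."""
    return text.replace(SOFT_HYPHEN, "")


def find_ignoring_soft_hyphens(document: str, query: str) -> int:
    """Return the index in the ORIGINAL document where the match starts,
    or -1 if there is no match. Soft hyphens on either side are ignored.
    (Hypothetical helper, for illustration only.)"""
    needle = strip_soft_hyphens(query)
    if not needle:
        return 0  # an empty query matches at the start, like str.find

    # Build the stripped text, remembering where each kept character
    # came from so the result can be reported in original coordinates.
    kept_chars = []
    original_index = []
    for i, ch in enumerate(document):
        if ch != SOFT_HYPHEN:
            kept_chars.append(ch)
            original_index.append(i)

    pos = "".join(kept_chars).find(needle)
    return -1 if pos < 0 else original_index[pos]


if __name__ == "__main__":
    doc = "walking bare\u00adfoot home"
    print(find_ignoring_soft_hyphens(doc, "barefoot"))
```

Keeping the index map means a caller can still highlight the match in the unmodified text, which matters because the soft hyphens stay in the document and only the comparison ignores them.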
Right now hyphenation tends to break stuff or to behave too differently across browsers for even the JavaScript shims to be acceptable. Accessibility can also suffer, e.g. soft hyphens are said to be (incorrectly) rendered as spaces in some browsers. So we need to give more guidance (unless my experiments and the research I did are out of date, which is always possible since I blinked in the meantime!).

Liam

PS: TeX's hyphenation algorithm is not the best (as even the TeXbook acknowledges). TeX is not considered a "high end" formatter by people who do large amounts of batch/unattended high-quality formatting, and its poor hyphenation algorithm and its unacceptable treatment of corner cases are a large part of the reason; TeX is fine, and indeed excels, in semi-automated formatting, e.g. for research papers, where the author will correct problems. The advantage of the TeX pattern algorithm and interchange format (which seems to be closely modelled on the older troff algorithm) is that it's widely described and is much more compact than the dictionary-based systems. The best results for most Western languages come from a mix of an algorithm and a dictionary; some languages, such as Thai and German, are much harder than others.

PPS: I wrote a lot more detailed comments on hyphenation and line breaking but am guessing that I need to save them until the WG has cycles to process them for level 4. But the comments here apply to level 3.

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
Received on Saturday, 23 March 2013 04:22:20 UTC