[css3-text][css3-gcpm] Word Breaking / Hyphenation from Christoph Päper on 2009-04-10 (www-style@w3.org from April 2009)

From: Christoph Päper <christoph.paeper@crissov.de>
Date: Fri, 10 Apr 2009 21:14:51 +0200
To: CSS 3 W3C Group <www-style@w3.org>
Message-Id: <990501A1-0409-40B3-A3DB-06543E349E35@crissov.de>
While the |word-break| property is at risk for level 3 of the CSS  
Text module
   <http://www.w3.org/TR/css3-text/#line-breaking> /
   <http://dev.w3.org/csswg/css3-text/#line-breaking>
and hyphenation, currently located in the all-purpose module GCPM
   <http://www.w3.org/TR/css3-gcpm/#hyphenation> /
   <http://dev.w3.org/csswg/css3-gcpm/#hyphenation>,
is quite open to discussion, I would like to raise some points.

== Kinds of word breaking ==
The current hyphenation proposal distinguishes in its |hyphens|  
property three kinds of hyphenation: never ('none'), as indicated  
('manual') and by algorithm or database ('auto').

For some languages, such as German, or scripts, probably CJK, there
is a state between 'manual' and 'auto' that would be useful: breaking
only at compound boundaries. It is less complex to implement than a
fully syllabic or morphemic algorithm (at least for alphabetic
scripts), and it is sometimes the preferred style in titles or headings.

   Zeilen-|Trennung     manual break point (|)
   Zeilen<ZW>|trennung  manual break point
   Zeilen<SH>·trennung  manual hyphenation point (·)
   Zeilentrennung       lexemic, keep words together, no hyphenation or breaks
   Zeilen·trennung      sememic, hyphenate compounds
   Zeil·en·trenn·ung    morphemic, hyphenate at grammatical boundaries
   Zei·len·tren·nung    syllabic, hyphenate at articulatory boundaries
   Z.ei.l.e             graphemic, keep diphthongs, ligatures etc. together
   Z.e.i.l.e            graphetic, split between characters/letters
   Z.e.ı.°.l.e          glyphic, decompose characters if possible/necessary

Morphemic and syllabic hyphenation should be mutually exclusive: a
language usually uses one or the other, though some do not make a
clear choice. Manual indications and compound boundaries usually
remain the preferred break points, so a hierarchy can be generalized
through all levels. Breaking at graphemes or graphs (glyphs) is a
last resort for alphabetic scripts and is mostly applied to non- or
para-words, which may prefer splitting over hyphenating. Breaking or
splitting can be seen as hyphenation without a visible hyphen, or the
other way around.
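
To make the hierarchy idea a bit more concrete, here is a minimal
Python sketch. It assumes a word comes with hand-annotated break
offsets, each tagged with the coarsest level at which it applies. The
names RANK, POINTS and mark_breaks are made up for illustration, and
only the manual, compound and syllabic levels are modelled.

```python
# Hypothetical ranking of break levels, coarsest first (an assumption,
# not part of any draft). Morphemic breaks would take the place of
# syllabic ones in languages that prefer them.
RANK = {"manual": 0, "compound": 1, "syllable": 2}

# Hand-annotated break offsets for "Zeilentrennung" (illustrative data).
POINTS = {6: "compound", 3: "syllable", 10: "syllable"}


def mark_breaks(word, points, finest, mark="\u00b7"):
    """Insert `mark` at every break point allowed down to level `finest`."""
    allowed = {off for off, kind in points.items() if RANK[kind] <= RANK[finest]}
    out = []
    for i, ch in enumerate(word):
        if i in allowed:
            out.append(mark)
        out.append(ch)
    return "".join(out)


print(mark_breaks("Zeilentrennung", POINTS, "compound"))
print(mark_breaks("Zeilentrennung", POINTS, "syllable"))
```

Asking for a finer level always keeps the coarser break points, which
is the sense in which the levels form a hierarchy.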

These types (or a sensible selection thereof) could be mapped onto  
separate properties (|*-break| perhaps) or onto respective values of  
one property (|hyphenation|, |word-break| or whatever).

== CJK word breaking ==
Contrary to intuition, |word-break| seems only ever useful if one is
writing text with East Asian square "morphograms", perhaps with
intertwined alphabetic words. Maybe it is also useful for stuff like
URLs inside alphabetic text, but that rather seems the domain of
|text-wrap| and |word-wrap|. Several other properties only make sense
for (European) alphabetic scripts, so that is not an issue by itself,
but perhaps it would be better to do this kind of script-dependent
styling with the |:lang()| or a new |:script()| pseudo-class selector
instead (using ISO 15924 four-letter or three-digit codes).

                                   CJK
                   +----------------+----------------+
                   | strict         | loose          |
          +--------+----------------+----------------+
          | strict | 'normal'       | 'loose'        |
Other     |        | ('keep-all')   |                |
scripts   +--------+----------------+----------------+
          | loose  | 'break-strict' | 'break-all'    |
          +--------+----------------+----------------+
          Table 1: Current draft for |word-break|

With |:script()| this could be simplified and be more flexible:

     * {word-break: normal;}
  => * {word-break: normal;}

     * {word-break: keep-all;}
  => * {word-break: normal;}
     :script(Hani), :script(Jpan)
       {word-break: strict;} /* if I understand the intention correctly */

     * {word-break: loose;}
  => * {word-break: normal;}
     :script(Hani), :script(Jpan), :script(Kore)
       {word-break: loose;}

     * {word-break: break-strict;}
  => * {word-break: loose;}
     :script(Hani), :script(Jpan), :script(Kore)
       {word-break: normal;}

     * {word-break: break-all;}
  => * {word-break: loose;}
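
The rewrites above are mechanical enough to put in a lookup table. The
following Python sketch only transcribes them. |:script()| is still
the hypothetical selector proposed in this message, and REWRITES and
rewrite are names invented for the example.

```python
# Selector list used by most of the rewrites above.
CJK = ":script(Hani), :script(Jpan), :script(Kore)"

# draft value -> (value for '*', override selector, override value).
# The 'keep-all' row lists only Hani and Jpan, exactly as written above.
REWRITES = {
    "normal": ("normal", None, None),
    "keep-all": ("normal", ":script(Hani), :script(Jpan)", "strict"),
    "loose": ("normal", CJK, "loose"),
    "break-strict": ("loose", CJK, "normal"),
    "break-all": ("loose", None, None),
}


def rewrite(value):
    """Return the :script()-based rules for one draft |word-break| value."""
    base, selector, override = REWRITES[value]
    rules = [f"* {{word-break: {base};}}"]
    if selector is not None:
        rules.append(f"{selector} {{word-break: {override};}}")
    return "\n".join(rules)


print(rewrite("break-strict"))
```

With such a table the five draft keywords reduce to three values
('normal', 'strict', 'loose') plus a selector.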

If ISO 15924 introduced more general aliases based on script features  
(e.g. 'Logo' or 'Sylb' and 'Alph') the selectors could become easier  
and of course you could also write them the other way around using  
|:not()|.

== General breaking ==
From an author's perspective it might be nice to have text breaking
and hyphenation work similarly to page (and column) breaking. We shall
only be dealing with the "inside" variant here, so we may drop that
name particle and inherit its values: 'auto' (~= allow) and 'avoid'.

   line-break: auto | avoid | none;
   /* ~= text-wrap, for whitespace treatment */

     text-wrap: normal;
  ~> line-break: auto;

     text-wrap: suppress;
  ~> line-break: avoid;

     text-wrap: none;
  ~> line-break: none;

     text-wrap: unrestricted;
  ~> line-break: auto; character-break: auto; hyphenation: none;

What is a /word/ in the CSS (or Unicode) sense? Any string of
characters bordered by whitespace or punctuation marks, or any lexeme?

   word-break: none | manual | compound | _auto_ | syllable | character;
or
   compound-break:  _auto_ | avoid; /* sememic */
   word-break:      _auto_ | avoid; /* syllabic/morphemic */
   syllable-break:  auto | _avoid_; /* graphe(m/t)ic */
   character-break: auto | _avoid_; /* often not possible at all */

Hyphenation control has to be set separately, e.g. which string to
use, if any (with no string, the word is split rather than hyphenated).

     word-wrap: normal;
  ~> *-break: auto;

     word-wrap: break-word;
  ~> word-break: character; hyphenation: ""; /* 1 word break property */
  ~> character-break: auto; hyphenation: ""; /* n break properties */

     * {word-break: normal;}
  ~> * {word-break: syllable;}                     /* 1, default */
  ~> * {word-break: auto; compound-break: auto;}   /* n, defaults */

     * {word-break: keep-all;}
  ~> * {word-break: none;}                         /* 1 */
  ~> * {compound-break: avoid;}                    /* n */

     * {word-break: loose;}
  ~> :script(Hani), :script(Jpan), :script(Kore)   /* 1 */
       {word-break: character;}
  ~> :script(Hani), :script(Jpan), :script(Kore)   /* n */
       {syllable-break: auto;}

     * {word-break: break-strict;}
  ~> * {word-break: character;}                    /* 1 */
     :script(Hani), :script(Jpan), :script(Kore)
       {word-break: syllable;}                     /* default */
  ~> * {syllable-break: auto;}                     /* n */
     :script(Hani), :script(Jpan), :script(Kore)
       {syllable-break: avoid;}                    /* default */

     * {word-break: break-all;}
  ~> * {word-break: character;}                    /* 1 */
  ~> * {syllable-break: auto;}                     /* n */
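
One possible reading of how the "1" and "n" variants above relate:
walk the per-level properties from coarse to fine and stop at the
first 'avoid'. The Python sketch below assumes exactly that. The
stop-at-first-avoid rule is my interpretation of the pairs above, and
CHAIN and single_value are made-up names. |character-break| is left
out because it has no single-value counterpart in the list.

```python
# (property, default, single-property keyword it unlocks when 'auto').
# Defaults follow the underscored values in the n-property proposal.
CHAIN = [
    ("compound-break", "auto", "compound"),
    ("word-break", "auto", "syllable"),
    ("syllable-break", "avoid", "character"),
]


def single_value(declared):
    """Collapse n per-level declarations into one |word-break| keyword."""
    result = "none"
    for prop, default, keyword in CHAIN:
        if declared.get(prop, default) == "avoid":
            break  # assumption: a coarser 'avoid' also blocks finer levels
        result = keyword
    return result


print(single_value({}))
print(single_value({"syllable-break": "auto"}))
```

Under that reading the defaults collapse to 'syllable',
|compound-break: avoid| to 'none', and |syllable-break: auto| to
'character', which matches the pairs listed above.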

Yeah, well, it's not perfect at all, I just wanted to provide  
something to think about.
Received on Friday, 10 April 2009 19:14:43 UTC