Re: CSS3 Text: Line-breaking Properties

Jukka K. Korpela wrote:
> On Mon, 21 Apr 2003, fantasai wrote:
> 
> 
>>  # In the most general case, (assuming no hyphenation dictionary is
>>  # available to the UA), a line break can occur only at white space
>>  # characters or hyphens, including U+00AD SOFT HYPHEN.
>>
>>This doesn't seem to match UAX 14.
> 
> In what sense?

In the sense that UAX 14 allows line breaks at places other than
white space characters or hyphens.


> The rules _permit_ line breaks at certain points but do not require any
> particular behavior.

Do not specify at which point to break, you mean.

> Surely the idea is that the default rules can be applied with discretion,
> using various criteria to prevent line breaks where UAX 14 would allow
> them, and applying additional line breaking principles when adequate.

Yes. See the introduction to UAX 14
  <http://www.unicode.org/reports/tr14/#Introduction>
specifically:

    | The definition of optimal line break is outside the scope of
    | this document. Different formatting algorithms may use different
    | methods of determining an optimal break. For example, simple
    | implementations just consider a line at a time, trying to find a
    | locally optimal line break....
    |
    | More complex algorithms may take into account the interaction of
    | line breaking decisions for the whole paragraph. The well known
    | text layout system [TEX] implements an example of such a globally
    | optimal strategy that may make complex tradeoffs across an entire
    | paragraph to avoid unnecessary hyphenation and other legal, but
    | inferior breaks...

I don't think it's within the scope of CSS to define an optimal line-breaking
algorithm. It is, however, reasonable for CSS to control the level of
strictness in line breaking.

> Presumably "normal" is supposed to be the initial value, and I strongly
> disagree. What you describe as "strict" is what dominated on the Web for
> years and is easily understood, except for the zwsp part. It should be the
> default, and the UAX 14 based method should have a name that clearly
> reflects its definition, like "unicode-line-breaking".

What I describe as 'strict' could easily be a UA's 'normal' behavior.
'normal', however, allows the UA more freedom to define its algorithm,
as long as it keeps within the limits set by UAX 14. The UA could, for
example, use an algorithm which ranked line breaking opportunities and
chose an optimal breaking point based on the break opportunity's rank
and its distance from the end of the line.

As a simplistic example, let us define an algorithm which only allows
breaks after spaces and after hyphens.

  When determining where to break, our algorithm assigns a weight to each
  break point, where the weight is given by

     weight = (rank) + (#characters from edge)

  The point after a space is ranked 0, the point after a hyphen ranked 5.

  The optimal break point is the one with the lowest weight. Only points
  *before* the edge are considered. (The edge is where the line would break
  if we could break anywhere.)

Let's try two examples. We have a sequence of text:

                a            b                       c   d
  some text with longwordsand-hyphens and short words and-hyphens etc, etc...
                              ><                          ><

  If the edge occurred at the first mark (between the first 'h' and 'y'),
  the point after the preceding hyphen would get a weight of
    w(b) = 1 + 5 = 6
  The point after the space before that hyphen would get a weight of
    w(a) = 14 + 0 = 14

  The point after the hyphen has a lower weight, so we break at that point:
    some text with longwordsand- |
    hyphens and...               |

  If the edge occurred at the second mark (between the second 'h' and 'y'),
  the point after the preceding hyphen would get a weight of
    w(d) = 1 + 5 = 6
  and the point after the space before that would get a weight of
    w(c) = 5 + 0 = 5

  The point after the space has a lower weight, so we break there:
    some text with longwordsand-hyphens and short words      |
    and-hyphens etc, etc...                                  |
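
The weighting scheme above can be sketched in a few lines of Python. This is
only an illustration, not anything a spec defines: the names RANKS,
break_weights and best_break are made up for this sketch, a break index i
means "break between text[i-1] and text[i]", and two details the description
leaves open are filled in by assumption (a point exactly at the edge is
allowed, and ties go to the point nearer the edge).

```python
# Ranks from the example: a break after a space costs 0, after a hyphen 5.
RANKS = {" ": 0, "-": 5}

TEXT = ("some text with longwordsand-hyphens and short words "
        "and-hyphens etc, etc...")


def break_weights(text, edge):
    """Map each candidate break index to its weight.

    `edge` is the index where the line would break if we could break
    anywhere. weight = rank + number of characters between the break
    point and the edge.
    """
    weights = {}
    # Assumption: a break point sitting exactly at the edge is allowed.
    for i in range(1, edge + 1):
        preceding = text[i - 1]
        if preceding in RANKS:
            weights[i] = RANKS[preceding] + (edge - i)
    return weights


def best_break(text, edge):
    """Return the break index with the lowest weight, or None if no
    break point exists before the edge."""
    weights = break_weights(text, edge)
    if not weights:
        return None
    # Assumption: on a tie, prefer the break point closer to the edge.
    return min(weights, key=lambda i: (weights[i], -i))


# First mark: the edge falls between 'h' and 'y' of the first "hyphens".
first_edge = TEXT.index("hyphens") + 1
# Second mark: same position in the second "hyphens".
second_edge = TEXT.index("hyphens", first_edge) + 1

for edge in (first_edge, second_edge):
    cut = best_break(TEXT, edge)
    print(repr(TEXT[:cut].rstrip()))
# 'some text with longwordsand-'
# 'some text with longwordsand-hyphens and short words'
```

Changing the values in RANKS, or adding more break characters to it, is all
it takes to experiment with other rankings.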

This algorithm satisfies UAX 14: we're only breaking at points defined to
be valid break points by its requirements.

At the same time, our algorithm is more intelligent than simply breaking
at the first available opportunity.

You can, of course, extrapolate this to great complexity, and it will still
satisfy the requirements of 'normal' line breaking. However, it can't be
used for 'strict' line breaking because it allows breaking after a hyphen,
which 'strict' does not.


Do you still disagree that 'normal' should be the default?

~fantasai

Received on Saturday, 3 May 2003 02:40:03 UTC