CSS3, Unicode BIDI, and Vertical Text Layout

I've been working on a revision of CSS3 Text to clean up its handling of
vertical text layout. An explanation of the system, targeted primarily at
members of the Unicode mailing list, is up at

    http://fantasai.inkedblade.net/style/discuss/vertical-text/

I will be creating a separate writeup that discusses implementation concerns
and compatibility with W3C technologies.

The full text follows for archival and discussion purposes. If you want to
/read/ the document, you're best off with the HTML version; it has real
links and embedded diagrams.

Comments are welcome. CSS/W3C-specific comments should go to www-style@w3.org;
others should go to the unicode mailing list[1]. You can also send directly to
me (or the CSS WG mailing list if you're a W3C Member), but please post them
publicly so everyone can read. :)

~fantasai

[1] Subscription: http://www.unicode.org/consortium/distlist.html
     Archives: http://www.unicode.org/mail-arch/

###########################################################################

Robust Vertical Text Layout

     by fantasai <http://fantasai.inkedblade.net/contact>

    Few formatting systems today can handle vertical text layout, and most
    of those only lay out text in right-to-left columns. This document
    outlines a system that can not only handle common scripts in vertical
    right-to-left columns, but that can _gracefully_ accept uncommon
    script combinations and left-to-right text columns. The model is
    described here as a CSS system, but the concepts can apply to non-CSS
    systems as well.

    The CSS model and Unicode provide support for logical text layout, but
    only in horizontal flow. Although CSS3 Text attempts to use horizontal
    BIDI controls to handle vertical BIDI, the system it sets up is
    ill-defined and inflexible, relies on assumptions that may not hold
    true, and requires a styled document's content and its markup to be
    adapted to the CSS rather than the other way 'round. A better design
    would use the intrinsic properties of the characters and an expansion
    of Unicode's logic to lay out the text. A layout model thus based on
    the logic and knowledge of writing systems can scale to gracefully
    handle any combination of scripts, can correctly (if not optimally)
    lay out text with any combination of styling properties, and can
    integrate well with the layered Unicode + Markup + Styling model of
    semantically-tagged documents.

    The examples in this text require support for Unicode BIDI and Arabic
    shaping, and fonts for Simplified Chinese and Arabic/Farsi. Most
    diagrams are available in SVG, but inline versions are in PNG with
    fallbacks in GIF.

    Recommended browsers (recent versions):
      * Opera <http://www.opera.com/>
      * Gecko-based (such as Mozilla, Firefox, or Camino (Mac OS X))
        <http://www.mozilla.org/>
      * Microsoft Internet Explorer or Safari may suffice.

    More about Unicode fonts and other software
    <http://www.alanwood.net/unicode/>

     1. Background
          1. The "Cascade" in Cascading Style Sheets
          2. CSS and Unicode BIDI
     2. Abusing Directionality and Its Consequences: A Case Study of
        CSS3 Text
     3. Describing Text Flow
          1. Logical vs. Physical Description
          2. Intrinsic Directionality and Orientation
               1. Script Classification by Directionality
          3. Logical Text Flow
               1. Implying Direction
               2. Orienting by Block Progression
               3. The Three Switches of Logical Text Layout
     4. Implementing A Logical Text Layout System
          1. Composing Lines of Text
               1. Character Ordering
               2. Glyph Orientation
                    1. Vertical Scripts
                    2. Horizontal Scripts
                    3. Punctuation
               3. Character Shaping
          2. Understanding Character Characteristics
     5. Why and How the Unicode Consortium Should Be Involved
          1. What happens if Unicode chooses not to standardize the
             additional character data?
     6. About the Author and the Status of CSS3 Text
     7. Acknowledgements
     8. Appendix: Vertical Scripts in Horizontal Flow

Background

The "Cascade" in Cascading Style Sheets

    Unlike many formatting systems, in which styling properties are
    definitively applied to a page element at one point, CSS collects and
    applies to the element multiple style rules from the author, reader,
    and user agent. In case of a conflict, the origin of the rule and the
    specificity of the rule's element selector determine which of the
    conflicting property values takes effect on the element. This process
    of sorting and applying style rules is called cascading[1], and it
    allows style rules from multiple sources and with separate formatting
    purposes to interact in a rigorous way.

    [1] http://www.w3.org/TR/CSS21/cascade.html#cascade

    *Cascading means that style properties specified together are not
    guaranteed to take effect together.* This raises the design standards
    for creating CSS properties and pushes them towards a more logical,
    rather than physical, description of the intended design.
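
    The consequence can be illustrated with a minimal Python sketch. It is
    not the actual CSS cascade: it ignores !important, reduces specificity
    to a single integer, and the names Declaration and cascade are invented
    for this illustration. It shows two properties declared together in one
    rule being split apart because another rule wins for only one of them.

```python
from dataclasses import dataclass

# Simplified origin ranking for normal (non-!important) declarations.
ORIGIN_RANK = {"user-agent": 0, "user": 1, "author": 2}


@dataclass(frozen=True)
class Declaration:
    prop: str
    value: str
    origin: str       # "user-agent", "user", or "author"
    specificity: int  # hypothetical single-number stand-in for selector specificity
    order: int        # source order; later declarations win ties


def cascade(declarations):
    """Pick the winning value for each property independently."""
    winners = {}
    for d in declarations:
        key = (ORIGIN_RANK[d.origin], d.specificity, d.order)
        if d.prop not in winners or key > winners[d.prop][0]:
            winners[d.prop] = (key, d.value)
    return {prop: value for prop, (_, value) in winners.items()}


rules = [
    # One author rule sets both properties together...
    Declaration("block-progression", "lr", "author", 1, 0),
    Declaration("direction", "rtl", "author", 1, 0),
    # ...but a more specific author rule overrides only one of them.
    Declaration("block-progression", "tb", "author", 10, 1),
]
result = cascade(rules)
```

    Here 'direction: rtl' survives while its companion 'block-progression'
    value does not, which is exactly the kind of mismatch a property design
    has to tolerate.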

CSS and Unicode BIDI

    CSS2 introduced the direction and unicode-bidi properties to
    incorporate markup directives such as HTML's dir attribute into the
    CSS rendering model, and to allow the use of markup semantics in
    assigning BIDI embeddings. The direction property can take the values
    ltr and rtl, and this value inherits to descendant elements. The
    unicode-bidi property assigns embeddings and overrides in the
    direction given by the direction property. Its behavior is defined in
    terms of the Unicode embedding and override codes.

       /* map 'dir' attribute to 'direction' + embedding */
       *[dir="ltr"] {direction: ltr; unicode-bidi: embed; }
       *[dir="rtl"] {direction: rtl; unicode-bidi: embed; }
       /* embed quotations so they always stay as a single unit */
       q {unicode-bidi: embed;}

    When applied to a block of text, the direction property specifies the
    block's embedding direction; CSS documents do not use heuristics to
    guess the block's embedding direction.

    These properties are meant to reflect BIDI distinctions necessary for
    the proper ordering of text. Authors in general are discouraged[2]
    from using the properties in favor of the direct markup that would
    trigger the appropriate values.

    [2] http://www.w3.org/TR/i18n-html-tech-bidi/#ri20030728.092130948

Abusing Directionality and Its Consequences: A Case Study of CSS3 Text CR

    CSS3 Text was intended to update and expand the text layout
    capabilities of CSS2 by adding support for more international
    typesetting features and introducing controls for laying out vertical
    text. It defines a 'block-progression' property, which switches the
    line stacking direction, and hijacks 'rtl' and 'ltr' values of the
    'direction' property to use as an inline-progression control in
    vertical text.

    Cite: http://www.w3.org/TR/2003/CR-css3-text-2003051/#writing-mode

     |  writing-mode:  direction:  block-progression:  Common Usage:
     |  lr-tb          ltr         tb                  Latin-based, Greek, Cyrillic
     |                                                 (and many others)
     |  rl-tb          rtl         tb                  Arabic, Hebrew
     |  tb-rl          ltr         rl                  Chinese, Japanese, Korean
     |  tb-lr          rtl         lr                  Traditional Mongolian

    It is a good example of how _not_ to set up a vertical text system.

    In order to interface with the Unicode BIDI Algorithm[3], CSS3 Text
    maps vertical scripts' character directionality based on the
    paragraph's block progression.

    [3] http://www.unicode.org/reports/tr9/

    Because all vertical scripts in Unicode are assigned a canonical
    directionality of left-to-right, BIDI reordering proceeds as normal
    when the text columns are stacked right-to-left.

    However, if the columns of text are stacking the other way--from
    left to right--then the same characters (which so far are all given
    left-to-right directionality in Unicode) are treated as right-to-left
    characters (R). This was done because left-to-right scripts such as
    Latin read bottom to top when the lines of text are ordered left to
    right. You can see this often on image and table captions when the
    text runs along the left side. The first line of text runs from bottom
    to top, and lines stack with each one to the right of the previous
    (left to right). In this case, top-to-bottom scripts _must_ go in the
    direction _opposite_ the left-to-right text, and the opposite of ltr
    is rtl.

    Messing with the directionality of vertical scripts messes with other
    bits of text layout as well, and much of this interaction was left
    undefined. Character shaping, for example, depends on the
    BIDI-reordered string being in normal order. Not only character
    ordering, but the character shaping algorithm and the font rendering
    code all need to compensate for the altered input to the BIDI
    algorithm, and CSS3 Text failed to explain the necessary changes.

    For example, Mongolian is a cursive vertical script. Like Arabic (to
    which it is related), it is also a shaping script: a letter at the
    beginning of a word is shaped differently from one in the middle or at
    the end. Unicode defines Mongolian to be a left-to-right script, so
    shaping makes the leftmost character of a word into an initial and the
    rightmost character into a final. If, however, the Mongolian word is
    ordered right-to-left, then the initial letter of the word will be on
    the right, and therefore shaped as a final and not an initial. This is
    because shaping happens _after_ reordering. Vertical Mongolian text
    treated like this will look upside down and read like a bunch of
    gibberish, and no amount of glyph rotation can fix the problem. To
    make the letters connect properly under the CSS right-to-left
    override, the Mongolian parts of the text would need to be shaped in
    reverse and then have their glyphs rotated 180°--but this is not even
    mentioned in CSS3 Text.
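
    The failure can be sketched in a few lines of Python. This is a toy
    model, not a real shaping engine: shape_ltr is a hypothetical function
    that assigns positional forms purely by visual position, the way the
    paragraph above describes shaping for a left-to-right script, and the
    letters are placeholders rather than real Mongolian characters.

```python
def shape_ltr(visual):
    """Assign positional forms by visual position, leftmost first."""
    if len(visual) == 1:
        return [(visual[0], "isolated")]
    forms = []
    for i, ch in enumerate(visual):
        if i == 0:
            form = "initial"
        elif i == len(visual) - 1:
            form = "final"
        else:
            form = "medial"
        forms.append((ch, form))
    return forms


logical = ["M1", "M2", "M3"]  # placeholder letters in reading order

# Normal case: visual order equals reading order.
normal = dict(shape_ltr(logical))

# Under the right-to-left override the visual order is reversed
# before the shaper sees it, so the forms land on the wrong letters.
overridden = dict(shape_ltr(list(reversed(logical))))
```

    In the overridden case the first letter of the word, M1, receives the
    final form and the last letter, M3, receives the initial form.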

    To accommodate CSS3 Text's ill-defined tweaks to BIDI reordering
    (and character shaping and font rendering), a layout system can't
    simply pass the string to standard Unicode processing functions.
    Assume, however, that the layout system manages to hold up internally
    the pretense that "top-to-bottom" is "right-to-left". It still needs
    to interact with BIDI instructions from the outside world, which
    doesn't share the delusion. In an effort to make the outside world
    _seem_ like it's adapted to these changes, CSS3 Text instructs the
    designer to use 'direction: rtl' when assigning 'block-progression: lr'
    to a block of top-to-bottom text (such as Mongolian or Chinese, both
    actually ltr scripts), in effect asking him to lie about the text's
    properties. Like most lies, it seems to work in the general case, but
    as the situation gets complicated, the system breaks down...

      * Foremost, if the expected block progression fails to take effect--
        whether through the cascade or through lack of UA support--the
        text direction and the assigned embedding direction no longer match
        and the subtleties of Unicode BIDI can wreak havoc on the order of
        the text.
                             <example>

      * CSS embeddings set on elements _within_ the formatted block are no
        longer necessarily going the right way.
                             <example>

      * HTML dir attributes that were added with the assumption of
        regular, horizontal text might or might not need to have their
        effects be reversed.
                             <example>

      * There is no mention of how character shaping should happen.

    In conclusion, abusing directionality controls to make a limited
    system lay out text correctly doesn't scale. It's a hack, not a
    solution.

Describing Text Flow

    To describe how a text flows into lines, one needs to know three
    things:

      Image: Three Vectors
             <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/text-flow-vectors-tb.svg>

      * which way the text flows within a line (inline progression)
      * which way the lines stack (block progression)
      * which way the glyphs are facing (glyph orientation)

    However, not all combinations of text direction and glyph orientation
    are valid. Therefore if certain of the character's inherent
    characteristics are known, it is often possible to derive one from the
    other. Unicode systems take advantage of this model in horizontal
    text: you don't have to manually tell every run of Hebrew to order
    itself right-to-left, and you don't need to specify that Mongolian
    characters turn themselves sideways when the text is running
    horizontally left-to-right.

Logical vs. Physical Description

    In a purely physical layout scheme, each of these text layout
    properties would be given as an absolute: The inline progression of
    this run of English is top to bottom, its glyph orientation is 90
    degrees clockwise, its block progression is from right to left.

    Image: Diagram of vectors for rotated English
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/text-flow-vectors-rl.svg>

    However, because the interrelationships among these properties are
    realized in the author's mind and not in the system,
      * The author must manually intervene any time there is a script
        change.
      * If one of the three properties fails to take effect (because of
        the Cascade or lack of UA support), then the layout breaks and the
        text becomes unreadable.

    A better system would embed knowledge of different scripts' intrinsic
    characteristics and define style properties in terms of the
    relationships among them.

Intrinsic Directionality and Orientation

    Each script has a characteristic writing direction, and each character
    in Unicode is assigned a directionality value based on this
    characteristic. Unfortunately, Unicode currently only defines
    horizontal directionality even though vertical and bi-orientational
    scripts have a vertical directionality as well. For example, while
    English can go either top to bottom or bottom to top (since it doesn't
    have a vertical directionality), Japanese must only go from top to
    bottom, even in a left-to-right block progression. Mongolian also has
    top-to-bottom vertical directionality. Unlike Japanese however, it has
    no definite horizontal directionality (just a canonical one for BIDI
    purposes).

Script Classification by Directionality

    Scripts can be classified into three orientational categories:

    horizontal
           Scripts that have horizontal, but not vertical, directionality.
           Includes: Latin, Arabic, Hebrew, Devanagari

    vertical
           Scripts that have vertical, but not horizontal, directionality.
           Includes: Mongolian, Manchu

    bi-orientational
           Scripts that have both vertical and horizontal directionality.
           Includes: Han, Hangul, Yi, Ogham

    Bi-orientational scripts may be further classified by how their glyphs
    transform when switching orientations. CJK characters translate; they
    are always upright. Other scripts, such as Ogham and some variants of
    classical Yi, must be rotated.
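
    As a rough illustration, the classification could be held in a lookup
    table such as the Python sketch below. The entries are only the
    examples listed above; the names Orientation, SCRIPT_ORIENTATION,
    BI_ORIENTATIONAL_TRANSFORM, classify, and glyph_transform are invented
    here, and a real table would have to cover every script.

```python
from enum import Enum


class Orientation(Enum):
    HORIZONTAL = "horizontal"
    VERTICAL = "vertical"
    BI_ORIENTATIONAL = "bi-orientational"


# Only the example scripts named in the text.
SCRIPT_ORIENTATION = {
    "Latin": Orientation.HORIZONTAL,
    "Arabic": Orientation.HORIZONTAL,
    "Hebrew": Orientation.HORIZONTAL,
    "Devanagari": Orientation.HORIZONTAL,
    "Mongolian": Orientation.VERTICAL,
    "Manchu": Orientation.VERTICAL,
    "Han": Orientation.BI_ORIENTATIONAL,
    "Hangul": Orientation.BI_ORIENTATIONAL,
    "Yi": Orientation.BI_ORIENTATIONAL,
    "Ogham": Orientation.BI_ORIENTATIONAL,
}

# How bi-orientational glyphs transform when switching orientation.
# Yi is left out because the text says it depends on the variant.
BI_ORIENTATIONAL_TRANSFORM = {
    "Han": "translate",
    "Hangul": "translate",
    "Ogham": "rotate",
}


def classify(script):
    """Return the orientational category of a script."""
    return SCRIPT_ORIENTATION[script]


def glyph_transform(script):
    """Return 'translate', 'rotate', or None when not recorded here."""
    return BI_ORIENTATIONAL_TRANSFORM.get(script)
```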

Logical Text Flow

Implying Direction

    Scripts in their native orientation need no additional stylistic hints
    for proper layout: their inline progression and glyph orientation are
    both intrinsically mandated, so the style system can know by itself
    how to lay them out. *Directionality and glyph orientation overrides
    are not necessary and should not be used.* (In fact, using them
    degrades the system by creating a tangle of dependencies, as
    demonstrated in the section on the current version of CSS3 Text.)

    Scripts in a foreign orientation don't need directionality or glyph
    overrides either. They just need a few hints: whether to translate
    upright, or, if they're rotated sideways, which side is "up". Given
    that, the rules for laying out the text in its native orientation are
    enough to determine the inline progression and exact glyph
    orientation.

Orienting by Block Progression

    For scripts in a non-native orientation, the natural inline text flow
    depends on the direction of line stacking: the text is most
    comfortably laid out as if the whole text block were merely rotated
    from the horizontal. For example, English text in vertical lines that
    stack from left to right will face with the glyphs' tops towards the
    left and the text direction running from bottom to top. The same text,
    by the same logic, would in a right-to-left line stacking context face
    right and flow within each line from top to bottom.

    Image: Diagram of text block rotation
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/text-flow-natural.svg>

    Note: Merely rotating the rendered text from a horizontal layout is
    not sufficient because while the primary script is horizontal, it may
    include some vertical text (such as Chinese) that would need to be
    laid out appropriately for vertical lines.

    Putting this logic into the style system is straightforward: define
    "up" for non-native glyphs to point to the beginning of the line
    stack, and the inline progression follows from that orientation. The
    glyph orientation and inline progression will thus adapt to whichever
    block progression happens to take effect.
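
    A small Python sketch of that rule follows. The function name
    natural_layout is invented for illustration. The left-to-right results
    restate the English example above; the right-to-left results are an
    assumption derived from the same rotation geometry rather than
    something stated in the text.

```python
def natural_layout(block_progression, direction="ltr"):
    """Return (glyph_up, inline_progression) for a horizontal script.

    "Up" points at the beginning of the line stack, and the inline
    progression follows from that orientation.
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    ltr = direction == "ltr"
    if block_progression == "tb":
        # Native horizontal layout: nothing is rotated.
        return "top", "left-to-right" if ltr else "right-to-left"
    if block_progression == "lr":
        # Lines start at the left, so glyph tops face left.
        return "left", "bottom-to-top" if ltr else "top-to-bottom"
    if block_progression == "rl":
        # Lines start at the right, so glyph tops face right.
        return "right", "top-to-bottom" if ltr else "bottom-to-top"
    raise ValueError("block_progression must be 'tb', 'lr', or 'rl'")
```

    Because both results are computed from whichever block progression is
    in effect, a cascade that changes the block progression cannot leave
    the glyph orientation and inline progression out of step.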

    This layout scheme is most appropriate for dealing with text that has
    been turned on its side for layout purposes--as for page headers
    or captions or table headings. However, a major use case for laying
    out text in a non-native orientation is mixing horizontal and vertical
    scripts, which introduces the requirement of making the secondary
    scripts flow well in the context of the primary script.

    For example, a primarily Mongolian document, which has vertical lines
    stacking left to right, usually lays its Latin text with the glyphs
    facing the right. This makes the text run in the same inline
    progression as Mongolian and face the same direction it does in other
    East Asian layouts (which have vertical lines stacking right to left),
    but the glyphs are facing the _bottom_ of the line stack rather than
    the top, something they wouldn't do in a primarily-English paragraph.

    Image: Text Flow Vectors in Mongolian Text
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/mongolian-vectors.jpg>

    Yet another common layout is to keep the horizontal script's glyphs
    upright and order them from top to bottom; this is frequently done
    with Latin-script acronyms in vertical East Asian text.

    Image: <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/vertical-acronym.gif>

    To handle these layouts, the style system needs to offer controls for
    choosing among these different layout schemes. Note, however, that
    scripts in their native orientations do not need these hints; only the
    non-native ones do. Also, only one simple scheme switch is needed:
    there's no need for the designer to set separate absolute inline
    progression and glyph orientation controls or to set styling
    properties on each text run of a different script.

The Three Switches of Logical Text Layout

    In summary, to lay out a block of arbitrary, mixed-script text, the
    layout system needs to offer only three controls:

      * primary script's directionality (BIDI property)
      * block progression direction (stylistic property)
      * glyph orientation scheme (stylistic property)

    Formalized into CSS syntax, this becomes:

    direction
           Primary directionality. Can take the following values

         ltr
                 Left-to-right directionality in horizontal text; No
                 inherent directionality in vertical text. (Horizontal
                 script) Examples: Latin, Tibetan

         rtl
                 Right-to-left directionality in horizontal text; No
                 inherent directionality in vertical text. (Horizontal
                 script) Examples: Arabic, Hebrew

         ttb
                 Top to bottom directionality in vertical text; No
                 inherent directionality in horizontal text. (Vertical
                 script) Example: traditional Mongolian

         ltr-ttb
                 Left to right directionality in horizontal text; Top to
                 bottom directionality in vertical text. (Bi-orientational
                 script) Examples: Han, modern Yi

         ltr-btt
                 Left to right directionality in horizontal text; Bottom
                 to top directionality in vertical text. (Bi-orientational
                 script) Example: Ogham

    block-progression
           Block progression (line stacking) direction. Can take the
           following values

         tb
                 Top-to-bottom line stacking (horizontal text). Typically
                 used for most non-East-Asian layout.

         rl
                 Right-to-left line stacking (vertical text). Typically
                 used for traditional CJK layout.

         lr
                 Left-to-right line stacking (vertical text). Typically
                 used for traditional Mongolian layout.

    text-orientation-vertical
           Glyph orientation scheme to use in vertical text. Can take the
           following values

         natural
                 Non-vertical script runs are laid out as if "up" was
                 towards the top of the line stack (left or right,
                 depending on the block progression in effect). (Vertical
                 scripts are laid out as vertical scripts.)

         left
                 Non-vertical script runs are laid out as if "up" was
                 towards the left side of the line stack. (Vertical
                 scripts are laid out as vertical scripts.)

         right
                 Non-vertical script runs are laid out as if "up" was
                 towards the right of the line stack. (Vertical scripts
                 are laid out as vertical scripts.)

         upright
                 Non-vertical scripts' characters read top to bottom, with
                 each grapheme cluster oriented upright. (Vertical scripts
                 are laid out as vertical scripts.)

           For handling vertical-only scripts in horizontal layout, a
           text-orientation-horizontal property is also necessary; it
           takes effect only when the block progression is top-to-bottom.
           To keep the discussion less verbose, I am relegating
           consideration of horizontal layout to an appendix.

    As long as the directionality is set correctly for the text (and it
    should be set automatically from the content/markup as long as the
    designer doesn't touch it later), any combination of the
    block-progression and text-orientation stylistic values will result in
    a correct (though perhaps not optimally-designed) text layout.

    The style system can thus handle most of the intricacies of laying out
    both usual and unusual combinations of text by itself. What it needs
    to do this, however, is to know the intrinsic properties of the
    characters and the logic of laying them out.

Implementing A Logical Text Layout System

Composing Lines of Text

    Handling block-progression is very straightforward: just stack the
    composed lines in the stacking direction. Composing the lines of text
    is more complicated. The text needs to go through three processing
    steps.

Character Ordering

    Character ordering is where the BIDI algorithm gets applied. *The
    algorithm remains essentially unchanged when dealing with vertical
    text: what changes is the data.* Specifically, the directionality
    values of certain characters are mapped into the algorithm differently
    depending on the styling context.

    The Unicode Bidirectional (BIDI) Algorithm deals with two
    directions: left-to-right (towards right) and right-to-left (towards
    left), defined to be the same as the script directionalities involved.
    Although this multi-directional model has several more directionality
    values, the BIDI algorithm here still deals with only two directions:
    it just abstracts them so that they could just as easily be
    bottom-to-top (towards top) and top-to-bottom (towards bottom). To
    avoid the apparent absurdity of mapping right to left and such things,
    I will call the two BIDI directions "high" (H) and "low" (L).
    (Implementations, no doubt, will prefer to call them "left" and
    "right" to map directly into the Unicode BIDI algorithm.)

    It is important to keep in mind that these directions are abstract. We
    will map "left", "right", "top", and "bottom" to "high" or "low" based
    on the values of text-orientation and block-progression. *The mapping
    applies to everything: the individual character's directionality,
    embedding and override codes, the CSS direction values, HTML dir
    attributes, etc.* Once the line is composed, we then lock "high" and
    "low" to the appropriate sides of the block as we stack the lines
    according to block-progression.

Directionality Mapping: Vertical Case

    In vertical context, bi-orientational scripts use their vertical
    directionality and behave as vertical, not horizontal, scripts. Han,
    for example, as a ltr-ttb script, is treated as ttb (top to bottom),
    _not_ ltr (left to right). The ltr-ttb value for direction is
    correspondingly treated the same way as the value ttb.

For text-orientation: right (and text-orientation: natural in a
right-to-left block progression):

      * Map ttb and ltr to htl (high to low)
      * Map btt and rtl to lth (low to high)

    Image: Diagram of 'right' Mapping
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/bidi-right.svg>

    Run the Unicode BIDI Algorithm with its "left" being our "high" and
    its "right" being our "low".

For text-orientation: left (and text-orientation: natural in a left-to-right
block progression):

      * Map ttb and rtl to lth (low to high)
      * Map btt and ltr to htl (high to low)

    Image: Diagram of 'left' mapping
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/bidi-left.svg>

    Run the Unicode BIDI Algorithm with its "left" being our "high" and
    its "right" being our "low".

For text-orientation: upright

      * Map ttb, ltr, and rtl to htl (high to low)
      * Map btt to lth (low to high)

    Image: Diagram of 'upright' mapping
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/bidi-upright.svg>

    Run the Unicode BIDI Algorithm with its "left" being our "high" and
    its "right" being our "low".
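
    The three vertical-case mappings above can be collected into one
    table, sketched here in Python. The function and table names are
    invented for illustration. The sketch covers only the step of mapping
    a directionality value to "htl" or "lth"; running the BIDI algorithm
    itself is out of scope.

```python
# Directionality mapping per resolved text orientation, as listed above.
MAPPING = {
    "right":   {"ttb": "htl", "ltr": "htl", "btt": "lth", "rtl": "lth"},
    "left":    {"ttb": "lth", "rtl": "lth", "btt": "htl", "ltr": "htl"},
    "upright": {"ttb": "htl", "ltr": "htl", "rtl": "htl", "btt": "lth"},
}


def resolve_orientation(text_orientation, block_progression):
    """Resolve 'natural' to 'right' or 'left' from the block progression."""
    if text_orientation != "natural":
        return text_orientation
    if block_progression == "rl":
        return "right"
    if block_progression == "lr":
        return "left"
    raise ValueError("the vertical case needs block progression 'rl' or 'lr'")


def vertical_direction(direction):
    """In vertical context, bi-orientational values act as their vertical half."""
    return {"ltr-ttb": "ttb", "ltr-btt": "btt"}.get(direction, direction)


def map_direction(direction, text_orientation, block_progression):
    """Map a directionality value to 'htl' (high to low) or 'lth' (low to high)."""
    orientation = resolve_orientation(text_orientation, block_progression)
    return MAPPING[orientation][vertical_direction(direction)]
```

    For example, Han (ltr-ttb) under 'natural' maps to "htl" in a
    right-to-left block progression and to "lth" in a left-to-right one,
    while Latin (ltr) maps to "htl" in both.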

Glyph Orientation

    Before the system can paint the text (or even do alignment), it needs
    to know how to rotate the glyphs. For vertical and bi-orientational
    scripts, this is simply "rotate me to my intrinsic position". This
    doesn't mean "don't rotate me, I'm supposed to be upright", however,
    because *the standard representation of a character in a font is the
    one used in horizontal text*.

Vertical Scripts

    Han and Kana and Hangul and Yi do need to be kept upright (0°
    rotation) because they use the same orientation in both horizontal and
    vertical text. Mongolian (and Ogham), however, rotate from one context
    to the other and so their glyphs must be rotated 90° from their
    horizontal orientation when used in vertical context. Part of the
    system's knowledge, therefore, needs to be which scripts need to be
    rotated and which merely translated into place. Given that and the
    script's directionality, the exact rotation can be derived as follows:

    System's Knowledge of Vertical Scripts' Properties -
                              Han/Hangul/Kana/Yi   Mongolian/Manchu   Ogham
     (canonical) horizontal
     directionality.........       LTR                  (LTR)          LTR
     vertical directionality       TTB                   TTB           BTT
     transformation            translation             rotation      rotation

    System's Derivation of Vertical Scripts' Orientation
                            Han/Hangul/Kana/Yi    Mongolian/Manchu    Ogham
    horizontal orientation
    (vector direction)....
       inline progression          90°                  90°             90°
       glyph orientation           0°                   0°              0°

    transformation........
       inline progression        rot 90°              rot 90°         rot -90°
       glyph orientation         static               rot (90°)       rot (-90°)

    vertical orientation
    (vector direction)....
       inline progression          180°                 180°            0°
       glyph orientation           0°                   90°             270°
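
    The derivation in the table can be reproduced with a little
    arithmetic, sketched below in Python. The function name
    vertical_vectors is invented for illustration, and angles use the
    table's convention (degrees clockwise, with 0° pointing up).

```python
# Horizontal vectors from the table: inline progression 90°, glyphs upright.
HORIZONTAL_INLINE = 90
HORIZONTAL_GLYPH = 0


def vertical_vectors(vertical_directionality, transformation):
    """Return (inline_progression, glyph_orientation) in vertical text."""
    if vertical_directionality == "ttb":
        rot = 90    # rotate the inline vector to point down
    elif vertical_directionality == "btt":
        rot = -90   # rotate the inline vector to point up
    else:
        raise ValueError("vertical directionality must be 'ttb' or 'btt'")

    inline = (HORIZONTAL_INLINE + rot) % 360
    if transformation == "rotation":
        # Rotating scripts turn their glyphs along with the line.
        glyph = (HORIZONTAL_GLYPH + rot) % 360
    elif transformation == "translation":
        # Translating scripts keep their glyphs static.
        glyph = HORIZONTAL_GLYPH
    else:
        raise ValueError("transformation must be 'rotation' or 'translation'")
    return inline, glyph
```

    Feeding in the three columns of the first table yields the last two
    rows of the second: (180, 0) for Han, (180, 90) for Mongolian, and
    (0, 270) for Ogham.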

Horizontal Scripts

    For horizontal scripts, the method is "rotate me according to the
    relevant text-orientation style".

For text-orientation: right or text-orientation: natural in a right-to-left
block progression:

    Rotate horizontal scripts' grapheme clusters 90° to the right.

    Image: Glyphs rotated right
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/glyph-right.svg>

For text-orientation: left or text-orientation: natural in a left-to-right
block progression:

    Rotate horizontal scripts' grapheme clusters 90° to the left.

    Image: Glyphs rotated left
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/glyph-left.svg>

For text-orientation: upright

    Keep glyphs for horizontal scripts upright and stack grapheme clusters
    vertically.

    Image: Glyphs translated upright
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/glyph-upright.svg>

Punctuation

    Transformations for punctuation, being somewhat arbitrary and
    stylistic, should be handled by using vertical glyph variants given in
    the font, but only when the direction of the text is a vertical or
    bi-orientational directionality or text-orientation-vertical is
    upright. (If the text is primarily horizontal text rotated sideways,
    then the punctuation should likewise be horizontal punctuation rotated
    sideways.)
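
    Expressed as a predicate, the rule might look like the Python sketch
    below. The names are invented for illustration, and the set of
    direction values is taken from the formalization earlier in this
    document.

```python
# Direction values with a vertical directionality (vertical or bi-orientational).
VERTICAL_CAPABLE = {"ttb", "ltr-ttb", "ltr-btt"}


def use_vertical_punctuation(direction, text_orientation_vertical):
    """Decide whether the font's vertical punctuation variants apply in vertical text."""
    return direction in VERTICAL_CAPABLE or text_orientation_vertical == "upright"
```

    Rotated horizontal text (for example direction 'ltr' with
    text-orientation-vertical 'right') returns False and so keeps its
    horizontal punctuation, rotated along with the letters.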

Character Shaping

    Character shaping is the process of selecting, based on context, which
    of several allographs of a letter should be used. This is typical of
    cursive scripts like Arabic and Mongolian, where the shape of a letter
    depends on whether it comes at the start of a word, in the middle of a
    word, or at the end of a word.

    Image: Diagram of Arabic shaping
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/shaping.svg>

    According to UAX 9, character shaping occurs _after_ BIDI reordering:
    the Arabic character shaped as an "initial" will always be on the
    right, even if the text is given a left-to-right override. This
    ensures that the letters always connect. (An initial form on the left
    side of the word would be trying to connect to nothing.)

    To deal with the multiple orientations of vertical layout, the shaping
    logic needs to know not just the reordered string of characters, but
    which side of the line is "up". If we turn the glyphs all upside-down,
    for instance, the shaping needs to be done in reverse.

    Image: Diagram of reverse (upside-down) shaping
           <http://fantasai.inkedblade.net/style/discuss/vertical-text/diagrams/reverse-shaping.svg>

    Because in vertical text Arabic and Mongolian can go in the same direction
    or in opposite directions, merely inverting the entire character string
    before passing it to standard Unicode shaping functions doesn't work.

    Shaping occurs only within each directional level run. Shaping is also
    constrained to runs of text in the same script; Mongolian characters,
    from Arabic's point of view, form as concrete a boundary as Latin ones
    do. It is therefore possible to break up the text into pieces that
    have characters from no more than one shaping-affected script without
    compromising the accuracy of the shaping. Then, for each run of text,
    one can use the shaping script characters' glyph orientation (derived
    above) to determine which way is "up" (0°) and hence which are the
    "left" (-90°) and "right" (+90°) sides of the text run. Once that's
    known the text run can be shaped, in reverse if necessary.
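
    The splitting step described above might be sketched as follows. This
    is a toy under stated assumptions, not a real shaper: the input is
    taken to be a single directional level run, shaping_script() classifies
    characters only by the Arabic (U+0600..U+06FF) and Mongolian
    (U+1800..U+18AF) Unicode blocks, positional_forms() is a stand-in that
    labels letters purely by position in the run, and needs_reverse is a
    hypothetical callback that reports whether a script's glyph
    orientation calls for reversed shaping. All of these names are
    invented for the example.

```python
from itertools import groupby


def shaping_script(ch):
    """Toy classifier: 'arabic', 'mongolian', or None, by Unicode block."""
    cp = ord(ch)
    if 0x0600 <= cp <= 0x06FF:
        return "arabic"
    if 0x1800 <= cp <= 0x18AF:
        return "mongolian"
    return None


def split_shaping_runs(text):
    """Split text into (script, substring) runs so that each run contains
    characters from at most one shaping-affected script."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=shaping_script)]


def positional_forms(run):
    """Toy stand-in for a shaper: label each letter by its position.
    Real shaping depends on joining behaviour, not just position."""
    if len(run) == 1:
        return ["isolated"]
    return ["initial"] + ["medial"] * (len(run) - 2) + ["final"]


def shape_runs(text, shape, needs_reverse):
    """Shape each run on its own, reversing a run first when required.

    The result for a reversed run is flipped back so that the labels stay
    aligned with the characters in their original order.
    """
    shaped = []
    for script, run in split_shaping_runs(text):
        if script is None:
            shaped.append((script, run, None))  # nothing to shape
        elif needs_reverse(script):
            shaped.append((script, run, shape(run[::-1])[::-1]))
        else:
            shaped.append((script, run, shape(run)))
    return shaped
```

    Because each run is reversed (or not) independently, Arabic and
    Mongolian runs in the same line can be shaped in opposite directions,
    which inverting the whole string at once cannot achieve.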

Understanding Character Characteristics

    In addition to knowing the text, its primary directionality, and its
    styling properties, the implementation needs to know something about
    the characters themselves to be able to take advantage of the logical
    model. For each character, the following information must be available
    to the text layout algorithm:

      * horizontal directionality: ltr or rtl
        For vertical scripts this means the canonical directionality that
        is used for fonts and for plaintext horizontal layout.
      * vertical directionality: ttb, btt, or none
        For horizontal scripts, this is none.
      * glyph transformation between horizontal and vertical orientations:
        translation, rotation, or not applicable
        Only applies to scripts with vertical directionality.

    Unicode only specifies the horizontal directionality. For some scripts
    (not all), vertical directionality can be gleaned from the prose
    chapters describing the different writing systems. Glyph
    transformations are not given at all, only implied.
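
    One way to picture the per-character record listed above is as a small
    data structure. The class and field names below are invented for
    illustration and do not correspond to any existing Unicode property or
    API; the only rule enforced is the one stated in the list, that a
    glyph transformation applies only to scripts with a vertical
    directionality.

```python
from dataclasses import dataclass
from enum import Enum


class HorizontalDirection(Enum):
    LTR = "ltr"
    RTL = "rtl"


class VerticalDirection(Enum):
    TTB = "ttb"
    BTT = "btt"
    NONE = "none"  # horizontal scripts


class GlyphTransformation(Enum):
    TRANSLATION = "translation"
    ROTATION = "rotation"
    NOT_APPLICABLE = "not applicable"


@dataclass(frozen=True)
class CharacterLayoutInfo:
    """Hypothetical per-character record needed by the layout algorithm."""
    horizontal: HorizontalDirection
    vertical: VerticalDirection
    transformation: GlyphTransformation

    def __post_init__(self):
        # A transformation between horizontal and vertical orientations
        # only makes sense for scripts with a vertical directionality.
        if (self.vertical is VerticalDirection.NONE
                and self.transformation is not GlyphTransformation.NOT_APPLICABLE):
            raise ValueError("horizontal scripts have no glyph transformation")


# A horizontal left-to-right script such as Latin: no vertical
# directionality, so the transformation is not applicable.
LATIN = CharacterLayoutInfo(
    HorizontalDirection.LTR,
    VerticalDirection.NONE,
    GlyphTransformation.NOT_APPLICABLE,
)
```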

Why and How the Unicode Consortium Should Be Involved

    The text layout model outlined in this document adds to the scope of
    the Unicode BIDI Algorithm and requires additional knowledge of
    character properties. This expansion should be part of Unicode rather
    than an alteration defined in CSS3. Standardizing it at the Unicode
    level rather than at the CSS level is more appropriate because:
      * The Unicode Consortium has the expertise necessary to specify the
        character data correctly even for obscure scripts.
      * The extended data and algorithms operate at the same level as the
        existing Unicode specifications.
      * Non-CSS systems wanting to use this model will have a solid base
        to work off of rather than having to adapt bits of a high-level
        protocol (CSS3) to fit their application.
      * Standard Unicode APIs can be designed to handle the extended BIDI
        and shaping manipulations so that each application will not need
        to implement all of that itself.
      * Intermediary systems such as HTML can accommodate the model by
        building up from Unicode rather than down from CSS. (HTML would
        need the new direction values for its dir attribute.)
      * Unicode can add character-level support for
        vertical-directionality by defining directionality control codes
        to correspond with the vertical directionality requirements. This
        will allow the same plain text to be properly flowed into vertical
        layout contexts as well as horizontal ones.

What happens if Unicode chooses not to standardize the additional character
data

    I will be including the results of my personal research as a normative
    appendix to the next revision of CSS3 Text. Should the Unicode
    Consortium provide the necessary character data, I will publish a new
    version that removes the appendix and instead references the relevant
    sections of the Unicode Standard.

About the Author and the Status of CSS3 Text

    I am an invited expert for the CSS Working Group at the World Wide Web
    Consortium (W3C) and the new editor of the CSS3 Text module. I intend
    to completely rewrite the Text Layout chapter for the next version of
    CSS3 Text based on the principles outlined herein.

Acknowledgements

    Thanks go out to:

      * Martin Heijdra at the Gest Library, for his guidance, expertise,
        and enthusiasm. I had never imagined that the research staffer
        helping me find books would turn out to be an expert on
        international typography and Mongolian in particular.
      * Ian Hickson, the members of the www-style mailing list, the
        members of the CSS Working Group, and the contributors to the
        Mozilla Project for tempering my technical skills and CSS
        knowledge over the years.
      * The CSS Working Group for giving me a chance to fix everything I
        found wrong in the CSS3 Text drafts.
      * Last, but not least, Håkon Wium Lie and Opera Software <http://www.opera.com/>
        for supporting me in creating this work, to the point of even _paying_
        me to finish it. I only wish it hadn't taken so long so that I could
        spend more time on QA. ;)

Appendix: Vertical Scripts in Horizontal Flow
<http://fantasai.inkedblade.net/style/discuss/vertical-text/appendix>

################################################################################

-- 
http://fantasai.inkedblade.net/contact

Received on Wednesday, 20 October 2004 18:19:04 UTC