[css3-writing-modes] bidi-style resolution of punctuation orientation

John Daggett wanted me to write a spec for bidi-style resolution of punctuation
orientation for vertical text.

Let's first review how the bidi algorithm deals with neutrals. Note: I'm going
to simplify this a little; if you want the full details, go read the bidi spec.
   http://unicode.org/reports/tr9/

Neutral Resolution in UAX9 (simplified)
---------------------------------------

   All characters in Unicode are classified as having an intrinsic directionality:
     - left (L) for LTR scripts
     - right (R) for RTL scripts
     - neutral (N) for symbols and punctuation

   The neutral characters must be resolved to either L or R before reordering happens.
   This is done by looking at their surrounding characters:
     * neutrals surrounded by L become L
          LNNNL -> LLLLL
     * neutrals surrounded by R become R
          RNNNR -> RRRRR

   But sometimes this is not enough to make a clear distinction:
      - LNNNR -> L???R

   This is where the paragraph's base directionality comes in. In an LTR paragraph,
   such conflicts resolve to L, and in an RTL paragraph they resolve to R:
      <p dir=ltr>LNNNR</p> -> <p dir=ltr>LLLLR</p>
      <p dir=rtl>LNNNR</p> -> <p dir=rtl>LRRRR</p>

   The base paragraph direction is also used to set the direction of the start and
   end of the paragraph, so that neutrals at the beginning and end of the text
   (such as quotes and periods) resolve correctly.
      <p dir=ltr>LRN</p> -> <p dir=ltr>LRL</p>
      <p dir=rtl>RLN</p> -> <p dir=rtl>RLR</p>
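
   The simplified rules above can be sketched in a few lines of Python. This
   is only an illustration of the simplification used in this message, not of
   UAX9 itself (which has many more character classes and rules); the function
   name and the string-of-classes representation are made up for the example.

```python
def resolve_neutrals(classes, base):
    """Resolve every N in a string of L/R/N classes.

    classes: e.g. "LNNNR"; base: "L" for dir=ltr, "R" for dir=rtl.
    The paragraph start and end count as having the base direction.
    """
    out = list(classes)
    i = 0
    while i < len(out):
        if out[i] != "N":
            i += 1
            continue
        # Find the end of this run of neutrals.
        j = i
        while j < len(out) and out[j] == "N":
            j += 1
        before = out[i - 1] if i > 0 else base
        after = out[j] if j < len(out) else base
        # Agreeing neighbours win; a conflict falls back to the base direction.
        resolved = before if before == after else base
        out[i:j] = resolved * (j - i)
        i = j
    return "".join(out)


print(resolve_neutrals("LNNNR", "L"))  # conflict in an LTR paragraph
print(resolve_neutrals("LNNNR", "R"))  # same text in an RTL paragraph
```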

   These rules work correctly in many cases, but not all. For example, if you are
   listing words or phrases in the opposite direction (such as English product
   names in an Arabic sentence), the neutrals will resolve incorrectly.
     List of words: AAA, BBB, CCC. -> List of words: CCC ,BBB ,AAA.

   To work around this, Unicode has invisible characters with a strong directionality,
   which can be strategically placed to break the run:
     List of words: AAA&lrm;, BBB&lrm;, CCC. -> List of words: AAA, BBB, CCC.
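
   To see why the mark helps, one can list the strong characters on either
   side of each run of neutrals. The sketch below uses Python's unicodedata
   module and collapses the real bidi classes into L, R, and N, which is a
   simplification; Hebrew letters stand in for the RTL words AAA, BBB, CCC,
   and the helper names are made up for the example.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK: invisible, but strongly L


def strong_class(ch):
    """Collapse Unicode bidi classes into L, R, or N (a simplification)."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "L"
    if bidi in ("R", "AL"):
        return "R"
    return "N"


def neutral_contexts(text, base="L"):
    """Return (before, after) strong classes for each run of neutrals."""
    classes = [strong_class(ch) for ch in text]
    contexts = []
    i = 0
    while i < len(classes):
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < len(classes) and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else base
        after = classes[j] if j < len(classes) else base
        contexts.append((before, after))
        i = j
    return contexts


# Three RTL "words" separated by ", " and ended by "." in an LTR paragraph.
plain = "\u05d0\u05d0\u05d0, \u05d1\u05d1\u05d1, \u05d2\u05d2\u05d2."
marked = f"\u05d0\u05d0\u05d0{LRM}, \u05d1\u05d1\u05d1{LRM}, \u05d2\u05d2\u05d2."

print(neutral_contexts(plain))   # commas sit between R and R
print(neutral_contexts(marked))  # commas now sit between L and R
```

   Without the mark, each comma is surrounded by R and joins the RTL run;
   with it, the comma has L on one side and R on the other, so it falls back
   to the paragraph's LTR base direction.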

   In CSS3/HTML5, we can instead isolate the phrase, so that it is resolved
   independently internally, and does not affect anything outside:
     List of words: <bdi>AAA</bdi>, <bdi>BBB</bdi>, <bdi>CCC</bdi>. <!-- totally safe -->

Resolving Neutral Orientation
-----------------------------

   Now that we have some background on how the bidi algorithm works, we can look at
   how to map this behavior to resolving text orientation. There are several open questions:

   A) Which characters are neutral for orientation resolution?
   B) Which characters are affected by orientation resolution?
   C) How do we determine the base orientation?
   D) How do we give the author the ability to change the automatic behavior when
      it's wrong?

Suggestion for A (characters neutral for orientation resolution):

   Treat all punctuation (P*), symbols (S*), and ASCII digits (0-9) as neutral
   for the purposes of orientation resolution.
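
   As a rough illustration, this test could be written against the General
   Category values exposed by Python's unicodedata module. The function name
   is made up for the example, and the result depends on the Unicode version
   that the Python build ships with.

```python
import unicodedata


def is_orientation_neutral(ch):
    """True for punctuation (P*), symbols (S*), and ASCII digits 0-9."""
    if "0" <= ch <= "9":
        return True
    return unicodedata.category(ch)[0] in ("P", "S")


for ch in "!$7a":
    print(repr(ch), unicodedata.category(ch), is_orientation_neutral(ch))
```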

Suggestions for B (characters affected by orientation resolution):

   Not all neutral characters have an ambiguous orientation that should be
   resolved automatically. In fact, very few do. We need to make sure that
   we only automatically determine the orientation of characters whose
   orientation is subject to change based on context. Characters that have
   a definite orientation, or for which orientation is a stylistic (rather
   than contextual) preference, should not be subject to this automatic
   resolution.

   1. First, we can dismiss any symbols and punctuation that belong to a
      particular script (i.e. are not Common) as taking the orientation
      of that script.

   2. Second, we can cut out a lot of the punctuation by general category:
      http://www.unicode.org/reports/tr44/#General_Category_Values

      * dashes and connector punctuation (Pd, Pc) should be rendered to
        extend from top to bottom; if the font has vertical metrics, we
        can assume that it handles this already, otherwise they should
        be set sideways

      * enclosing punctuation (Pi, Ps, Pf, Pe) should enclose the text;
        again, if the font has vertical metrics, we can assume that it
        handles this already, otherwise they should be set sideways

      This leaves only the Other Punctuation category, Po.

   3. Third, we can cut out a lot of the symbols by general category as well:
      Currency Symbols (Sc) and Math Symbols (Sm) are typically set sideways
      even in vertical CJK, and therefore shouldn't be subject to automatic
      resolution. This leaves only the Modifier Symbol (Sk) and Other Symbols
      (So) categories.
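
      Steps 2 and 3 amount to a triage by general category, which might be
      sketched as below. The bucket names are invented for the example.
      Step 1 is not modelled, since it needs the Script property and
      Python's standard unicodedata module does not expose it.

```python
import unicodedata

# Step 2: leave it to the font's vertical metrics, else set sideways.
FONT_OR_SIDEWAYS = {"Pd", "Pc", "Pi", "Ps", "Pf", "Pe"}
# Step 3: typically set sideways even in vertical CJK.
SIDEWAYS = {"Sc", "Sm"}
# Whatever remains has to be examined more closely.
NEEDS_REVIEW = {"Po", "Sk", "So"}


def triage(ch):
    """Sort a character into one of the (hypothetical) buckets by category."""
    cat = unicodedata.category(ch)
    if cat in FONT_OR_SIDEWAYS:
        return "font-or-sideways"
    if cat in SIDEWAYS:
        return "sideways"
    if cat in NEEDS_REVIEW:
        return "needs-review"
    return "other"  # letters, digits, etc.: outside steps 2 and 3


for ch in "-($+!^":
    print(repr(ch), unicodedata.category(ch), triage(ch))
```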

   4. Let's look more closely at the Sk category:
      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DSk%3A]%26[%3AScript%3DCommon%3A]

      * Most of these are related to either Latin or IPA or UPA -> sideways
      * There's a Greek character -> sideways
      * And two voicing marks for kana -> upright
        (These are listed in ScriptExtensions.txt as part of Hira and Kana.)
      * The last three are fullwidth characters -> upright
      * The rest consists of several blocks of "tone letters", which
        are also intended for use with Latin-based phonetic transcriptions
        (See Unicode 6.0 section 7.8) -> sideways

      Nothing left that's ambiguous.

   5. Let's look more closely at the Po category:
      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DPo%3A]%26[%3AScript%3DCommon%3A]
      Working backward (because it's easier):

      * Aegean numbers should be listed in ScriptExtensions.txt -- that's
        a Unicode bug
      * U+FF61, U+FF64, U+FF65 are CJK halfwidth punctuation and should be
        set accordingly
      * The Fullwidth and Small form variants should be set upright, since
        as Wide characters they can be assumed to belong with CJK.
      * Overscores and underscores should be set sideways, parallel with
        the baseline
      * Sesame dots, vertical presentation forms, and other CJK punctuation
        should be set upright.
      * U+2E18-U+2E31 I'm unsure of.
      * Ancient Greek and New Testament symbols should be set sideways
      * Archaic punctuation (U+2056-U+205E) is unclear
      * Double punctuation for vertical text should be upright
      * General punctuation U+2016-U+2057 could be ambiguous
      * The characters from the Vedic, Mongolian, Hanunoo, Runic, Georgian,
        Devanagari, Arabic, Armenian, and Greek And Coptic blocks should
        of course be set sideways.
      * Basic Latin and Latin 1 Supplement (U+0021-U+0387) remain ambiguous

      Characters with context-dependent orientation:
        Common Other Punctuation from the ranges U+2016-U+2057 and U+0021-U+0387,
        and potentially also U+2E18-U+2E31 and U+2056-U+205E.
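
      A minimal sketch of that summary as a lookup, assuming the ranges
      exactly as listed here. It does not check Script=Common (not
      available in Python's standard library), the names are invented for
      the example, and category assignments can differ between Unicode
      versions.

```python
import unicodedata

# Ranges taken from the summary above.
DEFINITE = [(0x0021, 0x0387), (0x2016, 0x2057)]
TENTATIVE = [(0x2E18, 0x2E31), (0x2056, 0x205E)]


def in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def po_context_status(ch):
    """Classify an Other Punctuation character, or return None."""
    if unicodedata.category(ch) != "Po":
        return None
    cp = ord(ch)
    if in_ranges(cp, DEFINITE):
        return "context-dependent"
    if in_ranges(cp, TENTATIVE):
        return "possibly context-dependent"
    return None


for ch in ["!", "\u2020", "\u2e18", "\u3001"]:
    print(f"U+{ord(ch):04X}", po_context_status(ch))
```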

   6. Now let's look more closely at the So category:
      http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DSo%3A]%26[%3AScript%3DCommon%3A]&g=

      We can roughly group the symbols into
        Alchemical Symbols
          - Alchemical Symbols block
        Musical Symbols
          - Musical Symbols block
          - Byzantine Musical Symbols block
          - Miscellaneous Symbols — Musical symbols
        Religious / Cultural Symbols
          - Tibetan — Religious symbols
          - Tai Xuan Jing Symbols block
          - Yijing Hexagram Symbols block
          - Miscellaneous Symbols — Medical and healing symbols
          - Miscellaneous Symbols — Religious and political symbols
          - Miscellaneous Symbols — Yijing trigram symbols
          - Miscellaneous Symbols — Astrological signs
          - Miscellaneous Symbols — Zodiacal symbol
          - Miscellaneous Symbols — Yijing monogram and digram symbols
        Geometric shapes
          - Miscellaneous Symbols And Arrows - anything not "arrows"
          - Dingbats — Stars, asterisks and snowflakes
          - Dingbats — Crosses
          - Miscellaneous Symbols — Syriac cross symbols
          - Miscellaneous Symbols — Pentagram symbols
          - Miscellaneous Symbols — Circles
          - Geometric Shapes — Control code graphics
          - Geometric Shapes — Geometric shapes
        Drawing
          - Box Drawing block
          - Block Elements block
        Arrows
          - Arrows block
          - Miscellaneous Symbols And Arrows - anything with "arrows"
          - Dingbats - Dingbat arrow
          - Miscellaneous Symbols — Pointing hand symbols
        Game Symbols
          - Playing Cards block
          - Domino Tiles block
          - Mahjong Tiles block
          - Miscellaneous Symbols — Chess symbols
          - Miscellaneous Symbols — Playing card symbols
          - Miscellaneous Symbols — Go markers
          - Miscellaneous Symbols — Dice
          - Miscellaneous Symbols — Symbols for draughts and checkers
          - Miscellaneous Symbols — Japanese chess symbols
        Pictographs
          - Transport and Map Symbols block
          - Miscellaneous Symbols and Pictographs block
          - Phaistos Disc
          - Dingbats — Miscellaneous ranging U+2701-U+2712
          - Miscellaneous Symbols - blocks not mentioned elsewhere
        Emoticons
          - Miscellaneous Symbols — Emoticons
          - Emoticons block
        Enclosed Text
          - Enclosed Ideographic Supplement block
          - Enclosed Alphanumeric Supplement block
          - Enclosed CJK Letters And Months block except
             - Squared Latin abbreviation
             - Telegraph symbols for months
        Block Ligatures - Latin
          - Enclosed CJK Letters and Months - Squared Latin abbreviation
          - CJK Compatibility — Squared Latin abbreviation
        Block Ligatures - CJK Ideographic
          - CJK Compatibility (everything else)
          - Enclosed CJK Letters and Months — Telegraph symbols for months
        Ancient Writing
          - Aegean Numbers block
          - Ancient Symbols block (Roman symbols)
        Modern Writing
          - Halfwidth and Fullwidth Forms block
          - Arabic Presentation Forms A block
          - Common Indic Number Forms block
        Ideographic Pieces
          - Ideographic Description Characters block
          - Kanbun block
          - CJK Strokes block
        Ornaments
          - Dingbats — Punctuation ornaments
          - Dingbats — Miscellaneous U+2753-U+2757
        Other Symbols
          - CJK Symbols and Punctuation - U+3004-U+303F
          - Dingbats — Heavy variants of arithmetic symbols
          - Dingbats — Miscellaneous ranging U+2713-U+2752, U+2758-U+27BF
          - Miscellaneous Symbols — Genealogical symbols
          - Miscellaneous Symbols — Gender symbol
          - Miscellaneous Symbols — Recycling symbols
          - Control Pictures block
          - Miscellaneous Technical block
        Letterlike and Latin Symbols
          - Latin 1 block
          - Letterlike Symbols block

        The Ancient and Modern Writing symbols should just follow the practice
        of the scripts they're used in.

        Letterlike Symbols and actual letters overlap in usage, so it would
        be, imho, a bad idea to treat one differently from the other. It would
        be odd, for example, for an inverted ohm sign to be treated differently
        from omega. Therefore I suggest making these sideways; they can be
        tipped upright, like the omega, explicitly if needed. (For consistency,
        the copyright sign and the registered sign should join the sound
        recording copyright sign in this.) But alternatively we could make
        them context-dependent; I am unsure whether that would give correct
        results, however, as it's unclear whether they should be set upright in
        CJK text or not.

        The broken bar is EAW=Na, which suggests a sideways setting.

        That leaves the section sign, degree sign, and pilcrow sign
        (U+00A7, U+00B0, U+00B6) as ambiguous in Latin 1.

        Koji and I investigated arrows and drawing characters and concluded
        that these, too, should be set sideways, including in CJK text.

        Should the Ornaments be set sideways as well?

        For the various pictographic and geometric symbols, what is an
        appropriate setting? Should the snowman be upright or sideways?
        Does this depend on CJK vs. Latin context, or is it a stylistic
        preference, or does everybody just want them upright?

        The enclosed and ligated ideographic characters and the descriptive
        pieces should certainly be upright. What about enclosed Latin?

        It seems unlikely that the Latin block ligatures (units) would be
        used outside CJK context, so they do not need to be ambiguous.
        Should they be treated as letter-like symbols (set sideways by
        default) or be set upright by default?

        Characters with context-dependent orientation:
           - U+00A7, U+00B0, U+00B6 from Latin 1
           - Potentially all pictographs, geometric shapes, and other symbols
             (recycling, etc)
           - Potentially enclosed alphanumerics
           - Potentially all letterlike symbols

Suggestions for C (base orientation):

   In order to function correctly, the bidi algorithm needs a correctly-set
   base directionality for the paragraph. This is a property of the content,
   and not something that can be reliably auto-detected; this is why we have
   the 'dir' attribute in HTML.

   Likewise, to automatically resolve the text's orientation, we need to
   know its base orientation. We could use a heuristic like "check the first
   character in the paragraph", but since that will give the wrong answer if
   the paragraph begins with a foreign word or name (such as an English product
   name), we need to be able to encode this information in the HTML.

   This can be done with the HTML lang attribute, which can accept script subtags
   from ISO 15924. If a document is tagged as lang="zh-Hant", we know it is
   written in traditional Chinese, and therefore will have an upright base
   orientation. Similarly if a document is tagged as lang="ja-Jpan", we know
   it is written in a combination of Han, Hiragana, and Katakana, and its
   base orientation is upright.
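
   A minimal sketch of such a lookup follows. Only the Hant and Jpan entries
   come from the examples above; the other table entries are assumptions
   added for illustration, and the subtag parsing is a simplification of the
   full language tag syntax. Untagged content returns None, which is exactly
   the open question below.

```python
# Script subtag -> base orientation. Hant and Jpan are from the text above;
# the remaining entries are illustrative assumptions, not a proposal.
BASE_ORIENTATION = {
    "Hant": "upright",
    "Jpan": "upright",
    "Hans": "upright",   # assumption
    "Kore": "upright",   # assumption
    "Latn": "sideways",  # assumption
}


def script_subtag(lang):
    """Return the ISO 15924 script subtag of a language tag, or None."""
    for part in lang.split("-")[1:]:
        if len(part) == 1:
            break  # a singleton starts an extension or private-use section
        if len(part) == 4 and part.isalpha():
            return part.title()
    return None


def base_orientation(lang):
    """Look up the base orientation; None if the script is not tagged."""
    return BASE_ORIENTATION.get(script_subtag(lang))


for tag in ["zh-Hant", "ja-Jpan", "en"]:
    print(tag, script_subtag(tag), base_orientation(tag))
```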

   A potentially tricky case is Mongolian: its ambiguous punctuation should
   resolve to sideways, but if we choose to make pictographic symbols
   context-dependent, it should trigger upright symbols.

   The question then is, what do we do if the script is not tagged (as it
   almost never will be)? Do we use a heuristic, or default to one orientation
   or another? If so, which one?

Suggestions for D (author overrides):

   If ambiguous punctuation resolves to the wrong orientation, how can the
   author let us know the correct orientation (without wrapping the affected
   punctuation characters in <span>s, as that would probably give very awkward
   markup)?

   One potential answer would be to say that if an inline's 'text-orientation'
   property has a value other than the one that triggers automatic resolution,
   or its script tag resolves to a different base orientation, then its contents
   are ignored during neutral resolution. That way a list's items can be
   individually wrapped with an effect similar to <bdi>.
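
   One possible reading of "ignored" is sketched below: wrapped inlines are
   simply skipped when looking for the neighbours of a neutral run. This is
   an assumption about how the idea might work, not a worked-out proposal.
   Items are "U" (upright), "S" (sideways), "N" (neutral), or None for a
   wrapped inline; all names are made up for the example.

```python
from itertools import groupby


def resolve_orientation(items, base):
    """Resolve each "N" to "U" or "S"; None entries are skipped entirely."""
    visible = [(idx, c) for idx, c in enumerate(items) if c is not None]
    # Split the visible items into alternating neutral / non-neutral runs.
    runs = [(is_n, list(group))
            for is_n, group in groupby(visible, key=lambda p: p[1] == "N")]
    resolved = list(items)
    for k, (is_neutral, run) in enumerate(runs):
        if not is_neutral:
            continue
        before = runs[k - 1][1][-1][1] if k > 0 else base
        after = runs[k + 1][1][0][1] if k + 1 < len(runs) else base
        choice = before if before == after else base
        for idx, _ in run:
            resolved[idx] = choice
    return resolved


# Three sideways names separated by ", " and ended by "." in upright text.
unwrapped = ["S", "N", "N", "S", "N", "N", "S", "N"]
wrapped = [None, "N", "N", None, "N", "N", None, "N"]

print(resolve_orientation(unwrapped, "U"))  # separators follow the names
print(resolve_orientation(wrapped, "U"))    # separators follow the base
```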

Conclusion
----------

Bidi-style neutral resolution may gain us some things, but it's not a panacea,
and it introduces a lot of complexity.

Pros:
   It solves a handful of ambiguous cases.

Cons:
   Needs heuristics that are not very reliable and can give unexpected results.
   Understanding the why of those results and how to fix them will be hard.

Unchanged:
   We still need to categorize all the symbols and punctuation as to how they
   should behave in vertical text; we just have a new bucket ("context-dependent")
   in addition to the others ("upright", "sideways", "stylistic choice, so
   default to sideways").

~fantasai

Received on Friday, 1 July 2011 02:33:17 UTC