- From: fantasai <fantasai.lists@inkedblade.net>
- Date: Thu, 30 Jun 2011 19:32:51 -0700
- To: "www-style@w3.org" <www-style@w3.org>
John Daggett wanted me to write a spec for bidi-style resolution of punctuation orientation for vertical text. Let's first review how the bidi algorithm deals with neutrals.

Note: I'm going to simplify this a little; if you want the full details, go read the bidi spec.
  http://unicode.org/reports/tr9/

Neutral Resolution in UAX9 (simplified)
---------------------------------------

All characters in Unicode are classified as having an intrinsic directionality:
  - left (L) for LTR scripts
  - right (R) for RTL scripts
  - neutral (N) for symbols and punctuation

The neutral characters must be resolved to either L or R before reordering happens. This is done by looking at their surrounding characters:
  * neutrals surrounded by L become L
      LNNNL -> LLLLL
  * neutrals surrounded by R become R
      RNNNR -> RRRRR

But sometimes this is not enough to make a clear distinction:
  - LNNNR -> L???R

This is where the paragraph's base directionality comes in. In an LTR paragraph, such conflicts resolve to L, and in an RTL paragraph they resolve to R:

  <p dir=ltr>LNNNR</p> -> <p dir=ltr>LLLLR</p>
  <p dir=rtl>LNNNR</p> -> <p dir=rtl>LRRRR</p>

The base paragraph direction is also used to set the direction of the start and end of the paragraph, so that neutrals at the beginning and end of the text (such as quotes and periods) resolve correctly.

  <p dir=ltr>LRN</p> -> <p dir=ltr>LRL</p>
  <p dir=rtl>RLN</p> -> <p dir=rtl>RLR</p>

These rules work correctly in many cases, but not all. For example, if you are listing words or phrases in the opposite direction (such as English product names in an Arabic sentence), the neutrals will resolve incorrectly.

  List of words: AAA, BBB, CCC.
  -> List of words: CCC ,BBB ,AAA.

To work around this, Unicode has invisible characters with a strong directionality (such as LRM, U+200E), which can be strategically placed to break the run:

  List of words: AAA‎, BBB‎, CCC.
  -> List of words: AAA, BBB, CCC.
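
To make the simplified rules above concrete, here is a minimal Python sketch. It is not UAX9: it knows only the three classes L, R, and N, and it ignores weak types, embedding levels, and explicit formatting characters. The function name resolve_neutrals and the idea of passing a string of class letters instead of real text are assumptions made purely for illustration.

```python
def resolve_neutrals(classes: str, base: str) -> str:
    """Resolve each run of 'N' in a string of 'L'/'R'/'N' class letters.

    `base` is the paragraph's base direction, 'L' or 'R'. It stands in
    for the missing neighbour at the start and end of the paragraph and
    breaks ties when a run sits between an L and an R.
    (Hypothetical helper; a simplification, not the UAX9 algorithm.)
    """
    if base not in ("L", "R"):
        raise ValueError("base must be 'L' or 'R'")
    out = list(classes)
    n = len(out)
    i = 0
    while i < n:
        if out[i] != "N":
            i += 1
            continue
        # Find the end of this run of neutrals.
        j = i
        while j < n and out[j] == "N":
            j += 1
        before = out[i - 1] if i > 0 else base
        after = out[j] if j < n else base
        # Matching neighbours win; otherwise fall back to the base direction.
        resolved = before if before == after else base
        out[i:j] = resolved * (j - i)
        i = j
    return "".join(out)


print(resolve_neutrals("LNNNR", "L"))      # LLLLR
print(resolve_neutrals("LNNNR", "R"))      # LRRRR
print(resolve_neutrals("RRRNNRRR", "L"))   # RRRRRRRR
print(resolve_neutrals("RRRLNNRRR", "L"))  # RRRLLLRRR
```

The last two calls model two RTL words separated by ", " inside an LTR paragraph. Without help, the separator is absorbed into the RTL run; with a strong L character placed after the first word, the separator resolves to L and the run is broken.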
In CSS3/HTML5, we can instead isolate the phrase, so that it is resolved independently internally, and does not affect anything outside:

  List of words: <bdi>AAA</bdi>, <bdi>BBB</bdi>, <bdi>CCC</bdi>. <!-- totally safe -->

Resolving Neutral Orientation
-----------------------------

Now with some background on how the bidi algorithm works, we can look at how to map this behavior to resolving text orientation. There are several open questions:

  A) Which characters are neutral for orientation resolution?
  B) Which characters are affected by orientation resolution?
  C) How do we determine the base orientation?
  D) How do we give the author the ability to change the automatic behavior when it's wrong?

Suggestion for A (characters neutral for orientation resolution):

  Treat all punctuation (P*), symbols (S*), and ASCII digits (0-9) as neutral for the purposes of orientation resolution.

Suggestions for B (characters affected by orientation resolution):

  Not all neutral characters have an ambiguous orientation that should be resolved automatically. In fact, very few do. We need to make sure that we only automatically determine the orientation of characters whose orientation is subject to change based on context. Characters that have a definite orientation, or for which orientation is a stylistic (rather than contextual) preference, should not be subject to this automatic resolution.

  1. First, we can dismiss any symbols and punctuation that belong to a particular script (i.e. are not Common) as taking the orientation of that script.

  2. Second, we can cut out a lot of the punctuation by general category:
       http://www.unicode.org/reports/tr44/#General_Category_Values
       * dashes and connector punctuation (Pd, Pc) should be rendered to extend from top to bottom; if the font has vertical metrics, we can assume that it handles this already, otherwise they should be set sideways
       * enclosing punctuation (Pi, Ps, Pf, Pe) should enclose the text; again, if the font has vertical metrics, we can assume that it handles this already, otherwise they should be set sideways
     This leaves only the Other Punctuation category, Po.

  3. Third, we can cut out a lot of the symbols by general category as well: Currency Symbols (Sc) and Math Symbols (Sm) are typically set sideways even in vertical CJK, and therefore shouldn't be subject to automatic resolution. This leaves only the Modifier Symbol (Sk) and Other Symbols (So) categories.

  4. Let's look more closely at the Sk category:
       http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DSk%3A]%26[%3AScript%3DCommon%3A]
       * Most of these are related to either Latin or IPA or UPA -> sideways
       * There's a Greek character -> sideways
       * And two voicing marks for kana -> upright
         (These are listed in ScriptExtensions.txt as part of Hira and Kana.)
       * The last three are fullwidth characters -> upright
       * The rest consists of several blocks of "tone letters", which are also intended for use with Latin-based phonetic transcriptions (See Unicode 6.0 section 7.8) -> sideways
     Nothing left that's ambiguous.

  5. Let's look more closely at the Po category:
       http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DSo%3A]%26[%3AScript%3DCommon%3A
     Working backward (because it's easier):
       * Aegean numbers should be listed in ScriptExtensions.txt -- that's a Unicode bug
       * U+FF61, U+FF64, U+FF65 are CJK halfwidth punctuation and should be set accordingly
       * The Fullwidth and Small form variants should be set upright, since as Wide characters they can be assumed to belong with CJK.
       * Overscores and underscores should be set sideways, parallel with the baseline
       * Sesame dots, vertical presentation forms, and other CJK punctuation should be set upright.
       * U+2E18-U+2E31 I'm unsure of.
       * Ancient Greek and New Testament symbols should be set sideways
       * Archaic punctuation (U+2056-U+205E) is unclear
       * Double punctuation for vertical text should be upright
       * General punctuation U+2016-U+2057 could be ambiguous
       * The characters from the Vedic, Mongolian, Hanunoo, Runic, Georgian, Devanagari, Arabic, Armenian, and Greek And Coptic blocks should of course be set sideways.
       * Basic Latin and Latin 1 Supplement (U+0021-U+0387) remain ambiguous
     Characters with context-dependent orientation:
       Common Other Punctuation from the ranges U+2016-U+2057 and U+0021-U+0387, and potentially also U+2E18-U+2E31 and U+2056-U+205E.

  6. Now let's look more closely at the So category:
       http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DSo%3A]%26[%3AScript%3DCommon%3A]&g=
     We can roughly group the symbols into

       Alchemical Symbols
         - Alchemical Symbols block
       Musical Symbols
         - Musical Symbols block
         - Byzantine Musical Symbols block
         - Miscellaneous Symbols - Musical symbols
       Religious / Cultural Symbols
         - Tibetan - Religious symbols
         - Tai Xuan Jing Symbols block
         - Yijing Hexagram Symbols block
         - Miscellaneous Symbols - Medical and healing symbols
         - Miscellaneous Symbols - Religious and political symbols
         - Miscellaneous Symbols - Yijing trigram symbols
         - Miscellaneous Symbols - Astrological signs
         - Miscellaneous Symbols - Zodiacal symbol
         - Miscellaneous Symbols - Yijing monogram and digram symbols
       Geometric shapes
         - Miscellaneous Symbols And Arrows - anything not "arrows"
         - Dingbats - Stars, asterisks and snowflakes
         - Dingbats - Crosses
         - Miscellaneous Symbols - Syriac cross symbols
         - Miscellaneous Symbols - Pentagram symbols
         - Miscellaneous Symbols - Circles
         - Geometric Shapes - Control code graphics
         - Geometric Shapes - Geometric shapes
       Drawing
         - Box Drawing block
         - Block Elements block
       Arrows
         - Arrows block
         - Miscellaneous Symbols And Arrows - anything with "arrows"
         - Dingbats - Dingbat arrow
         - Miscellaneous Symbols - Pointing hand symbols
       Game Symbols
         - Playing Cards block
         - Domino Tiles block
         - Mahjong Tiles block
         - Miscellaneous Symbols - Chess symbols
         - Miscellaneous Symbols - Playing card symbols
         - Miscellaneous Symbols - Go markers
         - Miscellaneous Symbols - Dice
         - Miscellaneous Symbols - Symbols for draughts and checkers
         - Miscellaneous Symbols - Japanese chess symbols
       Pictographs
         - Transport and Map Symbols block
         - Miscellaneous Symbols and Pictographs block
         - Phaistos Disc
         - Dingbats - Miscellaneous ranging U+2701-U+2712
         - Miscellaneous Symbols - blocks not mentioned elsewhere
       Emoticons
         - Miscellaneous Symbols - Emoticons
         - Emoticons block
       Enclosed Text
         - Enclosed Ideographic Supplement block
         - Enclosed Alphanumeric Supplement block
         - Enclosed CJK Letters And Months block except
             - Squared Latin abbreviation
             - Telegraph symbols for months
       Block Ligatures - Latin
         - Enclosed CJK Letters and Months - Squared Latin abbreviation
         - CJK Compatibility - Squared Latin abbreviation
       Block Ligatures - CJK Ideographic
         - CJK Compatibility (everything else)
         - Enclosed CJK Letters and Months - Telegraph symbols for months
       Ancient Writing
         - Aegean Numbers block
         - Ancient Symbols block (Roman symbols)
       Modern Writing
         - Halfwidth and Fullwidth Forms block
         - Arabic Presentation Forms A block
         - Common Indic Number Forms block
       Ideographic Pieces
         - Ideographic Description Characters block
         - Kanbun block
         - CJK Strokes block
       Ornaments
         - Dingbats - Punctuation ornaments
         - Dingbats - Miscellaneous U+2753-U+2757
       Other Symbols
         - CJK Symbols and Punctuation - U+3004-U+303F
         - Dingbats - Heavy variants of arithmetic symbols
         - Dingbats - Miscellaneous ranging U+2713-U+2752, U+2758-U+27BF
         - Miscellaneous Symbols - Genealogical symbols
         - Miscellaneous Symbols - Gender symbol
         - Miscellaneous Symbols - Recycling symbols
         - Control Pictures block
         - Miscellaneous Technical block
       Letterlike and Latin Symbols
         - Latin 1 block
         - Letterlike Symbols block

     The Ancient and Modern Writing symbols should just follow the practice of the scripts they're used in.

     Letterlike Symbols and actual letters overlap in usage, so it would be, imho, a bad idea to treat one differently from the other. It would be odd, for example, for an inverted ohm sign to be treated differently from omega. Therefore I suggest making these sideways; they can be tipped upright, like the omega, explicitly if needed. (For consistency, the copyright sign and the registered sign should join the sound recording copyright sign in this.)
     But alternatively we could make them context-dependent; I am unsure whether that would give correct results, however, as it's unclear whether they should be set upright in CJK text or not.

     The broken bar is EAW=Na, which suggests a sideways setting. That leaves the section sign, degree sign, and pilcrow sign (U+00A7, U+00B0, U+00B6) as ambiguous in Latin 1.

     Koji and I investigated arrows and drawing characters and concluded that these, too, should be set sideways, including for CJK.

     Should the Ornaments be set sideways as well?

     For the various pictographic and geometric symbols, what is an appropriate setting? Should the snowman be upright or sideways? Does this depend on CJK vs. Latin context, or is it a stylistic preference, or does everybody just want them upright?

     The enclosed and ligated ideographic characters and the descriptive pieces should certainly be upright. What about enclosed Latin?

     It seems unlikely that the Latin block ligatures (units) would be used outside CJK context, so they do not need to be ambiguous. Should they be treated as letter-like symbols (set sideways by default) or be set upright by default?

     Characters with context-dependent orientation:
       - U+00A7, U+00B0, U+00B6 from Latin 1
       - Potentially all pictographs, geometric shapes, and other symbols (recycling, etc)
       - Potentially enclosed alphanumerics
       - Potentially all letterlike symbols

Suggestions for C (base orientation):

  In order to function correctly, the bidi algorithm needs a correctly-set base directionality for the paragraph. This is a property of the content, and not something that can be reliably auto-detected; this is why we have the 'dir' attribute in HTML. Likewise, to automatically resolve the text's orientation, we need to know its base orientation.
  We could use a heuristic like "check the first character in the paragraph", but since that will give the wrong answer if the paragraph begins with a foreign word or name (such as an English product name), we need to be able to encode this information in the HTML.

  This can be done with the HTML 'lang' attribute, whose language tag can include a script subtag from ISO 15924. If a document is tagged as lang="zh-Hant", we know it is written in traditional Chinese, and therefore will have an upright base orientation. Similarly, if a document is tagged as lang="ja-Jpan", we know it is written in a combination of Han, Hiragana, and Katakana, and its base orientation is upright.

  A potentially tricky case is Mongolian, which should have ambiguous punctuation resolve to sideways, but, if we choose to make pictographic symbols context-dependent, should trigger upright symbols.

  The question then is, what do we do if the script is not tagged (as it almost never will be)? Do we use a heuristic, or default to one orientation or another? If so, which one?

Suggestions for D (author overrides):

  If ambiguous punctuation resolves to the wrong orientation, how can the author let us know the correct orientation (without wrapping the affected punctuation characters in <span>s, as that would probably give very awkward markup)?

  One potential answer would be to say that if an inline's 'text-orientation' property is different from the one that does auto-correction, or its script tag resolves to a different base orientation, then its contents are ignored during neutral resolution. That way a list's items can be individually wrapped with an effect similar to <bdi>.

Conclusion
----------

Bidi-style neutral resolution may gain us some things, but it's not a panacea, and it introduces a lot of complexity.

  Pros: It solves a handful of ambiguous cases.

  Cons: Needs heuristics that are not very reliable and can give unexpected results. Understanding the why of those results and how to fix them will be hard.

  Unchanged: Needing to categorize all the symbols and punctuation as to how they should behave in vertical text; we just have a new bucket ("context-dependent") in addition to the others ("upright", "sideways", "stylistic choice, so default to sideways").

~fantasai
Received on Friday, 1 July 2011 02:33:17 UTC