- From: fantasai <fantasai.lists@inkedblade.net>
- Date: Thu, 30 Jun 2011 19:32:51 -0700
- To: "www-style@w3.org" <www-style@w3.org>
John Daggett wanted me to write a spec for bidi-style resolution of punctuation
orientation for vertical text.
Let's first review how the bidi algorithm deals with neutrals. Note: I'm going
to simplify this a little; if you want the full details, go read the bidi spec.
http://unicode.org/reports/tr9/
Neutral Resolution in UAX9 (simplified)
---------------------------------------
All characters in Unicode are classified as having an intrinsic directionality:
- left (L) for LTR scripts
- right (R) for RTL scripts
- neutral (N) for symbols and punctuation
The neutral characters must be resolved to either L or R before reordering happens.
This is done by looking at their surrounding characters:
* neutrals surrounded by L become L
LNNNL -> LLLLL
* neutrals surrounded by R become R
RNNNR -> RRRRR
But sometimes this is not enough to make a clear distinction:
- LNNNR -> L???R
This is where the paragraph's base directionality comes in. In an LTR paragraph,
such conflicts resolve to L, and in an RTL paragraph they resolve to R:
<p dir=ltr>LNNNR</p> -> <p dir=ltr>LLLLR</p>
<p dir=rtl>LNNNR</p> -> <p dir=rtl>LRRRR</p>
The base paragraph direction is also used to set the direction of the start and
end of the paragraph, so that neutrals at the beginning and end of the text
(such as quotes and periods) resolve correctly.
<p dir=ltr>LRN</p> -> <p dir=ltr>LRL</p>
<p dir=rtl>RLN</p> -> <p dir=rtl>RLR</p>
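The simplified rules above can be sketched in a few lines of Python. This is only an illustration of the simplification in this message, not of UAX9 itself; the function name and the use of plain "L"/"R"/"N" strings are assumptions made for the example.

```python
def resolve_neutrals(classes: str, base: str) -> str:
    """Resolve each N in a string of L/R/N classes (simplified rules only).

    `base` is the paragraph direction, "L" or "R". It also stands in for
    the start and the end of the paragraph.
    """
    if base not in ("L", "R"):
        raise ValueError("base must be 'L' or 'R'")
    out = list(classes)
    i = 0
    while i < len(out):
        if out[i] != "N":
            i += 1
            continue
        # Find the end of this run of neutrals.
        j = i
        while j < len(out) and out[j] == "N":
            j += 1
        before = out[i - 1] if i > 0 else base
        after = out[j] if j < len(out) else base
        # Matching neighbours win; a conflict falls back to the base direction.
        resolved = before if before == after else base
        out[i:j] = [resolved] * (j - i)
        i = j
    return "".join(out)


print(resolve_neutrals("LNNNL", "L"))  # LLLLL
print(resolve_neutrals("LNNNR", "L"))  # LLLLR
print(resolve_neutrals("LNNNR", "R"))  # LRRRR
print(resolve_neutrals("LRN", "L"))    # LRL
```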
These rules work correctly in many cases, but not all. For example, if you are
listing words or phrases in the opposite direction (such as English product
names in an Arabic sentence), the neutrals will resolve incorrectly.
List of words: AAA, BBB, CCC. -> List of words: CCC ,BBB ,AAA.
To work around this, Unicode has invisible characters with a strong directionality,
which can be strategically placed to break the run:
List of words: AAA‎, BBB‎, CCC. -> List of words: AAA, BBB, CCC.
In CSS3/HTML5, we can instead isolate the phrase, so that it is resolved
independently internally, and does not affect anything outside:
List of words: <bdi>AAA</bdi>, <bdi>BBB</bdi>, <bdi>CCC</bdi>. <!-- totally safe -->
Resolving Neutral Orientation
-----------------------------
Now with some background on how the bidi algorithm works, we can look at how to
map this behavior to resolving text orientation. There are several open questions:
A) Which characters are neutral for orientation resolution?
B) Which characters are affected by orientation resolution?
C) How do we determine the base orientation?
D) How do we give the author the ability to change the automatic behavior when
it's wrong?
Suggestion for A (characters neutral for orientation resolution):
Treat all punctuation (P*), symbols (S*), and ASCII digits (0-9) as neutral
for the purposes of orientation resolution.
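As a rough sketch of suggestion A, the test can be written against the General Category that Python's standard unicodedata module reports. The function name is made up for this example, and the module's Unicode version depends on the Python build.

```python
import unicodedata


def is_orientation_neutral(ch: str) -> bool:
    """True for punctuation (P*), symbols (S*), and the ASCII digits 0-9."""
    category = unicodedata.category(ch)  # two-letter code such as "Po" or "Sm"
    return category[0] in ("P", "S") or ch in "0123456789"


print(is_orientation_neutral(","))  # True  (Po)
print(is_orientation_neutral("$"))  # True  (Sc)
print(is_orientation_neutral("7"))  # True  (ASCII digit)
print(is_orientation_neutral("A"))  # False (Lu)
```

Non-ASCII digits are deliberately left out, matching the wording of the suggestion.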
Suggestions for B (characters affected by orientation resolution):
Not all neutral characters have an ambiguous orientation that should be
resolved automatically. In fact, very few do. We need to make sure that
we only automatically determine the orientation of characters whose
orientation is subject to change based on context. Characters that have
a definite orientation, or for which orientation is a stylistic (rather
than contextual) preference, should not be subject to this automatic
resolution.
1. First, we can dismiss any symbols and punctuation that belong to a
particular script (i.e. are not Common) as taking the orientation
of that script.
2. Second, we can cut out a lot of the punctuation by general category:
http://www.unicode.org/reports/tr44/#General_Category_Values
* dashes and connector punctuation (Pd, Pc) should be rendered to
extend from top to bottom; if the font has vertical metrics, we
can assume that it handles this already, otherwise they should
be set sideways
* enclosing punctuation (Pi, Ps, Pf, Pe) should enclose the text;
again, if the font has vertical metrics, we can assume that it
handles this already, otherwise they should be set sideways
This leaves only the Other Punctuation category, Po.
3. Third, we can cut out a lot of the symbols by general category as well:
Currency Symbols (Sc) and Math Symbols (Sm) are typically set sideways
even in vertical CJK, and therefore shouldn't be subject to automatic
resolution. This leaves only the Modifier Symbol (Sk) and Other Symbols
(So) categories.
4. Let's look more closely at the Sk category:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DSk%3A]%26[%3AScript%3DCommon%3A]
* Most of these are related to either Latin or IPA or UPA -> sideways
* There's a Greek character -> sideways
* And two voicing marks for kana -> upright
(These are listed in ScriptExtensions.txt as part of Hira and Kana.)
* The last three are fullwidth characters -> upright
* The rest consists of several blocks of "tone letters", which
are also intended for use with Latin-based phonetic transcriptions
(See Unicode 6.0 section 7.8) -> sideways
Nothing left that's ambiguous.
5. Let's look more closely at the Po category:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DSo%3A]%26[%3AScript%3DCommon%3A
Working backward (because it's easier):
* Aegean numbers should be listed in ScriptExtensions.txt -- that's
a Unicode bug
* U+FF61, U+FF64, U+FF65 are CJK halfwidth punctuation and should be
set accordingly
* The Fullwidth and Small form variants should be set upright, since
as Wide characters they can be assumed to belong with CJK.
* Overscores and underscores should be set sideways, parallel with
the baseline
* Sesame dots, vertical presentation forms, and other CJK punctuation
should be set upright.
* U+2E18-U+2E31 I'm unsure of.
* Ancient Greek and New Testament symbols should be set sideways
* Archaic punctuation (U+2056-U+205E) is unclear
* Double punctuation for vertical text should be upright
* General punctuation U+2016-U+2057 could be ambiguous
* The characters from the Vedic, Mongolian, Hanunoo, Runic, Georgian,
Devanagari, Arabic, Armenian, and Greek And Coptic blocks should
of course be set sideways.
* Basic Latin and Latin 1 Supplement (U+0021-U+0387) remain ambiguous
Characters with context-dependent orientation:
Common Other Punctuation from the ranges U+2016-U+2057 and U+0021-U+0387,
and potentially also U+2E18-U+2E31 and U+2056-U+205E.
6. Now let's look more closely at the So category:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%3AGeneral_Category%3DSo%3A]%26[%3AScript%3DCommon%3A]&g=
We can roughly group the symbols into
Alchemical Symbols
- Alchemical Symbols block
Musical Symbols
- Musical Symbols block
- Byzantine Musical Symbols block
- Miscellaneous Symbols — Musical symbols
Religious / Cultural Symbols
- Tibetan — Religious symbols
- Tai Xuan Jing Symbols block
- Yijing Hexagram Symbols block
- Miscellaneous Symbols — Medical and healing symbols
- Miscellaneous Symbols — Religious and political symbols
- Miscellaneous Symbols — Yijing trigram symbols
- Miscellaneous Symbols — Astrological signs
- Miscellaneous Symbols — Zodiacal symbol
- Miscellaneous Symbols — Yijing monogram and digram symbols
Geometric shapes
- Miscellaneous Symbols And Arrows - anything not "arrows"
- Dingbats — Stars, asterisks and snowflakes
- Dingbats — Crosses
- Miscellaneous Symbols — Syriac cross symbols
- Miscellaneous Symbols — Pentagram symbols
- Miscellaneous Symbols — Circles
- Geometric Shapes — Control code graphics
- Geometric Shapes — Geometric shapes
Drawing
- Box Drawing block
- Block Elements block
Arrows
- Arrows block
- Miscellaneous Symbols And Arrows - anything with "arrows"
- Dingbats - Dingbat arrow
- Miscellaneous Symbols — Pointing hand symbols
Game Symbols
- Playing Cards block
- Domino Tiles block
- Mahjong Tiles block
- Miscellaneous Symbols — Chess symbols
- Miscellaneous Symbols — Playing card symbols
- Miscellaneous Symbols — Go markers
- Miscellaneous Symbols — Dice
- Miscellaneous Symbols — Symbols for draughts and checkers
- Miscellaneous Symbols — Japanese chess symbols
Pictographs
- Transport and Map Symbols block
- Miscellaneous Symbols and Pictographs block
- Phaistos Disc
- Dingbats — Miscellaneous ranging U+2701-U+2712
- Miscellaneous Symbols - blocks not mentioned elsewhere
Emoticons
- Miscellaneous Symbols — Emoticons
- Emoticons block
Enclosed Text
- Enclosed Ideographic Supplement block
- Enclosed Alphanumeric Supplement block
- Enclosed CJK Letters And Months block except
- Squared Latin abbreviation
- Telegraph symbols for months
Block Ligatures - Latin
- Enclosed CJK Letters and Months - Squared Latin abbreviation
- CJK Compatibility — Squared Latin abbreviation
Block Ligatures - CJK Ideographic
- CJK Compatibility (everything else)
- Enclosed CJK Letters and Months — Telegraph symbols for months
Ancient Writing
- Aegean Numbers block
- Ancient Symbols block (Roman symbols)
Modern Writing
- Halfwidth and Fullwidth Forms block
- Arabic Presentation Forms A block
- Common Indic Number Forms block
Ideographic Pieces
- Ideographic Description Characters block
- Kanbun block
- CJK Strokes block
Ornaments
- Dingbats — Punctuation ornaments
- Dingbats — Miscellaneous U+2753-U+2757
Other Symbols
- CJK Symbols and Punctuation - U+3004-U+303F
- Dingbats — Heavy variants of arithmetic symbols
- Dingbats — Miscellaneous ranging U+2713-U+2752, U+2758-U+27BF
- Miscellaneous Symbols — Genealogical symbols
- Miscellaneous Symbols — Gender symbol
- Miscellaneous Symbols — Recycling symbols
- Control Pictures block
- Miscellaneous Technical block
Letterlike and Latin Symbols
- Latin 1 block
- Letterlike Symbols block
The Ancient and Modern Writing symbols should just follow the practice
of the scripts they're used in.
Letterlike Symbols and actual letters overlap in usage, so it would
be, imho, a bad idea to treat one different from the other. It would
be odd, for example, for an inverted ohm sign to be treated differently
from omega. Therefore I suggest making these sideways; they can be
tipped upright, like the omega, explicitly if needed. (For consistency,
the copyright sign and the registered sign should join the sound
recording copyright sign in this.) But alternatively we could make
them context-dependent; I am unsure whether that would give correct
results, however, as it's unclear whether they should be set upright in
CJK text or not.
The broken bar is EAW=Na, which suggests a sideways setting.
That leaves the section sign, degree sign, and pilcrow sign
(U+00A7, U+00B0, U+00B6) as ambiguous in Latin 1.
Koji and I investigated arrows and drawing characters and concluded
that these, too, should be set sideways, including in CJK text.
Should the Ornaments be set sideways as well?
For the various pictographic and geometric symbols, what is an
appropriate setting? Should the snowman be upright or sideways?
Does this depend on CJK vs. Latin context, or is it a stylistic
preference, or does everybody just want them upright?
The enclosed and ligated ideographic characters and the descriptive
pieces should certainly be upright. What about enclosed Latin?
It seems unlikely that the Latin block ligatures (units) would be
used outside CJK context, so they do not need to be ambiguous.
Should they be treated as letter-like symbols (set sideways by
default) or be set upright by default?
Characters with context-dependent orientation:
- U+00A7, U+00B0, U+00B6 from Latin 1
- Potentially all pictographs, geometric shapes, and other symbols
(recycling, etc)
- Potentially enclosed alphanumerics
- Potentially all letterlike symbols
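To make the bucketing in steps 2, 3, 5, and 6 concrete, here is a minimal Python sketch. The bucket names and the function are invented for illustration. The Script=Common filter from step 1 is omitted because the standard unicodedata module does not expose the Script property. The general categories it reports may also differ from the Unicode 6.0 data discussed here, so the three Latin 1 signs are matched by code point rather than by category.

```python
import unicodedata

# Steps 2 and 3: categories settled without looking at context.
SETTLED = {
    "Pd": "font-or-sideways", "Pc": "font-or-sideways",
    "Ps": "font-or-sideways", "Pe": "font-or-sideways",
    "Pi": "font-or-sideways", "Pf": "font-or-sideways",
    "Sc": "sideways", "Sm": "sideways",
}
# Step 5: Other Punctuation ranges listed as context-dependent.
PO_RANGES = [(0x0021, 0x0387), (0x2016, 0x2057)]
# Step 5: ranges listed only as "potentially" context-dependent.
PO_MAYBE_RANGES = [(0x2E18, 0x2E31), (0x2056, 0x205E)]
# Step 6: section sign, degree sign, pilcrow sign.
LATIN1_SIGNS = {0x00A7, 0x00B0, 0x00B6}


def in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def classify(ch: str) -> str:
    """Return a rough bucket name for one character."""
    cp = ord(ch)
    category = unicodedata.category(ch)
    if cp in LATIN1_SIGNS:
        return "context-dependent"
    if category in SETTLED:
        return SETTLED[category]
    if category == "Po":
        if in_ranges(cp, PO_RANGES):
            return "context-dependent"
        if in_ranges(cp, PO_MAYBE_RANGES):
            return "possibly-context-dependent"
        return "case-by-case"  # decided per block or script in step 5
    if category in ("Sk", "So"):
        return "case-by-case"  # decided per block in steps 4 and 6
    return "not-covered"


for ch in "!-$\u00a7\u3001A":
    print(f"U+{ord(ch):04X}", classify(ch))
```

The "potentially" items for So (pictographs, enclosed alphanumerics, letterlike symbols) are left in the case-by-case bucket, since the questions about them are still open.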
Suggestions for C (base orientation):
In order to function correctly, the bidi algorithm needs a correctly-set
base directionality for the paragraph. This is a property of the content,
and not something that can be reliably auto-detected; this is why we have
the 'dir' attribute in HTML.
Likewise, to automatically resolve the text's orientation, we need to
know its base orientation. We could use a heuristic like "check the first
character in the paragraph", but since that will give the wrong answer if
the paragraph begins with a foreign word or name (such as an English product
name), we need to be able to encode this information in the HTML.
This can be done with the HTML 'lang' attribute, which can accept script subtags
from ISO 15924. If a document is tagged as lang="zh-Hant", we know it is
written in traditional Chinese, and therefore will have an upright base
orientation. Similarly if a document is tagged as lang="ja-Jpan", we know
it is written in a combination of Han, Hiragana, and Katakana, and its
base orientation is upright.
A potentially tricky case is Mongolian, which should have ambiguous
punctuation resolve to sideways, but if we choose to make pictographic
symbols context-dependent, should trigger upright symbols.
The question then is, what do we do if the script is not tagged (as it
almost never will be)? Do we use a heuristic, or default to one orientation
or another? If so, which one?
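A minimal sketch of the tagged case, in Python. Only the Hant and Jpan subtags come from this message; the other entries in the upright set, the "everything else is sideways" fallback, and the function names are assumptions for illustration. An untagged script returns None, which leaves the open question above unanswered.

```python
from typing import Optional

# ISO 15924 script subtags assumed to imply an upright base orientation.
# Only Hant and Jpan are taken from the message; the rest are guesses.
UPRIGHT_SCRIPTS = {"Hant", "Hans", "Hani", "Jpan", "Hira", "Kana", "Kore", "Hang"}


def script_subtag(lang: str) -> Optional[str]:
    """Return the script subtag of a language tag such as "zh-Hant", if any."""
    subtags = lang.split("-")
    if len(subtags[0]) == 1:
        return None  # private-use or grandfathered tag such as "x-..."
    for sub in subtags[1:]:
        if len(sub) == 1:
            break  # a singleton starts an extension or private-use section
        if len(sub) == 4 and sub.isascii() and sub.isalpha():
            return sub.title()  # normalise case: "hant" -> "Hant"
    return None


def base_orientation(lang: str) -> Optional[str]:
    """Return "upright", "sideways", or None when no script is tagged."""
    script = script_subtag(lang)
    if script is None:
        return None
    return "upright" if script in UPRIGHT_SCRIPTS else "sideways"


print(base_orientation("zh-Hant"))  # upright
print(base_orientation("ja-Jpan"))  # upright
print(base_orientation("mn-Mong"))  # sideways
print(base_orientation("en"))       # None
```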
Suggestions for D (author overrides):
If ambiguous punctuation resolves to the wrong orientation, how can the
author let us know the correct orientation (without wrapping the affected
punctuation characters in <span>s, as that would probably give very awkward
markup)?
One potential answer would be to say that if an inline's 'text-orientation'
property is different from the one that does auto-correction, or its script
tag resolves to a different base orientation, then its contents are ignored
during neutral resolution. That way a list's items can be individually
wrapped with an effect similar to <bdi>.
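One possible reading of this, sketched in Python: from the outside, an inline that opts out is treated like a neutral, so it neither resolves its neighbours nor is changed by them. The U/S/N classes, the (class, isolated) pairs, and the function name are all assumptions for this example; nothing here is specified behavior.

```python
def resolve_orientation(items, base):
    """Resolve neutral orientation, ignoring isolated inlines.

    `items` is a list of (cls, isolated) pairs, where cls is "U" (upright),
    "S" (sideways), or "N" (neutral). `base` is "U" or "S".
    """
    # Seen from outside, an isolated inline contributes no strong class.
    outer = ["N" if isolated else cls for cls, isolated in items]
    resolved = list(outer)
    i = 0
    while i < len(outer):
        if outer[i] != "N":
            i += 1
            continue
        j = i
        while j < len(outer) and outer[j] == "N":
            j += 1
        before = outer[i - 1] if i > 0 else base
        after = outer[j] if j < len(outer) else base
        value = before if before == after else base
        for k in range(i, j):
            resolved[k] = value
        i = j
    # Isolated inlines keep their own class; their insides are not touched.
    return [cls if isolated else r for (cls, isolated), r in zip(items, resolved)]


# Three sideways items separated by commas, in an upright paragraph.
plain = [("S", False), ("N", False), ("S", False), ("N", False), ("S", False)]
wrapped = [("S", True), ("N", False), ("S", True), ("N", False), ("S", True)]
print(resolve_orientation(plain, "U"))    # ['S', 'S', 'S', 'S', 'S']
print(resolve_orientation(wrapped, "U"))  # ['S', 'U', 'S', 'U', 'S']
```

With the items wrapped, the commas follow the paragraph's base orientation instead of the list items.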
Conclusion
----------
Bidi-style neutral resolution may gain us some things, but it's not a panacea,
and it introduces a lot of complexity.
Pros:
It solves a handful of ambiguous cases.
Cons:
Needs heuristics that are not very reliable and can give unexpected results.
Understanding the why of those results and how to fix them will be hard.
Unchanged:
Needing to categorize all the symbols and punctuation as to how they should
behave in vertical text; we just have a new bucket ("context-dependent")
in addition to the others ("upright", "sideways", "stylistic choice, so
default to sideways").
~fantasai
Received on Friday, 1 July 2011 02:33:17 UTC