L2/11-327

A property for Japanese text layout

 

August 5, 2011
Eric Muller, Adobe Systems

 

Vertical text

When Japanese text is presented vertically, some character occurrences are presented upright (e.g. occurrences of の) while others are presented on their side, or rotated (e.g. occurrences of arabic).

(Additionally, some characters adopt a slightly different glyph when they are in a vertical line rather than a horizontal line, e.g. the ideographic punctuation, square katakana; however, that is not the subject of the discussion here.)

For many character occurrences, the practice is well understood and there is a de facto standard behavior: every engine will present the kanji and kana character occurrences upright, the fullwidth Latin character occurrences upright and the ASCII character occurrences rotated. However, that de facto standardization does not extend to all characters in Unicode; yet the orientation of characters can be critical to the meaning of the text (think arrows), so we have an interchange problem.

One possible approach is to say that for all the character occurrences for which there is no de facto agreement and the result matters, the author should use markup to indicate the proper orientation. However, this approach is unsatisfactory in two respects: first, the extent of the de facto standard (i.e. the character occurrences for which all rendering engines will produce the same answer) is not well-defined; and second, because that scope is certainly a subset of Unicode, authors will need to include a fair amount of markup - or, stated another way, authoring documents becomes difficult.

So what we need is a 1) de jure standard, that 2) covers all of Unicode. It is unlikely that the orientation specified by this standard will always be the desired one, so there needs to be room in markup systems to override the determinations of this standard, but it is also desirable to 3) make the determination as good as possible, to minimize the use of markup.

We envision that this de jure standard will take the form of a property that drives an algorithm formulated as rules, much like line breaking or word boundaries (of course, the rules would not be of the form "there is/is not a break opportunity/word boundary at this position in this pattern of classes", but of the form "the occurrence at this position in this pattern of classes is upright/rotated"). Rules could be tailorable, e.g. to account for differences between different locales.
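To make the shape of such a property and rule set concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the class names (CJK, FW, ASCII, OTHER), the code point ranges used to assign them, and the rules themselves, including the contextual rule for OTHER, are hypothetical and are not part of this proposal. It only shows how a rule of the form "the occurrence at this position in this pattern of classes is upright/rotated" could be evaluated, and how a tailored rule list could be substituted.

```python
UPRIGHT, ROTATED = "upright", "rotated"


def orientation_class(ch):
    """Hypothetical property lookup; the real property values are not yet defined."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:
        return "CJK"    # hiragana, katakana, CJK unified ideographs
    if 0xFF01 <= cp <= 0xFF5E:
        return "FW"     # fullwidth forms of ASCII
    if 0x21 <= cp <= 0x7E:
        return "ASCII"  # printable ASCII
    return "OTHER"      # everything else, i.e. where no de facto agreement is assumed


# Each rule is (previous class, current class, next class, orientation).
# None is a wildcard. Rules are tried in order; the first match wins.
DEFAULT_RULES = [
    (None, "CJK", None, UPRIGHT),
    (None, "FW", None, UPRIGHT),
    (None, "ASCII", None, ROTATED),
    ("CJK", "OTHER", "CJK", UPRIGHT),  # invented contextual rule, for illustration only
    (None, "OTHER", None, ROTATED),
]


def orientations(text, rules=DEFAULT_RULES):
    """Return one orientation per character occurrence in text."""
    classes = [orientation_class(ch) for ch in text]
    result = []
    for i, cur in enumerate(classes):
        prev = classes[i - 1] if i > 0 else None
        nxt = classes[i + 1] if i + 1 < len(classes) else None
        for p, c, n, orient in rules:
            if c == cur and p in (None, prev) and n in (None, nxt):
                result.append(orient)
                break
        else:
            result.append(ROTATED)  # fallback when no rule matches
    return result
```

A locale tailoring would then amount to passing a different list as the rules argument, while markup would override the result for individual occurrences.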

There are a number of standardization organizations that could develop such a standard: JIS, the W3C, font format specifiers, Unicode. We believe that Unicode is the best-suited organization. It has access to the proper expertise, the apparatus to distribute such a property+algorithm, and is in the best position to update it as new versions of the Unicode standard are released.

Proposal 1 is to develop a Unicode enumerated property+algorithm. I am willing to take the lead on both the development and the maintenance of this effort.

This proposal was arrived at after extensive discussions with the W3C CSS working group, in particular John Daggett, who supports it.

Line formation and justification

Another aspect of Japanese typography is that line formation and justification follow a different model than Western typography. Whereas the latter essentially puts the glyphs one after the other, and adjusts the width of spaces (which correspond to characters) to justify lines, Japanese typography has a model of adding space on each side of each character, in quantities that depend on the context, and has a hierarchy among those spaces to achieve justification (see "Requirements for Japanese Text Layout", W3C Working Group Note, 4 June 2009, at <http://www.w3.org/TR/2009/NOTE-jlreq-20090604/>).
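The idea of a hierarchy among spaces can be illustrated with a small Python sketch. It is not the JLREQ procedure: the representation of each inter-character gap as a (priority, maximum stretch) pair, the function name justify, and the numeric values in the usage note are all assumptions made for illustration. The sketch only shows the general principle that gaps of the highest priority absorb the extra space first, and lower-priority gaps are used only once the higher ones are exhausted.

```python
def justify(gaps, extra):
    """Distribute extra space over gaps according to a priority hierarchy.

    gaps:  list of (priority, max_stretch) pairs, one per gap; a lower
           priority number means the gap is stretched earlier (hypothetical model).
    extra: amount of space to distribute, in the same unit as max_stretch.
    Returns the amount added to each gap, in the same order as gaps.
    """
    added = [0.0] * len(gaps)
    for prio in sorted({p for p, _ in gaps}):
        if extra <= 0:
            break
        tier = [i for i, (p, _) in enumerate(gaps) if p == prio]
        capacity = sum(gaps[i][1] for i in tier)
        if capacity == 0:
            continue
        # Stretch every gap of this tier by the same fraction of its maximum.
        share = min(1.0, extra / capacity)
        for i in tier:
            added[i] = gaps[i][1] * share
        extra -= capacity * share
    return added
```

For example, with gaps [(0, 0.5), (1, 0.25), (0, 0.5)] and 1.0 unit of extra space, only the two priority-0 gaps are stretched; the priority-1 gap is touched only if more than 1.0 unit has to be absorbed.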

This processing is also best driven by a classification of characters, and the JLREQ document provides some description of the behaviour of the characters in the Basic Japanese character set (subset 285 of ISO/IEC 10646). However, that description is not sufficient for implementation, and it does not cover the full Unicode character set. The group which produces JLREQ declined to elaborate on its work to make it more useful for implementers.

There is evidently a strong alignment between orientation and line formation/justification. For example, "ideographs", "katakana", "western character" are classes useful for line formation/justification. It most likely would create little additional work to make the property that supports orientation also useful for line formation. This extended property (extended in the sense of having a few more classes) would be used to drive the algorithm described in JLREQ, which is in good enough shape for implementation.

Proposal 2, conditional on Proposal 1, is to make the enumeration of the property fine enough to support line formation and justification.

This proposal was not discussed much with the W3C. Also, it is less critical to interchange.

---