Strawman proposal for UTR #50: Unicode Properties for Vertical Text Layout

What I envision is a property which classifies characters and an 
algorithm driven by it. This is the general approach adopted for Unicode 
processing: most often, characters which are added fall in one of the 
existing classes, and moving to a new version is just a matter of 
updating the table that captures the classification, while the 
algorithm remains unchanged. Furthermore, this offers two points of 
tailoring: one can override the classification of a given character (or 
character occurrence in a document), and one can override the rules 
(e.g. for different locales).

In this case, the property would take enumerated values: U (upright), S 
(sideways), L (hangul leading jamo), V (hangul vowel jamo), T (hangul 
trailing jamo), C (controls), CM (combining marks).

All but U and S are just to be clean wrt the Unicode machinery:
- L, V, T primarily serve to get the combining jamo behavior.
- CM is primarily to make combining marks "invisible" (i.e. they get the 
orientation of their base)
- C is to avoid combining CM with them.

(Actually, it's not entirely clear that we need L, V, T, since we don't 
deal with boundaries; we could instead assign those characters the same 
property value as hangul syllables, and we would probably be fine.)
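
To make the table-plus-algorithm idea concrete, here is a minimal sketch 
in Python. The rows in RANGES are a tiny illustrative stand-in, not the 
tentative assignment itself; the names prop and orientations are 
hypothetical; L, V and T are left out; and the fallback values marked 
"assumption" are mine, not part of the proposal. It only shows CM taking 
the orientation of its base, and C refusing to act as a base.

```python
# Hypothetical, heavily abbreviated property table: (first, last, value).
# A real table would cover all of Unicode; these rows are illustrative.
RANGES = [
    (0x0000, 0x001F, "C"),   # C0 controls
    (0x0020, 0x007E, "S"),   # Basic Latin
    (0x0300, 0x036F, "CM"),  # combining diacritical marks
    (0x3000, 0x303F, "U"),   # CJK symbols and punctuation
    (0x4E00, 0x9FFF, "U"),   # CJK unified ideographs
]


def prop(ch):
    """Look up the property value of a single character."""
    cp = ord(ch)
    for first, last, value in RANGES:
        if first <= cp <= last:
            return value
    return "S"  # assumption: fallback for rows missing from this toy table


def orientations(text):
    """Return one resolved value ("U", "S" or "C") per character."""
    resolved = []
    for ch in text:
        value = prop(ch)
        if value == "CM":
            if resolved and resolved[-1] != "C":
                value = resolved[-1]  # a mark follows its base
            else:
                value = "S"  # assumption: no usable base to inherit from
        resolved.append(value)
    return resolved
```

Moving to a new Unicode version would then only touch RANGES; the 
function bodies stay as they are.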

I have made a tentative assignment for the characters in Unicode 6.1, 
available at 
http://lists.w3.org/Archives/Public/www-archive/2011Sep/att-0000/vertical.html. 
To help the development of the property, this file does not list the CJK 
unified ideographs (which are U), nor the hangul syllables (which are 
S). The general approach is that characters which are obviously for 
ideographic contexts (ideographs, CJK punctuation, fullwidth clones, 
etc) are U, characters which are obviously for other writing systems are 
S, symbols are U.

The arrows are a good example of a difficult situation. Sometimes they 
are used for something like "Press the → button", and the arrow should 
stay right pointing in vertical text; sometimes they are for something 
like "follow recipe A (→ page 342)" (the → stands for "see"), and should 
point downward in vertical text (I have examples of that in Japanese). 
Clearly, it is not possible to determine how a given occurrence of an 
arrow is used without analysis of the text which is way beyond what we 
can do. This is where tailoring by the document will be useful; 
alternatively, this may be treated by a mechanism where an alternative 
content is provided for vertical text (is there such a mechanism already 
in the drafts, or under development?)
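
As a rough illustration of tailoring by the document, the sketch below 
lets a document override the value of one character occurrence. 
Everything here is hypothetical: toy_prop is a stand-in for the real 
property lookup, resolve is an invented name, and overrides are keyed by 
character index only because that is the simplest thing to show. A 
sideways (S) arrow is rotated with the Latin text, so a right-pointing 
arrow ends up pointing down the line.

```python
def toy_prop(ch):
    """Stand-in property lookup: arrows and CJK ideographs are U, the rest S."""
    cp = ord(ch)
    if 0x2190 <= cp <= 0x21FF:   # Arrows block, treated as symbols (U)
        return "U"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK unified ideographs
        return "U"
    return "S"                   # assumption: everything else in this toy


def resolve(text, default_prop, occurrence_overrides=None):
    """Resolve each character, letting the document override occurrences.

    occurrence_overrides maps a character index to "U" or "S".
    """
    overrides = occurrence_overrides or {}
    return [overrides.get(i, default_prop(ch)) for i, ch in enumerate(text)]


text = "A (\u2192 342)"          # the arrow stands for "see"
plain = resolve(text, toy_prop)             # arrow stays upright
tailored = resolve(text, toy_prop, {3: "S"})  # arrow rotated, points down
```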

We will probably want to think about the values for the unassigned 
characters, to maximize forward compatibility. For example, all the code 
points in Plane 2 (U+2xxxx) are pretty much reserved for CJK ideographs, 
so assigning U to those is a good idea.
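
A small sketch of how such defaults could sit behind the table lookup. 
ASSIGNED is a two-entry stand-in for the real data, the function names 
are hypothetical, and the S fallback outside Plane 2 is purely my 
assumption for the sake of a complete example; only the Plane 2 rule 
comes from the paragraph above.

```python
# Toy stand-in for the explicit assignments (code point -> value).
ASSIGNED = {
    0x0041: "S",  # LATIN CAPITAL LETTER A
    0x4E2D: "U",  # a CJK unified ideograph
}


def default_for_unassigned(cp):
    """Value for a code point that has no explicit entry in the table."""
    if 0x20000 <= cp <= 0x2FFFF:
        return "U"  # Plane 2 is pretty much reserved for CJK ideographs
    return "S"      # assumption: placeholder default for everything else


def lookup(cp, assigned=ASSIGNED):
    """Explicit assignment if there is one, otherwise the planned default."""
    if cp in assigned:
        return assigned[cp]
    return default_for_unassigned(cp)
```

With this shape, a character added to Plane 2 in a later version already 
gets U from an older implementation.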

Ignoring L, V, T, CM and C for the moment, every character is either U 
or S. We may want to have additional values, which would be resolved to 
either U or S via some rule a la bidi. This would allow us to take the 
context of a character occurrence into account to determine whether that 
occurrence is U or S. However, my experience with contextual rules is 
that they are very delicate to develop. Furthermore, the kind of 
situation where this could be used is for something like units and 
suffix currencies, which could adopt the orientation of the character 
occurrence they follow; the bad news is that it's extremely hard to 
recognize units and currencies in plain text (not all units and 
currencies have dedicated characters). So I'd prefer to start without 
contextual rules, and see how problematic that is. (Just as we have 
both U+0041 A LATIN CAPITAL LETTER A and U+FF21 Ａ FULLWIDTH LATIN 
CAPITAL LETTER A, many common units have the same dual support - e.g. 
"kg" and U+338F ㎏ SQUARE KG - and those that do not are much less likely 
to be written upright).
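
For reference, this is roughly what the simplest contextual rule of that 
kind would look like. The extra value "F" ("follows the previous 
occurrence") and the function name are invented for this sketch, and the 
S fallback at the start of the text is an assumption. It also shows the 
limit of the approach: the rule only helps once something has already 
decided which characters get F, which is the hard part for units and 
currencies.

```python
def resolve_follow(values):
    """Resolve a hypothetical "F" value to the value of what precedes it.

    values is a list of per-character values drawn from "U", "S" and "F".
    """
    out = []
    for value in values:
        if value == "F":
            # Take the already-resolved value of the previous character.
            value = out[-1] if out else "S"  # assumption: S at text start
        out.append(value)
    return out
```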

Another reason to have additional values, one that would stand even 
without contextual rules, is to be able to tailor for various locales.

Eric.

Received on Thursday, 1 September 2011 03:25:58 UTC