RE: Strawman proposal for UTR #50: Unicode Properties for Vertical Text Layout

Hi Eric, I'm sorry for responding to such an old e-mail, but here is my first pass through your proposal.

1. I don't understand why we need L/V/T here. They are part of a grapheme cluster. Shouldn't they just follow the conventions defined in UAX #29, Unicode Text Segmentation[1]?

2. When determining the orientation of a grapheme cluster, I agree with "they get the orientation of their base", but I have one open issue here: circled alphanumerics. The precomposed ones (e.g., U+2460 CIRCLED DIGIT ONE) will be upright, both in your proposal and in the current spec, but "A" + U+20DD COMBINING ENCLOSING CIRCLE will be sideways if we follow the current definition. Do you have any ideas for solving this issue?
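
To make the discrepancy concrete, here is a minimal Python sketch using only the standard unicodedata module. The cluster_orientation helper and its two-entry BASE_ORIENTATION table are hypothetical, just enough to illustrate the "orientation of the base" rule; they are not taken from your chart or from the spec.

```python
import unicodedata

# Hypothetical two-entry table, for illustration only:
# U+2460 is upright, Latin "A" is sideways.
BASE_ORIENTATION = {"\u2460": "U", "A": "S"}

def cluster_orientation(cluster: str) -> str:
    """Return the orientation of the first non-mark character (the base)."""
    for ch in cluster:
        # General categories Mn, Mc and Me are combining marks.
        if not unicodedata.category(ch).startswith("M"):
            return BASE_ORIENTATION[ch]
    raise ValueError("cluster has no base character")

precomposed = "\u2460"   # CIRCLED DIGIT ONE, a single code point
sequence = "A\u20DD"     # "A" + COMBINING ENCLOSING CIRCLE

print(unicodedata.category("\u20DD"))    # Me (enclosing mark)
print(cluster_orientation(precomposed))  # U
print(cluster_orientation(sequence))     # S
```

The two clusters look alike to a reader, yet they resolve differently because only the base is consulted.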

3. I can't see a clear direction for how you want to handle unified punctuation characters. Some are U, while some are S. I often refer to MS Word and Adobe InDesign; in a separate thread you responded that I shouldn't worry about exporting their data. I agree, and I'm not worried about exporting, but I do worry about their behavior, since that is what East Asian users have been used to for decades and is very likely what they will expect to see in browsers. Because this behavior is so common, even plain text carries the assumption that certain code points appear upright in any software that supports vertical flow. The following are examples of code points where this could be a problem:
U+00B1 PLUS-MINUS SIGN
U+00B7 MIDDLE DOT (Chinese only; since it is a centered dot, one may not notice the difference, though)
U+00F7 DIVISION SIGN
U+2030 PER MILLE SIGN
U+203B REFERENCE MARK (again, one may not notice the difference)
U+2103 DEGREE CELSIUS
U+2116 NUMERO SIGN
U+2121 TELEPHONE SIGN
I myself go back and forth between multilingual capability and existing behavior. Since this is a vertical flow feature for East Asian text, I'm leaning toward prioritizing existing behavior these days, but I hope we can discuss this further.

4. This is similar to the issue above, but Greek is more problematic than I originally thought. The Greek letters in legacy East Asian encodings are not perfect, and making them upright is problematic when Greek text is mixed into East Asian vertical flow. Both your proposal and the current spec prioritize multilingual capability over tradition and therefore make them sideways. However, some letters, such as U+03A9 GREEK CAPITAL LETTER OMEGA, are also used as unit symbols, and people expect them to appear upright. There is U+2126 OHM SIGN, but not many fonts support that code point, and legacy encoding mappings map to U+03A9. I tried the Windows 7 Input Method and it only suggests U+03A9 for OHM SIGN, so I expect a lot of existing documents will break if we make this sideways. I don't have a good solution at this point, though.
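
One more data point on U+2126 vs. U+03A9: OHM SIGN has a canonical decomposition to GREEK CAPITAL LETTER OMEGA, so normalization folds the two together. A quick check with Python's standard unicodedata module (the variable names are just for illustration):

```python
import unicodedata

OHM = "\u2126"    # OHM SIGN
OMEGA = "\u03A9"  # GREEK CAPITAL LETTER OMEGA

# U+2126 has a singleton canonical decomposition to U+03A9.
print(unicodedata.decomposition(OHM))              # 03A9
print(unicodedata.normalize("NFC", OHM) == OMEGA)  # True
print(unicodedata.normalize("NFD", OHM) == OMEGA)  # True
```

So even a document that carefully used U+2126 ends up with U+03A9 once it has been normalized, and a per-code-point property can no longer tell the unit from the letter.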

5. I didn't find the PUA in your chart. From what I understand, East Asian users want PUA characters upright, while users of other scripts may want them sideways. It might be okay to make them upright if we have consensus that the feature is primarily for East Asian vertical flow, but can I have your thoughts on this block?

6. I agree that arrows are a difficult case, but if both usages exist and we have to pick one, I'd choose sideways, as in the current CSS3 Writing Modes spec. Arrows are ambiguous, but Box Drawing characters clearly need to be sideways. If we make Box Drawing sideways, I think having arrows behave the same way is less confusing.

7. CSS3 Writing Modes Appendix B: Bi-orientational Transformations[2] defines Egyp, Hang, and Yi as sideways, while your proposal defines them as upright. I have no idea how they should behave; which is correct?

[1] http://unicode.org/reports/tr29/

[2] http://dev.w3.org/csswg/css3-writing-modes/#script-orientations


Regards,
Koji

------
From: www-style-request@w3.org [mailto:www-style-request@w3.org] On Behalf Of Eric Muller
Sent: Thursday, September 01, 2011 12:25 PM
To: www-style
Subject: Strawman proposal for UTR #50: Unicode Properties for Vertical Text Layout

What I envision is a property which classifies characters and an algorithm driven by it. This is the general approach adopted for Unicode processing: most often, characters which are added fall in one of the existing classes, and moving to a new version is just a matter of updating the table that captures the classification, while the algorithm remains unchanged. Furthermore, this offers two points of tailoring: one can override the classification of a given character (or character occurrence in a document), and one can override the rules (e.g. for different locales).

In this case, the property would take enumerated values: U (upright), S (sideways), L (hangul leading jamo), V (hangul vowel jamo), T (hangul trailing jamo), C (controls), CM (combining marks).

All but U and S are just to be clean wrt the Unicode machinery:
- L, V, T primarily serve to get the combining jamo behavior. 
- CM is primarily to make combining marks "invisible" (i.e. they get the orientation of their base)
- C is to avoid combining CM with them. 

(Actually, it's not entirely clear that we need L, V, T, since we don't deal with boundaries; we could instead assign those characters the same property value as hangul syllables, and we would probably be fine.)
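
A rough sketch, in Python, of how the property table, the algorithm and the first tailoring point could fit together. Everything here is hypothetical: the names VO, DEFAULT_TABLE and resolve, the five table entries, and the fallback for a combining mark with no base are all assumptions made for illustration. L, V and T are left out to keep it short.

```python
from enum import Enum

class VO(Enum):
    U = "upright"
    S = "sideways"
    C = "control"
    CM = "combining mark"

# Tiny illustrative excerpt; a real table would cover all of Unicode.
DEFAULT_TABLE = {
    0x0041: VO.S,   # LATIN CAPITAL LETTER A
    0x4E00: VO.U,   # a CJK unified ideograph
    0x2192: VO.U,   # RIGHTWARDS ARROW (assumed U here, as a symbol)
    0x20DD: VO.CM,  # COMBINING ENCLOSING CIRCLE
    0x000A: VO.C,   # LINE FEED
}

def resolve(text, overrides=None):
    """Return one value per character: VO.U, VO.S, or None for controls."""
    table = dict(DEFAULT_TABLE)
    table.update(overrides or {})   # tailoring: override the classification
    resolved = []
    base = None                     # orientation of the most recent base
    for ch in text:
        value = table[ord(ch)]
        if value is VO.C:
            base = None             # marks do not combine with controls
            resolved.append(None)
        elif value is VO.CM:
            # Assumption: a mark without a base falls back to sideways.
            resolved.append(base if base is not None else VO.S)
        else:
            base = value
            resolved.append(value)
    return resolved

print(resolve("A\u20DD"))                  # both sideways
print(resolve("\u2192", {0x2192: VO.S}))   # arrow tailored to sideways
```

Moving to a new Unicode version would then only touch DEFAULT_TABLE, and a document or locale could pass its own overrides without changing resolve.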

I have made a tentative assignment for the characters in Unicode 6.1, available at http://lists.w3.org/Archives/Public/www-archive/2011Sep/att-0000/vertical.html. To help the development of the property, this file does not list the CJK unified ideographs (which are U), nor the hangul syllables (which are S). The general approach is that characters which are obviously for ideographic contexts (ideographs, CJK punctuation, fullwidth clones, etc) are U, characters which are obviously for other writing systems are S, symbols are U.

The arrows are a good example of a difficult situation. Sometimes they are used for something like "Press the → button", and the arrow should stay right pointing in vertical text; sometimes they are for something like "follow recipe A (→ page 342)" (the → stands for "see"), and should point downward in vertical text (I have examples of that in Japanese). Clearly, it is not possible to determine how a given occurrence of an arrow is used without an analysis of the text which is way beyond what we can do. This is where tailoring by the document will be useful; alternatively, this may be treated by a mechanism where alternative content is provided for vertical text (is there such a mechanism already in the drafts, or under development?)

We will probably want to think about the values for the unassigned characters, to maximize forward compatibility. For example, all the code points in Plane 2 (U+2xxxx) are pretty much reserved for CJK ideographs, so assigning U to those is a good idea. 
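
As a minimal sketch of such a default in Python: the function name default_orientation is hypothetical, and the S fallback for everything outside Plane 2 is purely a placeholder assumption, not a proposal.

```python
def default_orientation(cp: int) -> str:
    """Fallback for code points that have no explicit entry in the table."""
    if cp >> 16 == 2:   # Plane 2: U+20000..U+2FFFF, reserved for CJK ideographs
        return "U"
    return "S"          # placeholder assumption for all other planes

print(default_orientation(0x2A700))  # U
print(default_orientation(0x30000))  # S
```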

Ignoring L, V, T, CM and C for the moment, every character is either U or S. We may want to have additional values, which would be resolved to either U or S via some rule a la bidi. This would allow taking the context of a character occurrence into account to determine whether that occurrence is U or S. However, my experience with contextual rules is that they are very delicate to develop. Furthermore, the kind of situation where this could be used is for something like units and suffix currencies, which could adopt the orientation of the character occurrence they follow; the bad news is that it's extremely hard to recognize units and currencies in plain text (not all units and currencies have dedicated characters). So I'd prefer to start without contextual rules, and see how problematic that is. (Just like we have both U+0041 A LATIN CAPITAL LETTER A and U+FF21 Ａ FULLWIDTH LATIN CAPITAL LETTER A, many common units have the same dual support - e.g. "kg" and U+338F ㎏ SQUARE KG - and those that do not are much less likely to be written upright).

Another reason to have additional values, one which would stand even without contextual rules, is to be able to tailor for various locales.

Eric.

Received on Wednesday, 28 September 2011 06:32:47 UTC