- From: Eric Muller <emuller@adobe.com>
- Date: Wed, 31 Aug 2011 20:25:29 -0700
- To: www-style <www-style@w3.org>
- Message-ID: <4E5EFB29.3010207@adobe.com>
What I envision is a property which classifies characters and an algorithm driven by it. This is the general approach adopted for Unicode processing: most often, newly added characters fall into one of the existing classes, and moving to a new version is just a matter of updating the table that captures the classification, while the algorithm remains unchanged. Furthermore, this offers two points of tailoring: one can override the classification of a given character (or character occurrence in a document), and one can override the rules (e.g. for different locales).

In this case, the property would take enumerated values: U (upright), S (sideways), L (hangul leading jamo), V (hangul vowel jamo), T (hangul trailing jamo), C (controls), CM (combining marks). All but U and S are just there to be clean with respect to the Unicode machinery:

- L, V, T primarily serve to get the combining jamo behavior.
- CM is primarily to make combining marks "invisible" (i.e. they get the orientation of their base).
- C is to avoid combining CM with them.

(Actually, it's not entirely clear that we need L, V, T, since we don't deal with boundaries; we could instead assign those characters the same property value as hangul syllables, and we would probably be fine.)

I have made a tentative assignment for the characters in Unicode 6.1, available at http://lists.w3.org/Archives/Public/www-archive/2011Sep/att-0000/vertical.html. To help the development of the property, this file does not list the CJK unified ideographs (which are U), nor the hangul syllables (which are S).

The general approach is that characters which are obviously for ideographic contexts (ideographs, CJK punctuation, fullwidth clones, etc.) are U, characters which are obviously for other writing systems are S, and symbols are U.

The arrows are a good example of a difficult situation.
Sometimes they are used for something like "Press the → button", and the arrow should stay right pointing in vertical text; sometimes they are for something like "follow recipe A (→ page 342)" (the → stands for "see"), and should point downward in vertical text (I have examples of that in Japanese). Clearly, it is not possible to determine how a given occurrence of an arrow is used without an analysis of the text that is way beyond what we can do. This is where tailoring by the document will be useful; alternatively, this may be handled by a mechanism where alternative content is provided for vertical text (is there such a mechanism already in the drafts, or under development?).

We will probably want to think about the values for the unassigned characters, to maximize forward compatibility. For example, all the code points in Plane 2 (U+2xxxx) are pretty much reserved for CJK ideographs, so assigning U to those is a good idea.

Ignoring L, V, T, CM and C for the moment, every character is either U or S. We may want to have additional values, which would be resolved to either U or S via some rule à la bidi. This would allow the context of a character occurrence to be taken into account when determining whether that occurrence is U or S. However, my experience with contextual rules is that they are very delicate to develop. Furthermore, the kind of situation where this could be used is something like units and suffix currencies, which could adopt the orientation of the character occurrence they follow; the bad news is that it's extremely hard to recognize units and currencies in plain text (not all units and currencies have dedicated characters). So I'd prefer to start without contextual rules, and see how problematic that is. (Just like we have both U+0041 A LATIN CAPITAL LETTER A and U+FF21 Ａ FULLWIDTH LATIN CAPITAL LETTER A, many common units have the same dual support, e.g. "kg" and U+338F ㎏ SQUARE KG, and those that do not are much less likely to be written upright.)

Another reason to have additional values, which would stand even without contextual rules, is to be able to tailor for various locales.

Eric.
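The property-plus-algorithm approach described in the message can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `DEFAULT_TABLE` holds only three illustrative entries rather than the linked assignment file, the fallbacks based on Unicode general categories are a crude stand-in for a real classification, L, V and T are left out (the message itself doubts they are needed), and the choice of S for a combining mark that has no base is an assumption.

```python
import unicodedata

# Hypothetical, tiny default table; a real one would be generated from the
# full per-character assignment.
DEFAULT_TABLE = {
    0x2192: "U",  # RIGHTWARDS ARROW (symbols default to U)
    0x338F: "U",  # SQUARE KG
    0xFF21: "U",  # FULLWIDTH LATIN CAPITAL LETTER A
}


def classify(ch, overrides=None):
    """Return the property value of one character: U, S, CM or C."""
    cp = ord(ch)
    # First tailoring point: a document or locale overrides the table.
    if overrides and cp in overrides:
        return overrides[cp]
    if cp in DEFAULT_TABLE:
        return DEFAULT_TABLE[cp]
    # CJK unified ideographs, plus all of Plane 2 (assigned or not).
    if 0x4E00 <= cp <= 0x9FFF or 0x20000 <= cp <= 0x2FFFF:
        return "U"
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return "CM"
    if category == "Cc":
        return "C"
    # Assumption: everything else belongs to "other writing systems".
    return "S"


def resolve(text, overrides=None):
    """Pair each character with its resolved orientation.

    Combining marks take the orientation of their base; controls are
    reported as "C" and never act as a base.
    """
    resolved = []
    base = None
    for ch in text:
        cls = classify(ch, overrides)
        if cls == "CM":
            # Assumption: a mark with no base falls back to S.
            orientation = base if base is not None else "S"
        elif cls == "C":
            base = None
            orientation = "C"
        else:
            base = cls
            orientation = cls
        resolved.append((ch, orientation))
    return resolved


# The "see page 342" arrow, tailored by the document to rotate with the line.
print(resolve("\u2192", overrides={0x2192: "S"}))
```

Moving to a new Unicode version would only touch the table and ranges in `classify`; `resolve` stays unchanged, which is the separation the message argues for.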
Received on Thursday, 1 September 2011 03:25:58 UTC