- From: Eric Muller <emuller@adobe.com>
- Date: Tue, 13 Sep 2011 16:21:58 -0700
- To: Stephen Zilles <szilles@adobe.com>
- CC: www-style <www-style@w3.org>
- Message-ID: <4E6FE596.6010401@adobe.com>
On 9/13/2011 3:25 PM, Stephen Zilles wrote:
>
> The contextual rule is very simple. If a C class character is preceded
> by an S class character it is set sideways and its class become S and
> if it is preceded by a U class character it is set upright and its
> class becomes U. There is no further textual analysis.
>

If you think in terms of semantic analysis, your C class is really that
of suffixes. Right away, there is an obvious extension: prefixes, which
would inherit their orientation from what follows them. E.g. for
U+2116 № NUMERO SIGN.

I understand how contextual classes can *in principle* alleviate the
need for markup. But the question of whether it helps *in practice*
remains.

Let's take the case of U+2030 ‰ PER MILLE SIGN. The question is: of all
the occurrences of this character in Japanese documents, what's the
proportion of the cases where it should be displayed sideways? My bet is
that it's less than 1%. Markup is just fine for such a small proportion.

And I took a very favorable case for you: it's used predominantly as a
suffix (the rule is reliable), and it's used all around the world (it
can occur in sideways fragments - e.g. English - in a Japanese
document). If you take a character like U+2116 № NUMERO SIGN, you are
not going to find it much in English fragments, because we use the
sequence N <sup>o</sup> instead, or something like that, so upright is
going to be the right answer 99.9% of the time. If you look at
U+20AC € EURO SIGN, it's used as a suffix in western fragments, but as a
prefix in Japanese.

>
> If you accept this context rule, then the problem to solve is what
> characters should be marked class “C”. My guess (and it is only a
> guess) is that it is primarily the characters that are marked Common
> in the data for UAX #24 of Unicode[1].
>

Common is way too broad. Unicode labels a character common as soon as it
is used in two different writing systems that use different scripts.
For example, U+060C ، ARABIC COMMA is used as a punctuation mark along
with the Arabic, Syriac and Thaana scripts, and is therefore labeled
common. For our purposes, it's clearly "non Japanese" and should go
sideways.

The question you really want to answer is "is this character used both
in the Japanese writing system and in a non-Japanese writing system".
The Unicode data file ScriptExtensions.txt is going in that direction,
by listing for each common character the scripts where it is known to be
in use. There we learn that U+060C ، ARABIC COMMA is only found with the
three scripts I listed.

There are common characters that are specifically "Japanese", e.g.
U+30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN, and these should go U
(ScriptExtensions.txt tells us that this one is only used with Hiragana
and Katakana).

However, ScriptExtensions.txt is relatively recent and not yet fully
populated.

Eric.
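
To make the contextual rule concrete, here is a minimal sketch of it in Python, together with the prefix extension suggested above. Everything named here is an assumption for illustration: the class "P" for prefixes, the tiny CLASS_OF table, the crude base_class fallback (ASCII letters and digits sideways, everything else upright), and the default used when a C or P character has no neighbor to inherit from. None of it comes from a specification.

```python
from typing import Dict, List

# Hypothetical class assignments, for illustration only.
# "C" = suffix-like (inherits from what precedes it),
# "P" = prefix-like (inherits from what follows it; the proposed extension).
CLASS_OF: Dict[str, str] = {
    "\u2030": "C",  # PER MILLE SIGN
    "\u2116": "P",  # NUMERO SIGN
}


def base_class(ch: str) -> str:
    """Return "U", "S", "C" or "P" for a single character.

    The fallback is a crude stand-in for a real per-character table:
    ASCII letters and digits are S (sideways), everything else is U (upright).
    """
    if ch in CLASS_OF:
        return CLASS_OF[ch]
    return "S" if ch.isascii() and ch.isalnum() else "U"


def resolve(text: str, default: str = "U") -> List[str]:
    """Resolve every character of text to "U" or "S".

    default is an assumption: it is used when a C character has nothing
    before it, or a P character has nothing after it.
    """
    classes = [base_class(ch) for ch in text]

    # Forward pass: a C character takes the resolved class of the
    # character before it, and then counts as that class itself.
    prev = default
    for i, c in enumerate(classes):
        if c == "C":
            classes[i] = prev
        if classes[i] in ("U", "S"):
            prev = classes[i]

    # Backward pass: a P character takes the resolved class of the
    # character after it. Every C is already resolved at this point.
    nxt = default
    for i in range(len(classes) - 1, -1, -1):
        if classes[i] == "P":
            classes[i] = nxt
        nxt = classes[i]

    return classes
```

With this sketch, a per mille sign after an ASCII digit resolves to S and after a kanji numeral to U, with no further analysis of the text. It says nothing about how often either case actually occurs in Japanese documents, which is the practical question.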
Received on Tuesday, 13 September 2011 23:22:39 UTC