Re: use case for font-dependent default orientation

On 9/13/2011 3:25 PM, Stephen Zilles wrote:
>
> The contextual rule is very simple. If a C class character is preceded 
> by an S class character it is set sideways and its class becomes S and 
> if it is preceded by a U class character it is set upright and its 
> class becomes U. There is no further textual analysis.
>

If you think in terms of semantic analysis, your C class is really that 
of suffixes. Right away, there is an obvious extension: prefixes, which 
would inherit their orientation from what follows them, e.g. U+2116 
№ NUMERO SIGN.
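
To make the two rules concrete, here is a minimal Python sketch. It assumes
each character has already been assigned a class: S, U and C as in the quoted
rule, plus a hypothetical P class for prefixes, which is not part of the
original proposal. The function name `resolve` is made up for illustration.

```python
from typing import List


def resolve(classes: List[str]) -> List[str]:
    """Resolve contextual classes to 'S' (sideways) or 'U' (upright).

    'C' takes the resolved class of the preceding character (the quoted rule).
    'P' is a hypothetical prefix class that takes the class of the following
    character. Characters with no usable neighbor are left unresolved.
    """
    out = list(classes)

    # Suffix rule, left to right: a resolved C can serve as context
    # for the next C, since "its class becomes S" (or U).
    for i in range(1, len(out)):
        if out[i] == "C" and out[i - 1] in ("S", "U"):
            out[i] = out[i - 1]

    # Prefix extension, right to left, so that runs of P also chain.
    for i in range(len(out) - 2, -1, -1):
        if out[i] == "P" and out[i + 1] in ("S", "U"):
            out[i] = out[i + 1]

    return out
```

For instance, `resolve(["S", "C", "C"])` gives `["S", "S", "S"]`, because the
first C becomes S and then serves as context for the second. A C with nothing
before it, or a P with nothing after it, is left unresolved.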

I understand how contextual classes can *in principle* alleviate the 
need for markup. But the question of whether it helps *in practice* 
remains. Let's take the case of U+2030 ‰ PER MILLE SIGN. The question 
is: of all the occurrences of this character in Japanese documents, 
what's the proportion of the cases where it should be displayed 
sideways? My bet is that it's less than 1%. Markup is just fine for such 
a small proportion.

And I took a very favorable case for you: it's used predominantly as a 
suffix (the rule is reliable), and it's used all around the world (it 
can occur in sideways fragments - e.g. English - in a Japanese 
document). If you take a character like U+2116 № NUMERO SIGN, you are 
not going to find it much in English fragments, because we use the 
sequence N <sup>o</sup> instead, or something like that, so upright is 
going to be the right answer 99.9% of the time. If you look at U+20AC € 
EURO SIGN, it's used as a suffix in western fragments, but as a prefix 
in Japanese.


>
> If you accept this context rule, then the problem to solve is what 
> characters should be marked class “C”. My guess (and it is only a 
> guess) is that it is primarily the characters that are marked Common 
> in the data for UAX #24 of Unicode[1].
>

Common is way too broad. Unicode labels a character Common as soon as it 
is used in two different writing systems that use different scripts. For 
example, U+060C ، ARABIC COMMA is used as a punctuation mark along with 
the Arabic, Syriac and Thaana scripts, and is therefore labeled Common. 
For our purposes, it's clearly "non-Japanese" and should go sideways.

The question you really want to answer is "is this character used both 
in the Japanese writing system and in a non-Japanese writing system?" 
The Unicode data file ScriptExtensions.txt goes in that direction, 
by listing for each Common character the scripts where it is known to 
be in use. There we learn that U+060C ، ARABIC COMMA is only found with 
the three scripts I listed. There are also Common characters that are 
specifically "Japanese" and should go U, e.g. U+30A0 ゠ KATAKANA-HIRAGANA 
DOUBLE HYPHEN (ScriptExtensions tells us that this one is only used 
with Hiragana and Katakana). However, ScriptExtensions is relatively 
recent and not yet fully populated.
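
As a rough illustration of that question, the Python sketch below sorts
characters into S, U or C from ScriptExtensions-style data. The sample lines
only restate the two examples above in a simplified form of the file's
"code point ; scripts # comment" layout; they are not copied from the real
file. The set of "Japanese" scripts and the function names are assumptions
made for this example.

```python
# Illustrative lines in the style of ScriptExtensions.txt (not the real file).
SAMPLE = """\
060C ; Arab Syrc Thaa # ARABIC COMMA
30A0 ; Hira Kana # KATAKANA-HIRAGANA DOUBLE HYPHEN
"""

# Assumption: the scripts treated as making up the Japanese writing system.
JAPANESE_SCRIPTS = {"Hira", "Kana", "Hani"}


def parse_script_extensions(text):
    """Map each code point to the set of script codes listed for it."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        code_field, scripts_field = (p.strip() for p in line.split(";", 1))
        scripts = set(scripts_field.split())
        # A code field is either a single code point or a "first..last" range.
        if ".." in code_field:
            first, last = code_field.split("..")
        else:
            first = last = code_field
        for cp in range(int(first, 16), int(last, 16) + 1):
            table[cp] = scripts
    return table


def classify(cp, table):
    """Return 'U', 'S', 'C', or 'unknown' when the character is not listed."""
    scripts = table.get(cp)
    if scripts is None:
        return "unknown"  # the data is not yet fully populated
    in_japanese = bool(scripts & JAPANESE_SCRIPTS)
    outside_japanese = bool(scripts - JAPANESE_SCRIPTS)
    if in_japanese and outside_japanese:
        return "C"  # used in both: orientation depends on context
    return "U" if in_japanese else "S"
```

With the sample data, `classify(0x060C, table)` yields `"S"` and
`classify(0x30A0, table)` yields `"U"`. A character missing from the table,
such as U+2030, comes back as `"unknown"`, which is exactly the gap left by
incomplete data.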

Eric.

Received on Tuesday, 13 September 2011 23:22:39 UTC