Re: [css3-writing-modes] Character's intrinsic orientation from Eric Muller on 2011-03-22 (www-style@w3.org from March 2011)

From: Eric Muller <emuller@adobe.com>
Date: Tue, 22 Mar 2011 10:48:58 -0700
To: Koji Ishii <kojiishi@gluesoft.co.jp>
CC: "www-style@w3.org" <www-style@w3.org>, "CJK discussion (public-i18n-cjk@w3.org)" <public-i18n-cjk@w3.org>
Message-ID: <4D88E10A.9070606@adobe.com>

While JLREQ[1] does not provide explicit guidance on which characters to 
rotate in vertical text, I do believe it provides very valuable clues. 
The most important consequence is that there is no completely 
algorithmic method to make the determination.

The place to look is the classification of characters for the purpose 
of line justification (in particular Appendix A). Yes, it is for line 
justification, but I think that one decides whether to rotate a 
character on the same basis as one decides how to space it. I am pretty 
sure that all cl-19 characters are upright and cl-27 are not, for example.

In the JLREQ approach, it's actually character *occurrences* which are 
classified, in one of the 30 classes described in Appendix A. Often, a 
character occurrence can be classified solely on the basis of the code 
point: for example all occurrences of U+30A0 ゠ KATAKANA-HIRAGANA DOUBLE 
HYPHEN are classified as cl-03 Hyphens. On the other hand, occurrences 
of U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK can be classified 
either as cl-01 Opening brackets or cl-27 Western characters (U+00AB 
appears in both table A.1 and table A.27). Unfortunately, JLREQ provides 
no method to classify ambiguous occurrences; I understand that the 
authors could not come to an agreement, and that this is mostly because 
there is no single right way to do it, but rather different house rules 
(e.g. one could treat « » around Latin text in otherwise Japanese text 
as either Japanese or Latin).
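
To make the distinction between code points and occurrences concrete, 
here is a minimal sketch. The table holds only the two examples cited 
above; the names CANDIDATE_CLASSES and classify_by_code_point are 
hypothetical and do not come from JLREQ.

```python
# Hypothetical excerpt: code point -> set of candidate JLREQ classes.
# Only the two entries mentioned in this message are included.
CANDIDATE_CLASSES = {
    0x30A0: {"cl-03"},           # KATAKANA-HIRAGANA DOUBLE HYPHEN
    0x00AB: {"cl-01", "cl-27"},  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
}


def classify_by_code_point(ch):
    """Return the class when the code point alone decides it, else None."""
    candidates = CANDIDATE_CLASSES.get(ord(ch), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    # Ambiguous (or not in this excerpt): a house rule or the author
    # has to decide for each occurrence.
    return None
```

A None result is exactly the case for which JLREQ gives no rule.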

You will also notice that JLREQ limits itself to describing the 
characters in collections 285 and 286 of ISO/IEC 10646; this means that 
fullwidth characters are not listed at all, which is why you find U+0041 
A LATIN CAPITAL LETTER A in cl-19 Ideographic characters, in cl-25 Unit 
symbols (with the remark "proportional"), and in cl-27 (also with the 
remark "proportional"). In practice, it seems that the desktop world 
relies on the existence of the fullwidth characters, and treats U+FF21 
Ａ FULLWIDTH LATIN CAPITAL LETTER A as cl-19, and U+0041 as either cl-25 
or cl-27. However, because the set of fullwidth characters is very 
limited, this is more an opportunistic convenience than a situation that 
can be relied on.
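
As a rough sketch of that desktop practice: the function below uses the 
"<wide>" compatibility decomposition to recognise fullwidth letters. 
Extending the U+FF21 / U+0041 example to all fullwidth and ASCII letters 
is my assumption, and the function name is hypothetical.

```python
import unicodedata


def desktop_practice_classes(ch):
    """Candidate classes under the fullwidth convention (illustrative only)."""
    is_letter = unicodedata.category(ch) in ("Lu", "Ll")
    # Fullwidth forms carry a "<wide>" compatibility decomposition,
    # e.g. U+FF21 decomposes to "<wide> 0041".
    if is_letter and unicodedata.decomposition(ch).startswith("<wide>"):
        return {"cl-19"}
    if is_letter and ch.isascii():
        return {"cl-25", "cl-27"}
    # Everything else has no fullwidth counterpart to lean on.
    return set()
```

A letter such as U+00E9 é falls through to the empty set, which is the 
limitation noted above: the convention only helps where a fullwidth 
counterpart happens to exist.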

So my first point is that the document author needs to be able to 
explicitly control the classification of any character occurrence.
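
One possible shape for such control, as a sketch only: a per-occurrence 
override keyed by position in the text, layered on top of whatever 
default is in use. The default here is a toy, and all names are 
hypothetical.

```python
def default_class(ch):
    # Toy default for illustration only: U+00AB is an opening bracket,
    # everything else is treated as a Western character.
    return "cl-01" if ch == "\u00ab" else "cl-27"


def classify_text(text, overrides=None):
    """Classify every occurrence; `overrides` maps a string index to a class."""
    overrides = overrides or {}
    return [overrides.get(i, default_class(ch)) for i, ch in enumerate(text)]


# The author marks the first « (index 0) as Western and leaves the
# second one (index 2) to the default.
classes = classify_text("\u00abA\u00ab", overrides={0: "cl-27"})
```

The point is only that the same code point can end up in different 
classes at different positions, at the author's request.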

---

That being said, having an automatic, default, classification is a good 
idea. My second point is that the Unicode property EAW (East Asian 
Width) is not usable for that purpose. IMHO, it suffices to observe 
that U+0391 Α GREEK CAPITAL LETTER ALPHA is "A" (Ambiguous), while 
U+0370 Ͱ GREEK CAPITAL LETTER HETA is "N" (Neutral) and U+0531 Ա 
ARMENIAN CAPITAL LETTER AYB is "N". To me, this says 
that EAW is about the emulation of JIS systems in Unicode 
implementations, including handling only the subset of Unicode present 
in JIS. In 2011, this is no longer interesting.
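
The three values can be checked with Python's standard unicodedata 
module, as in the sketch below. The helper name eaw_report is 
hypothetical, and the values returned come from whichever Unicode 
version the interpreter ships with.

```python
import unicodedata

# The three code points discussed above.
SAMPLES = ["\u0391", "\u0370", "\u0531"]


def eaw_report(chars):
    """Map 'U+XXXX' labels to the East Asian Width value of each character."""
    return {f"U+{ord(c):04X}": unicodedata.east_asian_width(c) for c in chars}


report = eaw_report(SAMPLES)
for label, width in report.items():
    print(label, width)
```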

Of course, we can revisit EAW to make it do what we want now, but it 
remains that EAW as currently published and thought of is not the right 
basis.

---

Third point: I think it is a very bad idea to look at fonts to make the 
determination of upright or not. I do believe that the orientation is 
something authors care about and should be able to count on (including 
when they leave the determination to the automatic default); it cannot 
be left at the mercy of user agent font fallback.

Furthermore, there is IMHO no good data in fonts to help you: the 
presence of anything related to vertical typesetting (vmtx, VORG, 'vert' 
feature, etc.) tells you nothing about which orientation to use; it only 
tells you what to do once you have decided on an orientation. Of course, 
I would not support something like "advance equal to 1em".

---

Eric.

Received on Tuesday, 22 March 2011 17:49:42 UTC