RE: [css3-writing-modes] Character's intrinsic orientation from Koji Ishii on 2011-03-23 (public-i18n-cjk@w3.org from January to March 2011)

From: Koji Ishii <kojiishi@gluesoft.co.jp>
Date: Wed, 23 Mar 2011 14:05:18 -0400
To: Eric Muller <emuller@adobe.com>
CC: "www-style@w3.org" <www-style@w3.org>, "CJK discussion (public-i18n-cjk@w3.org)" <public-i18n-cjk@w3.org>
Message-ID: <A592E245B36A8949BDB0A302B375FB4E0AB3D55000@MAILR001.mail.lan>

Hi Eric, thank you for the feedback.

You're right that JLREQ does not provide guidance, but the logic I wrote here has been reviewed by JLTF members. Unfortunately, the JLREQ classes are not comprehensive, as you said, and are not meant for this purpose.


> So my first point is that the document author needs to
> be able to explicitly control the classification of any
> character occurrence.

Are you saying CSS should have a property so that authors can specify which code points are upright and which are rotated sideways? I don't think that would be easy to define, to use, or to implement. I've never seen a product that does this. Does InDesign support it? It may be nice, but it looks to me like too fine-grained a control, one we can revisit in the future (level 4?).


> That being said, having an automatic, default, classification is a good idea.
> My second point is that the Unicode property EAW is not usable for
> that purpose.

It looks like we at least agree on having a default classification. Thank you for that.

As you can see in EAW[1], it has typography/layout in its scope, and the Recommendations section does recommend upright/sideways behavior. The rule is not perfect, but as far as I have looked into it, the combination of the Script property and EAW gives me the result I wanted. I created a spreadsheet to verify the logic I wrote, and it's quite close to what Word or InDesign does.
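
To make the idea of an EAW-driven default concrete, here is a minimal Python sketch. It is not the logic reviewed by JLTF, which is not reproduced in this message. The mapping from EAW values to orientations is an assumption made for illustration, the Script property is left out because Python's standard library does not expose it, and `default_orientation` is a hypothetical name.

```python
import unicodedata

# Assumed mapping, for illustration only: Wide and Fullwidth are upright,
# Ambiguous is left to the UA, and everything else is rotated sideways.
UPRIGHT = {"W", "F"}
UA_DEPENDENT = {"A"}

def default_orientation(ch: str) -> str:
    """Return 'upright', 'ua-dependent', or 'sideways' for one character."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in UPRIGHT:
        return "upright"
    if eaw in UA_DEPENDENT:
        return "ua-dependent"
    return "sideways"

# A CJK ideograph, a fullwidth Latin letter, an ASCII letter, a Greek letter.
for ch in ("\u6f22", "\uff21", "A", "\u03b1"):
    print(f"U+{ord(ch):04X}", default_orientation(ch))
```

A real default would also need the Script property and a policy for the "A" (Ambiguous) code points, which this sketch only flags.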


> IMHO, it suffices to observe that U+0391 Α GREEK CAPITAL LETTER ALPHA is "A",
> while U+0370 Ͱ GREEK CAPITAL LETTER HETA is "N" and
> U+0531 Ա ARMENIAN CAPITAL LETTER AYB is "N".
> To me, this says that EAW is about the emulation of JIS systems in Unicode
> implementations, including handling only the subset of Unicode present in JIS.
> In 2011, this is no longer interesting.

That is the whole point of EAW; I don't assume people would write Greek documents in vertical text flow. When people use these code points, it means they appear in a Japanese or Chinese context as symbols, like alpha, beta, or ohm. JIS defined the common set of Greek letters used in a Japanese context. You're right that JIS is no longer of interest in 2011, but the concept of a "common set of Greek letters used in a Japanese context" is still valid, and most fonts in the wild have glyphs for that purpose.

Note that "A" (Ambiguous) is UA- or font-dependent. CSS may define guidelines, but a UA can handle these code points however it wants.


> Third point: I think it is a very bad idea to look
> at fonts to make the determination of upright or not.

Again, it's "should", not "must", so UAs can do whatever they want for "A" code points.

There are some cases where you need to rely on fonts, like PUA. If the font contains end-user-defined characters for CJK, they must be upright, but PUA can be used for other purposes, so we have no option other than relying on fonts.

If there's a vertical variant GSUB defined for a code point, it should be used. That's also part of the font information. I understand some people don't like logic that relies on fonts, but we will have to.


> Furthermore, there is IMHO no good data in fonts to help you:
> the presence of anything related to vertical typesetting (vmtx,
> vorg, 'vert' feature, etc) is not telling you anything about which
> orientation to use, it's only telling you what to do once you have
> decided on an orientation. Of course, I would not support
> something like "advance equal to 1em".

You're right that TrueType/OpenType unfortunately doesn't have that precise information, so this must be a heuristic. I'm asking Ken Lunde and the Microsoft guys for help, but we're unlikely to specify details in the spec anyway. We may show a few examples, but it's up to the UA how to determine it.


[1] http://unicode.org/reports/tr11/


-----Original Message-----
From: Eric Muller [mailto:emuller@adobe.com] 
Sent: Wednesday, March 23, 2011 2:49 AM
To: Koji Ishii
Cc: www-style@w3.org; CJK discussion (public-i18n-cjk@w3.org)
Subject: Re: [css3-writing-modes] Character's intrinsic orientation

While JLREQ[1] does not provide explicit guidance on which characters to rotate in vertical text, I do believe it provides very valuable clues. 
The most important consequence is that there is no completely algorithmic method to make the determination.

The place you want to look at is the classification of characters for the purpose of line justification (in particular Appendix A). Yes, it is for line justification, but I think that one chooses whether or not to rotate a character on the same basis as one chooses how to space it. I am pretty sure that all cl-19 characters are upright and cl-27 are not, for example.

In the JLREQ approach, it's actually character *occurrences* which are classified, in one of the 30 classes described in Appendix A. Often, a character occurrence can be classified solely on the basis of the code point: for example all occurrences of U+30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN are classified as cl-03 Hyphens. On the other hand, occurrences of U+00AB « LEFT-POINTING DOUBLE ANGLE QUOTATION MARK can be classified either as cl-01 Opening brackets or cl-27 Western characters (U+00AB appears in both table A.1 and table A.27). Unfortunately, JLREQ provides no method to classify ambiguous occurrences; I understand that the authors could not come to an agreement, and that this is mostly because there is no single right way to do it, but rather different house rules (e.g. one could treat « » around Latin text in otherwise Japanese text as either Japanese or Latin).

You will also notice that JLREQ limits itself to describing the characters in collections 285 and 286 of ISO/IEC 10646; this means that fullwidth characters are not listed at all, which is why you find U+0041 A LATIN CAPITAL LETTER A in cl-19 Ideographic characters, in cl-25 Unit symbols (with the remark "proportional"), and in cl-27 (also with the remark "proportional"). In practice, it seems that the desktop world relies on the existence of the fullwidth characters, and treats U+FF21 Ａ FULLWIDTH LATIN CAPITAL LETTER A as cl-19, and U+0041 as either cl-25 or cl-27. However, because the set of fullwidth characters is very limited, this is more an opportunistic convenience than a situation that can be relied on.

So my first point is that the document author needs to be able to explicitly control the classification of any character occurrence.
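
The occurrence-based classification described above can be sketched in Python as follows. The table holds only the three code points discussed in this message, with the classes quoted from JLREQ Appendix A; `POSSIBLE_CLASSES`, `classify`, and the `author_choice` parameter are hypothetical names, and the override mechanism is an assumption about what explicit author control could look like.

```python
# Possible JLREQ Appendix A classes per code point, limited to the
# examples given in this message; a real table would be far larger.
POSSIBLE_CLASSES = {
    0x30A0: {"cl-03"},                    # KATAKANA-HIRAGANA DOUBLE HYPHEN
    0x00AB: {"cl-01", "cl-27"},           # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x0041: {"cl-19", "cl-25", "cl-27"},  # LATIN CAPITAL LETTER A
}

def classify(ch, author_choice=None):
    """Classify one character occurrence; return None if undecidable."""
    classes = POSSIBLE_CLASSES.get(ord(ch), set())
    if len(classes) == 1:
        # The code point alone determines the class.
        return next(iter(classes))
    if author_choice in classes:
        # Ambiguous code point: the author's explicit choice decides.
        return author_choice
    return None

print(classify("\u30a0"))           # decided by the code point alone
print(classify("\u00ab"))           # ambiguous without author input
print(classify("\u00ab", "cl-27"))  # resolved by the author
```

Without author input, the ambiguous occurrences stay unresolved, which is the gap the first point is about.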

---

That being said, having an automatic, default, classification is a good idea. My second point is that the Unicode property EAW is not usable for that purpose. IMHO, it suffices to observe that U+0391 Α GREEK CAPITAL LETTER ALPHA is "A", while U+0370 Ͱ GREEK CAPITAL LETTER HETA is "N" and U+0531 Ա ARMENIAN CAPITAL LETTER AYB is "N". To me, this says that EAW is about the emulation of JIS systems in Unicode implementations, including handling only the subset of Unicode present in JIS. In 2011, this is no longer interesting.

Of course, we can revisit EAW to make it do what we want now, but it remains that EAW as currently published and thought of is not the right basis.
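
The three EAW values cited in this point can be checked with Python's standard `unicodedata` module. This is a small sketch for illustration; `eaw_of` is a hypothetical helper name, and the values reported come from the Unicode data bundled with the Python in use.

```python
import unicodedata

def eaw_of(code_point: int) -> str:
    """Return the East Asian Width property value for a code point."""
    return unicodedata.east_asian_width(chr(code_point))

# GREEK CAPITAL LETTER ALPHA, GREEK CAPITAL LETTER HETA,
# ARMENIAN CAPITAL LETTER AYB
for cp in (0x0391, 0x0370, 0x0531):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), eaw_of(cp))
```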

---

Third point: I think it is a very bad idea to look at fonts to make the determination of upright or not.  I do believe that the orientation is something authors care about, should be able to count on (including when they leave the determination to the automatic default), and cannot be left at the mercy of user agent font fallback.

Furthermore, there is IMHO no good data in fonts to help you: the presence of anything related to vertical typesetting (vmtx, vorg, 'vert' feature, etc) is not telling you anything about which orientation to use, it's only telling you what to do once you have decided on an orientation. Of course, I would not support something like "advance equal to 1em".

---

Eric.
Received on Wednesday, 23 March 2011 18:07:46 UTC