Re: 明後日3日のお題 (Topics for the meeting on the 3rd, the day after tomorrow) from Nat McCully on 2020-11-01 (public-i18n-japanese@w3.org from October to December 2020)

From: Nat McCully <nmccully@adobe.com>
Date: Sun, 1 Nov 2020 18:31:40 +0000
To: 木田泰夫 <kida@mac.com>, Kobayashi Toshi <binn@k.email.ne.jp>
CC: "public-i18n-japanese@w3.org" <public-i18n-japanese@w3.org>
Message-ID: <MW2PR02MB3659AF517671780FF3E109DFD7130@MW2PR02MB3659.namprd02.prod.outlook.com>
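Virtual mojikumi classes (rather than character-code-based classes only)

Virtual classes are necessary, although there may turn out to be a large number of them. In our implementation we treat tate-chu-yoko (縦中横) as “other Japanese characters” (その他の和字), but JLReq should specify the behavior of tate-chu-yoko separately from kanji and the like. If there were historical cases where it behaved differently from “other Japanese characters”, I would like that information to be documented as well. The same goes for how warichu (割注) brackets differ from ordinary brackets. Today they are in the same class with the same behavior, but if they differed historically, I would like that documented. The same applies to the base text under ruby.
Virtual classes for paragraph start, line start, and line end are essential, because their relationship with brackets is special and house rules differ from one publisher to another.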

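Internationalization

For both UAX14 and UAX50, we should identify where they differ from the traditionally required behavior of Japanese typesetting and discuss a gap analysis. The point I have to explain most often is that “one size fits all” is very difficult for Japanese typesetting. Either a standard cannot be settled on, or, even if one is, something like 40% of users customize it. My impression is that Unicode has not fully come to terms with this point.

My apologies that the rest is in English!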

I want to make sure my poor Japanese is not detracting from what I have to say about JLReq and how it relates to implementations (especially those trying to fit Japanese composition rules into a larger more international context). What I will say below may strike you all as too general to discuss, but I hope we can get into more detail as we cover the topics. It has been difficult for me to come up with enough specifics immediately from the beginning, as I prefer to set a general context and let the details come out in the discussion, but I sense some dissatisfaction in this approach. I will try to spend some time describing specifics I have run into as a way to get there…

JLReq’s critical role, augmenting UAX##

What JLReq needs to do is specify where an engine must accommodate these customizations (where one size does not fit all), and explain why they are important. If there is no explanation as to the importance, these customizations are likely to be dropped, and what we will get is only slightly better than what we have today. Truly professional results will still be very difficult to automate or impossible on the Web.

Internationalization of the JLReq

I think this is a very good direction for the group to focus on: how to relate the JLReq to other international standards and conventions. What I have repeatedly found is that traditional Japanese publishing practice, the way digital fonts evolved, and the way Unicode then tried to simplify and unify the various encodings and glyph designs have together caused incompatibilities that engines have had to deal with in custom ways. There is a reason InDesign’s solution is different even from other Adobe apps, to say nothing of other companies or platforms: InDesign tried to remain faithful to the print publishing industry, reproducing what was done by hand on proprietary equipment before the age of Unicode, SJIS, and digital fonts with a position origin at the left side of the Roman baseline, a height of ascent plus descent, and all 2-byte glyphs having the same width of 1 em. These were technological limitations that we inherited, and InDesign tried internally to separate these issues from the user so the user could still think in terms that were natural to them: the 仮想ボディー (virtual body), the center-based positioning and baseline and gridding, and the fact that 括弧類 (brackets) are considered 半角 (half-width) when set flush.

Japanese line breaking vs Unicode line breaking for all

So, one next step is to analyze UAX14 line breaking and determine how important it is for the user to customize the kinsoku set beyond what is stated there, and whether it is important for the line compression to be related to the presence of kinsoku at line end (that is, how a UAX14-based engine relates line layout to its classifications to support 追い込み処理, etc.). How would that change if UAX14 were used? Could an engine use UAX14 as its default in all cases? Also, how are the characters marked CJ in UAX14 related to the JIS mojikumi class of 行頭禁則和字? Should they be tied together tightly? I realize this class may be unique to JIS X 4051 and InDesign and not part of JLReq…
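As a rough illustration of what “customizing the kinsoku set” means for an engine, here is a minimal Python sketch. It is not UAX14 and not any shipping implementation: the character lists, the names, and the function are all hypothetical, and a real engine would start from UAX14 line-breaking classes rather than hand-written sets. The “strict” extras are small kana and the prolonged sound mark, which is the kind of character UAX14 marks as CJ.

```python
# Hypothetical default kinsoku sets, chosen only for illustration.
DEFAULT_NO_LINE_START = frozenset("、。，．）」』】")
DEFAULT_NO_LINE_END = frozenset("（「『【")

# Characters a house rule may or may not forbid at line start
# (small kana and the prolonged sound mark).
STRICT_EXTRA_NO_LINE_START = frozenset("ぁぃぅぇぉっゃゅょァィゥェォッャュョー")


def break_opportunities(text,
                        no_line_start=DEFAULT_NO_LINE_START,
                        no_line_end=DEFAULT_NO_LINE_END):
    """Return every index i where a line may break between text[i-1] and text[i]."""
    allowed = []
    for i in range(1, len(text)):
        before, after = text[i - 1], text[i]
        if after in no_line_start:
            continue  # `after` must not begin a line
        if before in no_line_end:
            continue  # `before` must not end a line
        allowed.append(i)
    return allowed


sample = "「きょう」は"
loose = break_opportunities(sample)
strict = break_opportunities(
    sample, no_line_start=DEFAULT_NO_LINE_START | STRICT_EXTRA_NO_LINE_START
)
```

With these assumed sets, the loose house rule allows a break before the small ょ and the strict one does not. Nothing else in the engine changes, which is why the kinsoku set is a natural place to expose customization.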

Vertical text processing, a tangled web

Another area of incompatibility is UAX50 and how SJIS vertical text glyphs were then unified in a lossy manner to Unicode. We uncovered several areas where conventions with current fonts and the ‘vert’ OpenType feature were incompatible with UAX50, so the current state is that engines cannot simply exchange their internal vertical posture table with one using UAX50 as-is. Yamamoto Taro and I spoke with Murata-san about this issue and we need to make some kind of definitive statement about how to move forward. This is another area needing a comprehensive gap analysis, and I can start with work that Yamamoto Taro did. I will send more info about this.

As for handling the necessary change of class (and behavior) when a particular codepoint is set horizontally or vertically (an example is smart quotes becoming something from SJIS in vertical), I think that while a lot of this is getting into implementation detail, it will be good for the JLReq to be specific about how the situation could be improved with more information in the data passed to engines, giving the author more control over the outcome. Whether this means creating virtual mojikumi classes, creating another OTF feature, or some other mechanism, we could find out by discussing it in the context of the requirements and how to implement them.

Incompatibility with Fonts

Fonts are not standardized in how they implement the U+2xxx range, and so engines have to do different things for mojikumi class, mojikumi (JIS) tsume, vertical posture, and line-breaking as a result. Japanese fonts are mostly consistent with each other (a counter-example is Hiragino’s move from SJIS-style smart quotes to Western-style ones), but Korean fonts are much worse (e.g. the U+20A9 ₩ and U+FFE6 ￦ glyphs being inconsistently encoded or entered), and Korean uses a mix of Western-style punctuation with Korean characters that can be set 全角 or ツメ組み or proportionally.

Anyway, when we meet please let me know if any of this was useful, and how I can help with detail of how problems arose and were solved in our implementations (or not yet solved perfectly).

--Nat


From: 木田泰夫 <kida@mac.com>
Date: Sunday, November 1, 2020 at 2:28 AM
To: Kobayashi Toshi <binn@k.email.ne.jp>
Cc: public-i18n-japanese@w3.org <public-i18n-japanese@w3.org>
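Subject: Re: 明後日3日のお題 (Topics for the meeting on the 3rd, the day after tomorrow)
Agreed. Then how about first discussing the items in Eric's proposal that relate to the context-dependent classes we talked about last time?

1. What to do about context-dependent classes
The aim of the previous discussion was to eliminate context-dependent classes so that character classes become pure properties of characters. The solution we discussed was to handle such cases in the explanatory text.

Eric proposes a different solution, and it strikes me as an excellent one. His proposal is to introduce virtual classes such as tate-chu-yoko (縦中横).

Even the current JLReq uses virtual classes such as “line head” and “line end”, for example in Appendix D, Table 3. In addition, five of the nine context-dependent character classes, cl-20 to cl-23 and cl-30, are described as “any character can be used as an xyz character”, so they are virtual classes in effect.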


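The advantage of this approach is that engineers implementing it do not have to read and understand the description of each feature: given an engine that interprets a data table, the behavior can be executed easily and quickly. When something is written only in the explanatory text, the implementer has to understand it and write each exception by hand. This is exactly why Eric also proposes putting each inseparable character (分離禁止文字) in its own class. Even if the data table grows somewhat, the whole becomes simpler if there is less exception handling.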
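The table-driven idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the spacing values, and the function names are invented here and are not JLReq data. The point is that “line head” and “line end” become ordinary rows and columns of the table, so the engine needs no special-case code for them.

```python
# All class names and spacing values below are illustrative assumptions.
LINE_HEAD = "line-head"   # virtual class: no character belongs to it
LINE_END = "line-end"     # virtual class: no character belongs to it
OPENING = "opening-bracket"
CLOSING = "closing-bracket"
OTHER = "other"

CHAR_CLASS = {"（": OPENING, "「": OPENING, "）": CLOSING, "」": CLOSING}

# Space (in em) inserted between a (left class, right class) pair.
# Pairs that are not listed get no extra space.
SPACING = {
    (OTHER, OPENING): 0.5,
    (CLOSING, OTHER): 0.5,
    (CLOSING, OPENING): 0.5,
    (LINE_HEAD, OPENING): 0.0,  # a house rule can change this one cell
    (CLOSING, LINE_END): 0.0,
}


def classify(ch):
    """Map a character to its class; anything unlisted is OTHER."""
    return CHAR_CLASS.get(ch, OTHER)


def line_spacing(line, table=SPACING):
    """Return the space before each character, plus the space at line end."""
    classes = [LINE_HEAD] + [classify(ch) for ch in line] + [LINE_END]
    return [table.get(pair, 0.0) for pair in zip(classes, classes[1:])]


# A house rule that indents an opening bracket at line head by half an em
# is a data change, not a code change.
house_rule = dict(SPACING)
house_rule[(LINE_HEAD, OPENING)] = 0.5
```

Calling `line_spacing("「あ", house_rule)` instead of `line_spacing("「あ")` changes only the first value in the result, and the engine code stays the same.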

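There may be another advantage for us. A piece of feedback we have often received is: “JLReq is so complex that understanding all of it and implementing every feature is simply not feasible. Which parts actually matter?” That is why I have argued for simplifying Japanese typesetting. I still think simplification is important, but if we can show that things can be handled uniformly through data tables, the pressure on that front should ease. In other words, we can aim higher.

So how about rereading the results of the previous meeting and thinking about how they could be turned into data tables? The context-dependent character classes that are not of the “any character…” kind are the three below. Beyond those, we should also identify any cases, like the inseparable characters, where the behavior is described in prose but not represented in a table (ruby, perhaps?).


・Characters in unit symbols (cl-25)
・Warichu opening brackets (cl-28)
・Warichu closing brackets (cl-29)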


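2. Separating line breaking
Another important proposal from Eric is to build the classes with inter-character spacing alone in mind; that is, line breaking is considered separately. His reasoning is that line breaking based on Unicode's UAX14 / CLDR already works well, so why not use it.

JLReq has to state “how things should be”, so we cannot simply say that the Unicode behavior is good enough. Still, separating the classes for line breaking from the classes for spacing may in itself be reasonable for JLReq too, especially if the classes needed for line breaking differ from those needed for spacing.

Also, as a gap analysis (!) task, we need to examine the gaps between line breaking in Unicode and JLReq.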
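To make the separation concrete, here is a small Python sketch in which each character carries two independent properties, one consulted only by the line breaker and one consulted only by the spacing logic. The property values are hypothetical and hand-picked for illustration; they are not taken from UAX14, CLDR, or JLReq.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharProps:
    line_break: str  # consulted only by the line breaker
    spacing: str     # consulted only by the spacing logic


# Hypothetical values, for illustration only.
PROPS = {
    "、": CharProps(line_break="no-line-start", spacing="comma"),
    "ー": CharProps(line_break="no-line-start", spacing="kana"),
    "あ": CharProps(line_break="ordinary", spacing="kana"),
}


def group_by(attr):
    """Group the sample characters by one of the two properties."""
    groups = {}
    for ch, props in PROPS.items():
        groups.setdefault(getattr(props, attr), []).append(ch)
    return groups


by_line_break = group_by("line_break")
by_spacing = group_by("spacing")
```

In this toy data the comma and the prolonged sound mark fall together for line breaking but apart for spacing. If real data shows the same pattern, a single combined class set has to be finer than either purpose needs, which is the argument for two separate tables.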

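3. Classes that differ between vertical and horizontal writing
Eric changes the class of some characters depending on whether the text is vertical or horizontal. According to Eric's GitHub comment, characters such as U+0041 A and U+00C6 Æ would, for example, behave as Western characters (cl-27) in horizontal writing or when set sideways in vertical writing, and behave like kanji (cl-19) when set upright in vertical writing. If time permits, let's discuss this point too.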
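The rule described for U+0041 and U+00C6 can be written as a small resolver. This is a sketch of the behavior as summarized above, not Eric's actual data or code. The function name is hypothetical, and testing the Unicode character name for a `LATIN` prefix is a crude stand-in for whatever property a real proposal would use.

```python
import unicodedata


def resolved_class(ch, vertical, upright=False):
    """Resolve the class of a Western letter from its setting context.

    Returns "cl-19" (behaves like kanji) when set upright in vertical text,
    "cl-27" (Western characters) otherwise, and None for characters this
    sketch does not cover.
    """
    # Assumption: "is a Latin letter" approximates "Western character" here.
    if not unicodedata.name(ch, "").startswith("LATIN"):
        return None
    if vertical and upright:
        return "cl-19"
    return "cl-27"
```

The same code point resolves to different classes depending on `vertical` and `upright`, so the class is no longer a pure property of the character. That is the design question this agenda item raises.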


Everyone, are there other important related points? I may well have overlooked something, so please let me know.

木田

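To summarize the key points of Eric's proposal as I understand them:

・Line breaking is delegated to UAX14 / CLDR. This is a natural choice when moving to Unicode. We need to check where UAX14 / CLDR and JLReq differ.

・As a result, character classes only need to deal with inter-character spacing. It would probably be better to rename them “spacing classes”. He then proposes making them a Unicode property. This also came up in our own discussions, and it is an important point.

・A proposal for virtual classes such as paragraph start and tate-chu-yoko (縦中横). I think this allows uniform handling, including the context-dependent character classes we discussed last week. I keep asking myself why I did not think of this!

・The existence of vertical-only and horizontal-only classes. I do not yet fully understand the concrete benefit, but I list it because it may be an important point.

・A proposal to handle glue (糊) explicitly, that is, the default inter-character spacing and the manner in which it is adjusted.

・Also, a proposal to allow the spacing class to be changed explicitly through markup.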




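On 2020/11/01 11:48, Kobayashi Toshi <binn@k.email.ne.jp> wrote:

Dear 木田泰夫,

This is 小林 敏 (Kobayashi Toshi).

Eric's proposal raises questions of principle, so how about starting from
the email in which 木田-san summarized it and first picking out the issues
that arise from it (settling the substance on the spot where that is easy),
while leaving the difficult problems open as problems? Then we could move on
to the Western-text issues. Those are wide-ranging, though, so perhaps we
should begin with how to think about the underlying full-width (全角) and
half-width (半角) distinction?

木田泰夫-san wrote:

Everyone,
The topics around moving JLReq to Unicode (or, more generally,
internationalization) are growing richer, with the various ideas in Eric's
proposal, the Western-text issues, and so on. Which should we take up first
at the meeting the day after tomorrow, the 3rd?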
木田
Received on Sunday, 1 November 2020 18:31:57 UTC