Re: Topic for the 3rd (the day after tomorrow) from 木田泰夫 on 2020-11-02 (public-i18n-japanese@w3.org from October to December 2020)

From: 木田泰夫 <kida@mac.com>
Date: Mon, 2 Nov 2020 14:43:08 +0900
To: Fuqiao Xue <xfq@w3.org>
Cc: Nat McCully <nmccully@adobe.com>, Kobayashi Toshi <binn@k.email.ne.jp>, public-i18n-japanese@w3.org
Message-Id: <5F719CFD-B438-450B-91BA-5F265CF6FEA3@mac.com>
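Setsu-san, thank you.

What we need to look at this time is the gap between the Unicode approach and the JLReq approach, so I think point 1 is not relevant at this stage. Point 2, however, is an important one. Thank you for the information. Do you know of a good way to enumerate the cases where implementations apply this kind of exception handling? Implementations that do not use ICU are not affected by ICU-specific exception handling, so it may be a good idea to limit the investigation to UAX14 / CLDR first.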
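
One way to start such a gap list, limited to UAX14, could be sketched as follows. This is only an illustration, not an agreed method. The line break classes are a tiny hand-copied sample of the Unicode LineBreak.txt data (a real comparison would load the full file), JLREQ_LINE_START_PROHIBITED is a hypothetical stand-in for a JLReq character class rather than the actual list, and NO_BREAK_BEFORE simplifies the UAX14 rules to "no break directly before these classes".

```python
# Sketch: list characters that a JLReq-style rule forbids at line start
# but that a UAX14-based engine would still allow a break before.

# Tiny hand-copied sample of UAX14 line break classes (assumption: a real
# analysis would parse the full LineBreak.txt from the Unicode Character Database).
UAX14_CLASS = {
    "\u3001": "CL",  # ideographic comma
    "\u3002": "CL",  # ideographic full stop
    "\u300D": "CL",  # right corner bracket
    "\u3005": "NS",  # ideographic iteration mark
    "\u309D": "NS",  # hiragana iteration mark
    "\u30FC": "CJ",  # katakana-hiragana prolonged sound mark
    "\u3063": "CJ",  # hiragana small tsu
    "\u3042": "ID",  # hiragana a (ordinary character)
}

# Hypothetical stand-in for a JLReq "not at line start" set (illustration only).
JLREQ_LINE_START_PROHIBITED = {
    "\u3001", "\u3002", "\u300D", "\u3005", "\u309D", "\u30FC", "\u3063",
}

# Simplified: classes UAX14 does not break directly before (not the full rule set).
NO_BREAK_BEFORE = {"CL", "CP", "EX", "NS"}


def resolve(cls, cj_as="NS"):
    """Resolve the conditional class CJ; UAX14 resolves it to NS by default."""
    return cj_as if cls == "CJ" else cls


def gap(prohibited, classes, cj_as="NS"):
    """Return prohibited characters that the sampled classes would still break before.

    Characters missing from the sample are treated as class XX and reported.
    """
    return sorted(
        ch for ch in prohibited
        if resolve(classes.get(ch, "XX"), cj_as) not in NO_BREAK_BEFORE
    )
```

With this sample, gap(JLREQ_LINE_START_PROHIBITED, UAX14_CLASS) is empty when CJ is resolved to NS, while passing cj_as="ID" (the tailoring that, as far as I understand, CSS line-break: normal uses) reports the small kana and the prolonged sound mark. That is the same publisher-style variation raised in point 1 of the quoted mail.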

Kida

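> On 2020/11/02 14:01, Fuqiao Xue <xfq@w3.org> wrote:
> 
> Nat-san, thank you very much!
> 
> I can think of a few points to keep in mind when investigating the gap:
> 
> 1. Kinsoku (line-breaking prohibition) rules can differ between publishers (for example, whether iteration marks and the prolonged sound mark are prohibited at the start of a line; Chinese also has several styles [1]), so we need to consider which style to choose for the comparison (or whether to compare several styles).
> 
> 2. Some kinsoku rules are not in UAX #14 but in implementations (such as ICU) [2]. This part also needs to be checked.
> 
> Best regards,
> 
> Setsu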
> 
> [1] https://w3c.github.io/clreq/#prohibition_rules_for_line_start_end
> [2] Example: https://github.com/unicode-org/icu/pull/223
> 
>> On 2020-11-02 10:37, Nat McCully wrote:
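>> Kida-san, thank you very much. I understand what you are saying very well. The battle over simplicity has been going on for years...
>> And thank you for writing such valuable advice.
>> Understood about the UAX analysis. I will try to put something together.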
>> —Nat
>> -------------------------
>> From: 木田泰夫 <kida@mac.com>
>> Sent: Sunday, November 1, 2020 6:06:37 PM
>> To: Nat McCully <nmccully@adobe.com>
>> Cc: Kobayashi Toshi <binn@k.email.ne.jp>; public-i18n-japanese@w3.org
>> <public-i18n-japanese@w3.org>
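>> Subject: Re: Topic for the 3rd (the day after tomorrow)
>> Nat,
>> Thank you (that was long! Although my response has ended up long too).
>> The approach Eric laid out concretely is scalable, so I think it is well suited to building the refined, or customizable, implementations that Nat describes. Once the rules are turned into data, implementers do not need to understand what is inside, so more complex rules can be slipped in quietly :)
>> Making the layout of the best professional applications (InDesign, for example) even better is important, and so is raising the baseline quality of Japanese text in the applications ordinary people use every day (mail, or inside GitHub, for example). I think we need to keep both in mind.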
>>> I prefer to set a general context and let the details come out in
>>> the discussion, but I sense some dissatisfaction in this approach. I
>>> will try to spend some time describing specifics I have run into as
>>> a way to get there
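>> This is the "I know that would be better, but..." problem. Sharing a broad direction is very important; for example, Nat's point that space matters is very meaningful. It gives us a direction to move in. But to go beyond that, someone has to turn it into concrete rules and then work out how to implement them. That is exactly an engineer's job, so I am expecting a lot from Nat.
>> Ideally JLReq would only need to describe the rules and would not have to think about implementation methods, but given JLReq's goal of improving digital Japanese typesetting, we cannot leave it at that. In the history of technology, the decisive driving force has always been whether a good way to realize something exists. In other words, "there is a way to do it" → "do it" is a stronger force than "it is desirable" → "do it". Simplicity is an essential condition for that implementability, because complexity translates directly into implementation cost for engineers and immediately into lower priority. In the past, as long as something could be done at all, the slower pace and lower labor costs forgave the time it took; this is where the importance of simplicity has changed.
>> That is why I ask for simplicity in JLReq. This is not about whether the whole is simple, but about whether the knowledge needed for implementation can be understood simply and whether the implementation effort is simple. Until now I had intended to achieve that by making the whole simple. Seeing Eric's proposal, however, changed my thinking a little.
>> Unicode, for example, is a very complex system, but by providing good accompanying data tables in a form that does not depend on any particular implementation, it makes itself simple to adopt. I think JLReq can take a similar approach. That way it can propose more advanced typesetting without exposing the complexity to the outside. In that respect character classes and data tables are important, and I see great significance in Eric's proposal.
>> So, a request: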
>>> Japanese line breaking vs Unicode line breaking for all
>>> Vertical text processing, a tangled web
>> Nat, could I ask you to do the analysis of the gap between the content of UAX14 and UAX50 and JLReq?
>> Kida
>>> On 2020/11/02 3:32, Nat McCully <nmccully@adobe.com> wrote:
>>> 
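>>> Virtual mojikumi classes (rather than character-code-based classes only)
>>> Virtual classes are necessary. However, there may be a large number of them. We treat tate-chu-yoko (縦中横) as "other Japanese characters" (その他の和字), but JLReq should specify the behavior of tate-chu-yoko separately from kanji and the like. If there were historical cases where it differed from "other Japanese characters", I would like that information to be described as well. The same goes for how warichu (割注) brackets differ from ordinary brackets. Today they are the same class with the same behavior, but if they differed historically, I would like that to be described. The same applies to the base text under ruby.
>>> Virtual classes for paragraph start, line start, and line end are essential. The reason is that their relationship with brackets is special, and house rules differ.
>>> Internationalization
>>> For both UAX14 and UAX50, we should identify where they differ from the traditionally required behavior of Japanese typesetting and discuss a gap analysis. The point I have to explain over and over is that one size fits all is very difficult in Japanese typesetting. Either a standard cannot be settled, or even if one is settled, 40% of users customize it, and so on. My impression is that Unicode does not really accept this point.
>>> Apologies that the rest is in English!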
>>> I want to make sure my poor Japanese is not detracting from what I
>>> have to say about JLReq and how it relates to implementations
>>> (especially those trying to fit Japanese composition rules into a
>>> larger more international context). What I will say below may strike
>>> you all as too general to discuss, but I hope we can get into more
>>> detail as we cover the topics. It has been difficult for me to come
>>> up with enough specifics immediately from the beginning, as I prefer
>>> to set a general context and let the details come out in the
>>> discussion, but I sense some dissatisfaction in this approach. I
>>> will try to spend some time describing specifics I have run into as
>>> a way to get there…
>>> JLReq’s critical role, augmenting UAX##
>>> What JLReq needs to do is specify where an engine must accommodate
>>> these customizations (where one size does not fit all), and explain
>>> why they are important. If there is no explanation as to the
>>> importance, these customizations are likely to be dropped, and what
>>> we will get is only slightly better than what we have today. Truly
>>> professional results will still be very difficult to automate or
>>> impossible on the Web.
>>> Internationalization of the JLReq
>>> I think this is a very good direction for the group to be focusing
>>> on – how to relate the JLReq to other international standards and
>>> conventions. What I have always discovered is, traditional Japanese
>>> publishing and the way digital fonts have evolved, and then how
>>> Unicode tried to simplify and unify the various encodings and glyph
>>> designs, have caused incompatibilities that then engines have had to
>>> deal with in custom ways. There is a reason InDesign’s solution is
>>> different even from other Adobe apps, to say nothing about other
>>> companies or platforms – InDesign tried to remain faithful to the
>>> print publishing industry, trying to reproduce what was done by hand
>>> on proprietary equipment before the age of Unicode and SJIS and
>>> digital fonts with a position origin at the left side Roman baseline
>>> and a height of ascent plus descent. And, all 2b glyphs being the
>>> same width of 1 em. These were technological limitations that we
>>> inherited, and InDesign tried internally to separate these issues
>>> from the user so the user could still think in terms that were
>>> natural to them – the 仮想ボディー (virtual body), the center-based
>>> positioning and baseline and gridding, the fact that 括弧類 (brackets) are
>>> considered 半角 (half-width) when set flush.
>>> Japanese line breaking vs Unicode line breaking for all
>>> So, one next step is to analyze UAX14 line breaking and determine
>>> how important it is for the user to customize the kinsoku set beyond
>>> what is stated there, and whether it is important for the line
>>> compression to be related to the presence of kinsoku at line end
>>> (how a UAX14-based engine relates line layout to its classifications
>>> to support 追い込み処理 (oikomi, fitting a line by compression) etc). How would that change if UAX14
>>> was used? Could an engine use UAX14 as its default in all cases?
>>> Also, how are the characters marked CJ in UAX14 related to the JIS
>>> mojikumi class of 行頭禁則和字 (line-start prohibited Japanese characters)? Should they be tied together
>>> tightly? I realize this class may be unique to JIS X 4051 and
>>> InDesign and not part of JLReq…
>>> Vertical text processing, a tangled web
>>> Another area of incompatibility is UAX50 and how SJIS vertical text
>>> glyphs were then unified in a lossy manner to Unicode. We uncovered
>>> several areas where conventions with current fonts and the
>>> ‘vert’ OpenType feature were incompatible with UAX50, so the
>>> current state is that engines cannot simply exchange their internal
>>> vertical posture table with one using UAX50 as-is. Yamamoto Taro and
>>> I spoke with Murata-san about this issue and we need to make some
>>> kind of definitive statement about how to move forward. This is
>>> another area needing a comprehensive gap analysis, and I can start
>>> with work that Yamamoto Taro did. I will send more info about this.
>>> As for handling the necessary change of class (and behavior) when a
>>> particular codepoint is set horizontally or vertically (an example
>>> is smart quotes becoming something from SJIS in vertical), I think
>>> while a lot of this is getting into implementation detail, it will be
>>> good for the JLReq to be specific about how the situation could be
>>> improved with more information in the data passed to engines, that
>>> gives the author more control over the outcome. If this is creating
>>> virtual mojikumi classes, or creating another OTF feature, or some
>>> other mechanism, we could find out by discussing it in the context
>>> of the requirements and how to implement them.
>>> Incompatibility with Fonts
>>> Fonts are not standardized in how they implement the U+2xxx range,
>>> and so engines have to do different things for mojikumi class,
>>> mojikumi (JIS) tsume, vertical posture, and line-breaking as a
>>> result. Japanese fonts are mostly consistent with each other (a
>>> counter-example is Hiragino’s move from SJIS style smart quotes to
>>> Western-style ones), but Korean fonts are much worse (e.g. the
>>> U+20A9 ₩ and U+FFE6 ￦ glyphs being inconsistently encoded or
>>> entered), and Korean uses a mix of Western-style punctuation with
>>> Korean characters that can be set 全角 (full-width) or ツメ組み (tsume-gumi, tight setting) or
>>> proportionally.
>>> Anyway, when we meet please let me know if any of this was useful,
>>> and how I can help with detail of how problems arose and were solved
>>> in our implementations (or not yet solved perfectly).
>>> --Nat
Received on Monday, 2 November 2020 05:43:25 UTC