Re: [csswg-drafts] [css-text-3] Segment Break Transformation Rules around CJK Punctuation (#5086)

I agree web compatibility is important. However, perfect web compatibility is impossible unless we give up space-discarding rules altogether. For example, discarding spaces between two Katakana/Hiragana/Kanji letters is not always safe, because some Japanese text uses a space (U+0020) inside Katakana compound words (e.g., "エンド ユーザー" or "クイック スタート" in [Microsoft's Japanese Documents](https://docs.microsoft.com/ja-jp/)), and some Japanese text is written in the [わかち書き](https://ja.wikipedia.org/wiki/%E3%82%8F%E3%81%8B%E3%81%A1%E6%9B%B8%E3%81%8D) (wakachigaki) style, with spaces between words. But those are relatively exceptional cases, and we can expect that most Japanese text authors will not put line breaks where spaces are important.

We need to find the best balance between improvement and compatibility.

Thank you @kojiishi for rethinking. Yes, ideographic/fullwidth commas and full stops [、。,.] cover most cases. However, many people will complain about it: why do line breaks after fullwidth colons, semicolons, exclamation marks and question marks [:;!?] cause extra spaces? Those characters are listed in the same [Pause or Stop Punctuation Marks](https://w3c.github.io/clreq/#h-pause-or-stop-punctuation-marks) category in CLReq. It also does not cover the cases I gave as examples:

```
日本語のテキストにEnglish text
(英語のテキスト)
を埋め込む。
↓
日本語のテキストにEnglish text(英語のテキスト)を埋め込む。
(In this example, fullwidth parentheses are used)
```

```
日本語のテキスト!　
English textを埋め込む。
↓
日本語のテキスト!　English textを埋め込む。
(In this example, there is an ideographic space U+3000 '　' after the '!')
```

I don't think these space-discarding cases cause web compatibility problems.

So I still believe that the rule I proposed has the best balance between improvement and web compatibility.

I understand that the current draft's rule, which requires **both** sides to belong to the space-discarding character set in order not to insert a space, exists for web compatibility, but this rule alone cannot meet the requirements of semantic line breaks. So I proposed an additional rule that requires only **either** side to belong to the *strong space-discarding character set* (= a subset of the space-discarding character set limited to Unicode Punctuation and Space Separator). I think it is easy to understand why a *strong space-discarding character* on just one side is enough to discard the space: these characters are natural or semantic break points in CJK text.
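
To make the combination of the two rules concrete, here is a rough Python sketch. It is only an illustration, not spec text. The function names are hypothetical, and `is_space_discarding` is a stand-in that approximates the space-discarding set by East Asian Width (Fullwidth, Wide, Halfwidth, excluding Hangul); that approximation is my assumption for the sketch, not the draft's definition. The strong subset follows the description above: space-discarding characters whose Unicode General Category is Punctuation (P*) or Space Separator (Zs).

```python
import unicodedata


def is_space_discarding(ch: str) -> bool:
    # ASSUMPTION: approximate the space-discarding set by East Asian Width
    # F/W/H, excluding Hangul. This is a stand-in, not the draft's wording.
    if unicodedata.east_asian_width(ch) not in ("F", "W", "H"):
        return False
    return "HANGUL" not in unicodedata.name(ch, "")


def is_strong_space_discarding(ch: str) -> bool:
    # Subset limited to Unicode Punctuation (P*) and Space Separator (Zs).
    category = unicodedata.category(ch)
    return is_space_discarding(ch) and (category.startswith("P") or category == "Zs")


def join_lines(before: str, after: str) -> str:
    """Join two lines separated by one segment break (hypothetical helper)."""
    if not before or not after:
        return before + after
    a, b = before[-1], after[0]
    # Current draft: discard only when BOTH sides are space-discarding.
    both = is_space_discarding(a) and is_space_discarding(b)
    # Proposed addition: discard when EITHER side is strong space-discarding.
    either_strong = is_strong_space_discarding(a) or is_strong_space_discarding(b)
    if both or either_strong:
        return before + after
    # Otherwise the segment break becomes a space.
    return before + " " + after


if __name__ == "__main__":
    print(join_lines("日本語のテキストにEnglish text", "\uff08英語のテキスト\uff09"))
    print(join_lines("日本語のテキスト\uff01\u3000", "English textを埋め込む。"))
    print(join_lines("日本語", "English"))
```

Under these assumptions, the first two calls join without a space (a fullwidth parenthesis or an ideographic space on one side is enough), while the last call still inserts a space because neither side is a strong space-discarding character.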

I think this rule is easier to understand/remember/predict than limiting the set to ideographic/fullwidth commas and full stops only. It is also easy to understand/remember that ambiguous punctuation marks are not included in the space-discarding character set, because such marks (e.g. left and right quotation marks, em-dash, ellipsis) can also be used in non-CJK text.


-- 
GitHub Notification of comment by MurakamiShinyu
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/5086#issuecomment-634578395 using your GitHub account

Received on Wednesday, 27 May 2020 10:42:26 UTC