Re: [csswg-drafts] [css-text-3] Segment Break Transformation Rules for East Asian Width property of A (#337)

For what it's worth, we've just implemented this as currently specified, and it makes a real mess of some tests, e.g. [CSS2/generated-content/content-counter-004-ref.xht](https://github.com/web-platform-tests/wpt/blob/master/css/CSS2/generated-content/content-counter-004-ref.xht): the spaces between the U+25FE (black medium small square) characters are removed, because their EAW property is "W".
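
To make the failure concrete, here is a minimal Python sketch of the EAW-based rule as I read it (remove the break when both neighbours are Fullwidth, Wide or Halfwidth and neither is Hangul). The function names are made up for illustration, and the Hangul check is only an approximation based on character names, since Python's `unicodedata` does not expose the Script property.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """True if the East Asian Width of ch is Fullwidth, Wide or Halfwidth."""
    return unicodedata.east_asian_width(ch) in ("F", "W", "H")


def is_hangul(ch: str) -> bool:
    # Assumption: the character name stands in for "script is Hangul".
    # This misses e.g. the halfwidth Hangul letters, so it is only a sketch.
    return unicodedata.name(ch, "").startswith("HANGUL")


def eaw_rule_removes_break(before: str, after: str) -> bool:
    """Simplified reading of the EAW-based segment break rule."""
    if is_hangul(before) or is_hangul(after):
        return False
    return is_wide(before) and is_wide(after)


# The result for U+25FE depends on the Unicode data version Python ships with.
pairs = [("\u25fe", "\u25fe"), ("\u6f22", "\u5b57"), ("\ud55c", "\uae00"), ("a", "b")]
for before, after in pairs:
    print(f"U+{ord(before):04X} U+{ord(after):04X}", eaw_rule_removes_break(before, after))
```

With Unicode data that classifies U+25FE as "W", the first pair is treated exactly like the pair of ideographs, which is the behaviour that breaks the test above.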

I think basing the decision on Unicode Block rather than EAW property is certainly the way to go. In an effort to roll this forward I've had a hunt through the Unicode blocks and come up with a list that could be used as a starting point:

<a href="https://en.wikipedia.org/wiki/CJK_Radicals_Supplement_(Unicode_block)">CJK Radicals Supplement</a>
<a href="https://en.wikipedia.org/wiki/Kangxi_Radicals#Unicode">Kangxi Radicals</a>
<a href="https://en.wikipedia.org/wiki/Ideographic_Description_Characters_(Unicode_block)">Ideographic Description Characters</a>
<a href="https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation">CJK Symbols and Punctuation</a>
<a href="https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)">Hiragana</a>
<a href="https://en.wikipedia.org/wiki/Katakana_(Unicode_block)">Katakana</a>
<a href="https://en.wikipedia.org/wiki/Bopomofo_(Unicode_block)">Bopomofo</a>
<a href="https://en.wikipedia.org/wiki/Kanbun_(Unicode_block)">Kanbun</a>
<a href="https://en.wikipedia.org/wiki/Bopomofo_Extended">Bopomofo Extended</a>
<a href="https://en.wikipedia.org/wiki/CJK_Strokes_(Unicode_block)">CJK Strokes</a>
<a href="https://en.wikipedia.org/wiki/Katakana_Phonetic_Extensions">Katakana Phonetic Extensions</a>
<a href="https://en.wikipedia.org/wiki/Enclosed_CJK_Letters_and_Months">Enclosed CJK Letters and Months</a>
<a href="https://en.wikipedia.org/wiki/CJK_Compatibility">CJK Compatibility</a>
<a href="https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_A">CJK Unified Ideographs Extension A</a>
<a href="https://en.wikipedia.org/wiki/Yijing_Hexagram_Symbols_(Unicode_block)">Yijing Hexagram Symbols</a>
<a href="https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)">CJK Unified Ideographs</a>
<a href="https://en.wikipedia.org/wiki/Yi_Syllables_(Unicode_block)">Yi Syllables</a>
<a href="https://en.wikipedia.org/wiki/Yi_Radicals_(Unicode_block)">Yi Radicals</a>
<a href="https://en.wikipedia.org/wiki/CJK_Compatibility_Ideographs">CJK Compatibility Ideographs</a>
<a href="https://en.wikipedia.org/wiki/Vertical_Forms">Vertical Forms</a>
<a href="https://en.wikipedia.org/wiki/CJK_Compatibility_Forms">CJK Compatibility Forms</a>
<a href="https://en.wikipedia.org/wiki/Small_Form_Variants_(Unicode_block)">Small Form Variants</a>
<a href="https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)">Halfwidth and Fullwidth Forms</a>
<a href="https://en.wikipedia.org/wiki/Kana_Supplement">Kana Supplement</a>
<a href="https://en.wikipedia.org/wiki/Kana_Extended-A">Kana Extended-A</a>
<a href="https://en.wikipedia.org/wiki/Small_Kana_Extension">Small Kana Extension</a>
<a href="https://en.wikipedia.org/wiki/Taixuanjing">Tai Xuan Jing Symbols</a>
<a href="https://en.wikipedia.org/wiki/Counting_Rod_Numerals_(Unicode_block)">Counting Rod Numerals</a>
<a href="https://en.wikipedia.org/wiki/Enclosed_Ideographic_Supplement">Enclosed Ideographic Supplement</a>
<a href="https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_B">CJK Unified Ideographs Extension B</a>
<a href="https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_C">CJK Unified Ideographs Extension C</a>
<a href="https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_D">CJK Unified Ideographs Extension D</a>
<a href="https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_E">CJK Unified Ideographs Extension E</a>
<a href="https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_F">CJK Unified Ideographs Extension F</a>
<a href="https://en.wikipedia.org/wiki/CJK_Compatibility_Ideographs_Supplement">CJK Compatibility Ideographs Supplement</a>

However, this process leads me to think that we're still going to have to distinguish based on Script as well, because some of these blocks will be used with Hangul:

<a href="https://en.wikipedia.org/wiki/CJK_Symbols_and_Punctuation">CJK Symbols and Punctuation</a>
<a href="https://en.wikipedia.org/wiki/Enclosed_CJK_Letters_and_Months">Enclosed CJK Letters and Months</a>
<a href="https://en.wikipedia.org/wiki/Small_Form_Variants_(Unicode_block)">Small Form Variants</a>
<a href="https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)">Halfwidth and Fullwidth Forms</a>
<a href="https://en.wikipedia.org/wiki/Vertical_Forms">Vertical Forms</a>

and some, e.g.:

<a href="https://en.wikipedia.org/wiki/Yijing_Hexagram_Symbols_(Unicode_block)">Yijing Hexagram Symbols</a>
<a href="https://en.wikipedia.org/wiki/Taixuanjing">Tai Xuan Jing Symbols</a>
<a href="https://en.wikipedia.org/wiki/Counting_Rod_Numerals_(Unicode_block)">Counting Rod Numerals</a>

are likely to be used in any script.

Not doing any segment break transformation _unless_ the script is Chinese, Japanese or Yi is going to limit the impact of any side effects.
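
One possible reading of that script gate, sketched in Python. Everything here is an assumption for illustration: "Chinese, Japanese or Yi" is mapped onto Han, Hiragana, Katakana and Yi characters, and because Python's `unicodedata` has no Script property the check falls back on character name prefixes, which is only a crude approximation of real script data.

```python
import unicodedata

# Assumption: name prefixes standing in for the Han, Hiragana, Katakana and
# Yi scripts. A real implementation would use the Unicode Script property.
CJ_YI_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH",
    "CJK COMPATIBILITY IDEOGRAPH",
    "HIRAGANA",
    "KATAKANA",
    "YI SYLLABLE",
    "YI RADICAL",
)


def looks_chinese_japanese_or_yi(ch: str) -> bool:
    """Rough, name-based guess at whether ch belongs to one of the gated scripts."""
    return unicodedata.name(ch, "").startswith(CJ_YI_NAME_PREFIXES)


def script_allows_transformation(before: str, after: str) -> bool:
    """Only consider segment break transformation when both sides pass the gate."""
    return looks_chinese_japanese_or_yi(before) and looks_chinese_japanese_or_yi(after)


pairs = [("\u6f22", "\u5b57"), ("\u3042", "\u30a2"), ("\u25fe", "\u25fe"), ("\u4dc0", "\u4dc0")]
for before, after in pairs:
    print(f"U+{ord(before):04X} U+{ord(after):04X}", script_allows_transformation(before, after))
```

With a gate like this, symbols that can appear in any script (U+25FE, the Yijing hexagrams) never trigger the transformation on their own, whatever their block or EAW value.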

-- 
GitHub Notification of comment by faceless2
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/337#issuecomment-590822665 using your GitHub account

Received on Tuesday, 25 February 2020 11:33:08 UTC