Re: [csswg-drafts] [css-text-4] Add support for content-detection, phrase-based line breaking (#6730) from Koji Ishii via GitHub on 2022-03-18 (public-css-archive@w3.org from March 2022)

From: Koji Ishii via GitHub <sysbot+gh@w3.org>
Date: Fri, 18 Mar 2022 17:24:38 +0000
To: public-css-archive@w3.org
Message-ID: <issue_comment.created-1072628225-1647624276-sysbot+gh@w3.org>
Thanks for the feedback again and sorry for my belated replies.

@r12a 
> Is the intention is to treat natural line breaking as a separate set of controls from those used for kinsoku-like rules (punctuation wrapping) and the strict|normal|loose controls for controlling line-breaking around small kana?

Yes. Authors want to control the strictness of the Kinsoku rules separately from this feature.

@litherum:
> 1. I think this (abstract) feature is a good idea. Browsers can do much better in their line breaking than they do.

Fully agree with you.

> 2. Guarding this behind an opt-in is a good idea for performance. (Edit: Actually, depending on how this feature is scoped, it may actually make sense to experiment with enabling it by default. Benchmarks and a proof-of-concept implementation would be useful.)

Not only for performance; this should also be the author's choice.

For example, the default line breaking of Japanese TeX is "balanced" with normal break opportunities (that is, between any two characters except where the Kinsoku rules forbid a break). This is because a ragged right edge is a fairly large penalty in CJK line breaking. Authors normally prefer less ragged-right lines over phrase-based break opportunities for body text, but may prefer phrase-based line breaking for display text. They may want to use "balanced" line breaking in both cases.

> 3. I'm not sure that this is implementable today. I'm not aware that either Foundation or ICU has any functionality to determine these breaking locations. Without a demonstration of how to implement this, I'd be against this proposal.

[ICU 71](https://github.com/unicode-org/icu/releases/tag/release-71-rc) supports Japanese phrase-based line breaking with a new value for the [`lw`](https://www.unicode.org/reports/tr35/#UnicodeLineBreakWordIdentifier) keyword. The `lw` keyword was chosen because the phrase-based behavior is mutually exclusive with `break-all` and `keep-all`.

Android 13 supports [wrap text by Bunsetsu (the smallest unit of words that sounds natural) or phrases](https://android-developers.googleblog.com/2022/03/second-preview-android-13.html#:~:text=wrap%20text%20by%20Bunsetsu%20(the%20smallest%20unit%20of%20words%20that%20sounds%20natural)%20or%20phrases).

> 5. I wonder whether the algorithm for this new line breaking mode would be "it's just like the greedy approach we have today, but the opportunities are in different places" or if it's more complicated like "you can break in some particular position, but there's a cost, and it's only worth it if breaking there means you can choose better positions in the rest of the paragraph"

Greedy versus paragraph-level balanced breaking is a related topic, but the two should be controllable separately, at least for some languages such as Japanese. I'm not sure that other languages, such as English, would always want to turn both switches on or off together.
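
To illustrate why these are two separate switches, here is a minimal sketch, not taken from any browser or from TeX. It assumes every character is one unit wide, and the names `greedy_lines` and `balanced_lines` are hypothetical. The list of units decides where breaks are allowed (one unit per character, or one unit per phrase), and the function decides how the lines are filled. The letters stand in for Japanese phrases.

```python
from typing import List


def greedy_lines(units: List[str], width: int) -> List[str]:
    """Put as many units as fit on each line, then start a new line."""
    lines: List[str] = []
    current = ""
    for unit in units:
        if current and len(current) + len(unit) > width:
            lines.append(current)
            current = unit
        else:
            current += unit
    if current:
        lines.append(current)
    return lines


def balanced_lines(units: List[str], width: int) -> List[str]:
    """Choose breaks that minimize the sum of squared unused space per line.

    Assumption: a simple dynamic program that also counts the last line.
    This is only an illustration, not TeX's actual algorithm.
    """
    n = len(units)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of laying out units[i:]
    best[n] = 0
    nxt = [n] * (n + 1)              # nxt[i]: index where the line starting at i ends
    for i in range(n - 1, -1, -1):
        length = 0
        for j in range(i + 1, n + 1):
            length += len(units[j - 1])
            if length > width and j > i + 1:
                break                # too long; a single oversized unit is still allowed
            slack = max(width - length, 0)
            cost = slack * slack + best[j]
            if cost < best[i]:
                best[i], nxt[i] = cost, j
    lines: List[str] = []
    i = 0
    while i < n:
        lines.append("".join(units[i:nxt[i]]))
        i = nxt[i]
    return lines


phrases = ["AAAA", "BB", "CC", "D"]   # placeholder "phrases"
chars = list("".join(phrases))        # a break opportunity at every character

print(greedy_lines(chars, 7))      # ['AAAABBC', 'CD']
print(greedy_lines(phrases, 7))    # ['AAAABB', 'CCD']
print(balanced_lines(phrases, 7))  # ['AAAA', 'BBCCD']
```

Switching the units changes where lines may break; switching the function changes how ragged the result is. Either can be changed without the other.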

> 6. We (or Unicode) would also need to determine how this would work in all languages, not just Japanese. English has phrases, and there is an art of laying out a title (e.g. in print publications). Would it apply there?

Excellent point, thank you for raising it. I believe it should apply there too, as the Apple website does and as I replied to @r12a above, but we are not yet sure exactly how it should work. I think that, at this moment, "it might apply to other languages" is sufficient to define a property in CSS. It's similar to how CSS defines a CJK "word" today: sometimes a compound noun is a word, sometimes it's multiple words, and it varies depending on the dictionary, the era, or what authors feel is more "natural".

> 7. `wrap-inside:avoid` is kind of similar to `hyphens:manual`, but the real feature here would be the equivalent of `hyphens:auto`. That's why I don't think that `wrap-inside:avoid` is sufficient. You'd want a single switch to flip on this kind of line breaking, rather than having the author have to implement it themselves. And, if the author actually wants to implement it themselves, `wrap-inside:avoid` is there for them, and that will cause consistent renderings across browsers.

Agreed. Also, for pre-processors like [BudouX](https://github.com/google/budoux), wrapping each phrase in a span is complicated. For example:
```html
<div>Phra<span style="border: 1px solid blue">se1 Phra</span>se2</div>
```
It's not easy to wrap "Phrase1" and "Phrase2" each in a span, because the existing span crosses the phrase boundary. Maybe one can adjust borders, but there are more cases -- `background-image`, `filter`, etc.
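
To make the difficulty concrete, here is a rough sketch of the first step such a pre-processor would need. It is not BudouX code; `split_at_boundaries` and the offsets are hypothetical. It cuts the existing styled range at every phrase boundary so that the pieces can be nested inside per-phrase spans.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end) character offsets


def split_at_boundaries(span: Range, phrases: List[Range]) -> List[Range]:
    """Cut `span` at every phrase start or end that falls strictly inside it."""
    start, end = span
    cuts = sorted({p for rng in phrases for p in rng if start < p < end})
    points = [start, *cuts, end]
    return list(zip(points, points[1:]))


text = "Phrase1 Phrase2"
phrases = [(0, 7), (8, 15)]  # "Phrase1" and "Phrase2"
styled = (4, 12)             # the bordered span "se1 Phra" from the markup above

pieces = split_at_boundaries(styled, phrases)
print(pieces)                            # [(4, 7), (7, 8), (8, 12)]
print([text[a:b] for a, b in pieces])    # ['se1', ' ', 'Phra']
```

The single bordered span becomes three fragments, and each fragment needs its own element. Each element then draws its own border, background, or filter, so the rendering no longer matches the original unless the styles are adjusted as well.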

-- 
GitHub Notification of comment by kojiishi
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/6730#issuecomment-1072628225 using your GitHub account


Received on Friday, 18 March 2022 17:24:39 UTC