Re: [csswg-drafts] [css-text-4] Add support for content-detection, phrase-based line breaking (#6730) from r12a via GitHub on 2022-02-15 (public-css-archive@w3.org from February 2022)

From: r12a via GitHub <sysbot+gh@w3.org>
Date: Tue, 15 Feb 2022 17:34:54 +0000
To: public-css-archive@w3.org
Message-ID: <issue_comment.created-1040570582-1644946492-sysbot+gh@w3.org>

Here are some more questions that occurred to me while thinking this through.

1. Should we add a `phrase` parameter to produce the segmentation wanted here, or should we try to convince the people writing segmentation algorithms that (at least for Japanese and Chinese) they should segment on phrases by default, in which case we'd add a `word` parameter to do the opposite, i.e. break particles and such apart? Looking at the Chinese examples, the phrase approach seems like the better default. I'm not sure whether that raises backwards-compatibility issues.
2. I'm finding it hard to see the Thai case as similar. My understanding is that the Thai case is to do with whether or not to aggressively/accurately break compound words. I wonder to what extent that needs linguistic understanding, so that we don't break things that really shouldn't be split (like breaking 'blackbird' in English). It may be that we could provide a preference for situations where the segmentation is down to personal preference, but the segmentation algorithms would need to allow for that choice by beefing up the sophistication of their parsing. But it doesn't seem to me to involve the same set of criteria as keeping phrasal parts together.
3. I can see the potential for keeping prepositions with their associated words in languages that put spaces between words, but I assume that would require the user agent to apply language-specific linguistic analysis for a large number of languages. Is that feasible?

(Btw, fwiw, no one has mentioned it yet, and I don't remember seeing it in the css-text-4 spec, but if you want to do this kind of thing manually then U+2060 WORD JOINER is your friend. It does the opposite of ZWSP/`<wbr>`: it prevents a line break at its position instead of allowing one.)
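
For anyone who wants to try the manual approach, here is a rough TypeScript sketch. It assumes the phrase boundaries are supplied by hand (no segmentation algorithm is involved), and `keepTogether` and `markPhrases` are hypothetical helper names, not part of any spec or library. Because CJK text normally allows a break between most characters, the sketch puts a WORD JOINER between every code point inside a phrase, and a ZWSP between phrases to make the allowed break points explicit.

```typescript
// Hypothetical helpers for illustration only.
const WORD_JOINER = "\u2060";      // U+2060: no line break at this position
const ZERO_WIDTH_SPACE = "\u200B"; // U+200B: line break allowed at this position

// Glue every code point of one phrase together so it cannot wrap internally.
// Array.from splits on code points, not grapheme clusters, so text with
// combining marks would need a more careful split than this sketch does.
function keepTogether(phrase: string): string {
  return Array.from(phrase).join(WORD_JOINER);
}

// Join hand-segmented phrases, leaving a break opportunity only between them.
function markPhrases(phrases: string[]): string {
  return phrases.map(keepTogether).join(ZERO_WIDTH_SPACE);
}

// Example input; the phrase boundaries here are chosen by hand.
const marked: string = markPhrases(["今日は", "天気が", "いいです"]);
```

The resulting string can be used as ordinary text content; whether a given user agent then wraps only at the ZWSP positions would still need checking in practice.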

--
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/6730#issuecomment-1040570582 using your GitHub account

--
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Tuesday, 15 February 2022 17:34:55 UTC