Re: [csswg-drafts] [css-text-4] Add support for content-detection, phrase-based line breaking (#6730)

First thoughts, a small correction and then a question or two.

> A phrase often consists of multiple words. The following Japanese example consists of 6 words, but has 3 phrases.

私 | の | 名前 | は | 中野 | です。
-- | -- | -- | -- | -- | --
My |   | name | is | Nakano | .
Noun | Particle | Noun | Particle | Noun | Auxiliary verb
Phrase 1 |   | Phrase 2 |   | Phrase 3 |  

The table should read:

私 | の | 名前 | は | 中野 | です。
-- | -- | -- | -- | -- | --
My |   | name | topic marker | Nakano |  is.
Noun | Particle | Noun | Particle | Noun | Auxiliary verb
Phrase 1 |   | Phrase 2 |   | Phrase 3 |  

Note that, linguistically, the topic particle は actually applies to the whole phrase '私の名前', not just 名前. So we should probably define clearly what we mean by 'phrase'.

My initial suspicion is that this is actually only relevant to Japanese, and aims to prevent particles from wrapping without the preceding word. I think that in most languages attached suffixes are not separated from the word, and spaces are used around both, as mentioned for Korean. (Mongolian has gaps between some words and suffixes, but these are created by dedicated characters such as NNBSP (U+202F NARROW NO-BREAK SPACE) or MVS (U+180E MONGOLIAN VOWEL SEPARATOR).)

I'm curious to understand the application for Thai, which I thought doesn't have particles of this kind, and where line break opportunities are generally indicated by heuristics that divide words, or by use of ZWSP (U+200B ZERO WIDTH SPACE). Do you have examples of where Thai needs help?

Do you also have examples of Chinese needing to keep together things that are associated with an adjoining word in this way?

Since this mentions non-CJK languages, is there an idea that languages that separate words with spaces will also need this option?

I find myself wondering whether the issue at hand is rather how word-boundary detection works, and whether we should instead define a property for that. Note, for example, that if you double-click on 名前は the browser usually highlights the compound noun and the particle separately. However, perhaps one could define a property that tells the browser to keep nouns and particles together as a single 'word' unit. That kind of instruction may be more widely useful than just for line-breaking, e.g. it may change the 'word' selection behaviour too.
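
To make the 'single word unit' idea concrete, here is a minimal sketch in Python, not a proposal for how a browser would implement it. It assumes the text has already been split into tagged words (the tokens and tags are taken from the table above), and the function name `group_into_units` and the attach rule are hypothetical. It simply glues each particle or auxiliary verb onto the word before it, so it does not capture the point that は linguistically belongs to the whole of '私の名前'.

```python
# Hypothetical sketch: group pre-tagged tokens into unbreakable units by
# attaching particles and auxiliary verbs to the word that precedes them.
# The tag names and the attach rule are assumptions for illustration only.

ATTACHING_TAGS = {"particle", "auxiliary verb"}


def group_into_units(tokens):
    """Merge each attaching token into the unit that precedes it."""
    units = []
    for text, tag in tokens:
        if tag in ATTACHING_TAGS and units:
            # Keep the particle with the preceding word.
            units[-1] += text
        else:
            # Any other token starts a new unit.
            units.append(text)
    return units


# Tokens and tags copied from the table in this thread.
TOKENS = [
    ("私", "noun"),
    ("の", "particle"),
    ("名前", "noun"),
    ("は", "particle"),
    ("中野", "noun"),
    ("です。", "auxiliary verb"),
]

UNITS = group_into_units(TOKENS)
print(" | ".join(UNITS))  # 私の | 名前は | 中野です。
```

With units like these, line breaks (and possibly double-click selection) would only be allowed between units, which yields the 3 'phrases' from the 6 words in the example.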

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/6730#issuecomment-1023116751 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Thursday, 27 January 2022 11:34:33 UTC