[csswg-drafts] line-break, word-break: language unclear, and a new testcase. from Mike Bremford via GitHub on 2018-04-13 (public-css-archive@w3.org from April 2018)

From: Mike Bremford via GitHub <sysbot+gh@w3.org>
Date: Fri, 13 Apr 2018 10:06:15 +0000
To: public-css-archive@w3.org
Message-ID: <issues.opened-314046884-1523613974-sysbot+gh@w3.org>
faceless2 has just created a new issue for https://github.com/w3c/csswg-drafts:

== line-break, word-break: language unclear, and a new testcase. ==
### The language for line-break and (in particular) word-break is unclear with regard to what changes are required to the UAX14 algorithm.

I've made a pull request for a new testcase we've been working up at https://github.com/w3c/web-platform-tests/pull/10420. This testcase is complete but will require review due to the ambiguities described below.

While developing this it became apparent that some of the language in the spec was a bit unclear - certainly to me, and, as I'm seeing different results with this testcase in different browsers, maybe to others.

First, I expect I am not the first to point out that "word-break" and "line-break" have some considerable overlap. As described, breaks within words like ちょっと (UAX14 classes ID CJ CJ ID) are covered by the line-break rule, although this is a single word. And of course, "line-break: anywhere" will break words. Some sort of clarifying note as to the interaction of these two features might help.
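To make the ちょっと example concrete, here is a minimal Python sketch of where breaks fall inside that word, assuming CJ is resolved as UAX14 LB1 suggests (NS for strict breaking, ID otherwise) and that only LB21 and LB31 apply. The `CLASSES` table and the function names are hypothetical, written by hand for this one word; this is an illustration of the overlap, not a reference implementation.

```python
# Hand-written UAX14 classes for the four characters of ちょっと (ID CJ CJ ID).
# A real implementation would read these from the Unicode line break data.
CLASSES = {"ち": "ID", "ょ": "CJ", "っ": "CJ", "と": "ID"}


def resolve_cj(cls: str, line_break: str) -> str:
    """Resolve CJ to NS for strict line breaking, and to ID otherwise."""
    if cls == "CJ":
        return "NS" if line_break == "strict" else "ID"
    return cls


def breaks(word: str, line_break: str) -> list[int]:
    """Return each index i where a break is allowed between word[i-1] and word[i].

    Only two UAX14 rules matter for this word: LB21 (no break before NS)
    and LB31 (break everywhere else).
    """
    resolved = [resolve_cj(CLASSES[ch], line_break) for ch in word]
    return [i for i in range(1, len(resolved)) if resolved[i] != "NS"]


print(breaks("ちょっと", "strict"))  # [3]
print(breaks("ちょっと", "normal"))  # [1, 2, 3]
```

In both cases the breaks inside this single word are decided by line-break alone, with word-break left at its initial value.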

Specific areas of the text that are a bit confusing or incomplete:

* word-break states it "_controls whether a soft wrap opportunity exists between adjacent typographic letter units (or other typographic character units belonging to the NU, AL, AI, or ID Unicode line breaking classes)_" - although the note at the bottom of "keep-all" explicitly mentions Korean, the classes H2, H3, JL, JT and JV are excluded from this list. I don't know Korean so I'm unsure if that is a deliberate omission. It also doesn't mention classes CJ or NS, and again I'm not sure if this is a deliberate omission. Given the overlap with line-break it may be better to dump this descriptive paragraph completely in favour of exact descriptions of the behaviour of each property with regard to UAX14, as I've added below.

* The language of "word-break: keep-all" is still a bit unclear with regard to the changes it mandates to UAX14. For example, "_Breaking is forbidden within “words”: implicit soft wrap opportunities between typographic letter units are suppressed_" makes no mention of character class, so isn't much help if you're implementing this. UAX14 describes this same customization as used for "ragged" Korean text, and specifies "_... breaking after spaces (as in Latin text)_". I believe the intention here is to treat **all** ideographic characters as if they were Latin text.

* line-break: anywhere is described as providing "_a soft wrap opportunity around every typographic character unit, including around any punctuation character or preserved spaces, or in the middle of words, disregarding any prohibition against line breaks introduced by characters with the GL, JW, or ZJW character class_". It then states in the note that "_This value triggers the line breaking rules typically seen in terminals._". If that's the intention then the mention of GL, JW and ZJW (which should be WJ and ZWJ, by the way) is superfluous and confusing. The final clause should simply be "disregarding any prohibition", full stop. Literally anywhere in the text is a valid break point, even before U+0020 (space).

* What happens if I specify "word-break: keep-all; line-break: anywhere"? The two rules contradict each other; which one wins?

* Using the language of the text as an input to the algorithm seems a bit odd to me. Is there any reason "loose-cj" and "normal-cj" values for line-break could not be used to achieve the same thing? Not really a serious issue and I can't think of a specific reason why it's a problem, it just feels out of character with the rest of the spec so thought I'd raise it while I'm typing.

We've interpreted the various property values as having the following meaning. Whether they're correct or not is almost a secondary issue at this stage; what I'm getting at is that these definitions are exact enough to work from, so I think it would be great if the descriptions for these property values were rewritten in this form, i.e. detailing exactly what changes need to be made to UAX14.

* "word-break: normal" controls breakpoints between AI, AL, CJ, H2, H3, HL, ID, JL, JT and JV exactly as defined in UAX14. This allows breakpoints in the middle of CJK words, and denies them in non-CJK words. _(note: existing description states "customary rules as described above", which is nowhere near exact enough)_

* "word-break: break-all" treats any characters of class AI, AL, HL, NU and SA as class ID for the purposes of UAX14. _(note: class AI is not listed in the current description; it probably should be, as UAX14 LB1 suggests that class AI is resolved to another class. HL was also missing; I think it should be treated as for AL)_

* "word-break: keep-all" treats any characters of class AI, CJ, H2, H3, ID, JL, JT and JV as if they were class AL for the purposes of UAX14. In other words, CJK text will be broken exactly as if it were Latin text, i.e. at spaces.

* "line-break: anywhere" allows a breakpoint between any two typographic character units. The restrictions defined in UAX14 do not apply, and the value of "word-break" is ignored.
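
The four definitions above can be sketched as a class remapping applied before the UAX14 pair rules run. This is only an illustration of the interpretation proposed in this issue, not of what the spec currently mandates. The names `remap_class` and `may_break` are hypothetical, and `uax14_allows` stands in for a real UAX14 implementation supplied by the caller.

```python
from typing import Callable

# Classes remapped under the interpretation proposed in this issue.
BREAK_ALL_TO_ID = {"AI", "AL", "HL", "NU", "SA"}
KEEP_ALL_TO_AL = {"AI", "CJ", "H2", "H3", "ID", "JL", "JT", "JV"}


def remap_class(cls: str, word_break: str) -> str:
    """Return the UAX14 class to use for cls under the given word-break value."""
    if word_break == "break-all" and cls in BREAK_ALL_TO_ID:
        return "ID"
    if word_break == "keep-all" and cls in KEEP_ALL_TO_AL:
        return "AL"
    return cls  # word-break: normal leaves UAX14 untouched


def may_break(
    before: str,
    after: str,
    word_break: str,
    line_break: str,
    uax14_allows: Callable[[str, str], bool],
) -> bool:
    """Decide whether a break is allowed between two adjacent classes."""
    if line_break == "anywhere":
        return True  # UAX14 restrictions and word-break are both ignored
    return uax14_allows(
        remap_class(before, word_break), remap_class(after, word_break)
    )


def toy_rules(a: str, b: str) -> bool:
    """Toy stand-in for UAX14, valid for AL and ID only: no break between
    two AL characters (LB28), a break wherever an ID is involved (LB31)."""
    return "ID" in (a, b)


print(may_break("ID", "ID", "keep-all", "normal", toy_rules))    # False
print(may_break("AL", "AL", "break-all", "normal", toy_rules))   # True
print(may_break("ID", "ID", "keep-all", "anywhere", toy_rules))  # True
```

Under this reading, "word-break: keep-all; line-break: anywhere" resolves in favour of line-break, as the last call shows.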


*(note: this issue originally posted against the wrong repository at https://github.com/w3c/web-platform-tests/issues/10423)*


Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/2559 using your GitHub account
Received on Friday, 13 April 2018 10:06:19 UTC