- From: Koji Ishii <kojiishi@gmail.com>
- Date: Thu, 15 Oct 2015 19:00:38 +0900
- To: Florian Rivoal <florian@rivoal.net>
- Cc: "www-style@w3.org" <www-style@w3.org>, CJK discussion <public-i18n-cjk@w3.org>
Thank you for your reply.

On Thu, Oct 15, 2015 at 2:46 PM, Florian Rivoal <florian@rivoal.net> wrote:
>>
>> On 15 Oct 2015, at 01:33, Koji Ishii <kojiishi@gmail.com> wrote:
>>
>> Several months ago, Blink changed the implementation of "word-break:
>> break-all"[1] to what the spec defines:
>>
>>   may break between any two typographic letter units
>>
>> This value is, as written in the spec, designed to be easy to
>> implement without sacrificing CJK line break rules, since we believed
>> its primary use is in CJK.
>>
>> However, since our change, I hear that it does not work as expected
>> by Latin and other non-CJK authors such as Arabic, and Blink is the
>> only browser that is broken. The examples I've got expect a break
>> anywhere in "AT&T" or "*****", and Trident/Gecko/WebKit all break
>> these strings.
>>
>> So I'd like to propose to change the spec so that it can serve both
>> CJK and non-CJK usages, and is more interoperable with existing
>> implementations.
>>
>> I checked the behavior for ASCII code points here[2], but in short:
>>
>> Trident/Edge: Breaks almost anywhere except before closing
>> parenthesis, period, etc. "&" and "*" in the examples above can break
>> before and after.
>> Gecko/WebKit: Breaks anywhere.
>>
>> Since what Gecko/WebKit does is quite unfortunate for CJK, I'm
>> thinking of being similar to what Trident/Edge does.
>>
>> As far as I can see from the ASCII code range, the rules are:
>>
>> * Not break before !"'),./:;?]}
>> * Not break after "$'(-[\{
>>
>> So by translating them to UAX#14 Line Breaking classes, rules would be:
>>
>> * Not break before EX, QU, CP, IS, SY
>> * Not break after QU, PR, OP, HY, PR
>>
>> I think I'll need to check side-effects and Trident/Edge behavior in
>> a little more detail, but would appreciate opinions/feedback if any.
>>
>> [1] https://drafts.csswg.org/css-text-3/#valdef-word-break-break-all
>> [2] http://kojiishi.github.io/playgrounds/line-break-matrix/?word-break=break-all
>
> I agree with your concern for this area, as breaking at the right place in CJK
> and other languages that do not separate words with space is important.
>
> However, I do not think it is possible to agree on a normative and exhaustive list
> of characters before or after which breaks are forbidden or allowed.
>
> For one, the list is certainly different across languages, and even within the
> same language (at least for Japanese), different publishing traditions insist on
> house rules with slightly different sets. (See wikipedia[1] for info on these
> two facts).

For languages, we generally honor UAX#14. For styles, this is only about when the author sets word-break: break-all, so it is quite clear that the author expects the "minimum" style.

> This was taken into account when writing the specification, and it currently does not
> call for the poor behavior for CJK you have observed in Gecko/WebKit, or for Latin text
> in Blink.
>
> The definition of break-all says: "lines may break between any two typographic letter
> units **except where forbidden by the line-break property**"
>
> "line-break:auto" says: "The UA determines the set of line-breaking restrictions to use".
>
> I'd welcome adding a clarification to that sentence by saying something along the lines of:
> "The restrictions should vary depending on the language, and are particularly important for
> languages which do not use spaces as word separators to avoid opening or closing punctuation
> being placed at the end or beginning of the line".
>
> The second half of this sentence could also be a note.
>
> I think this is already allowed and implied by the current text, but the fact that
> we are in the situation you describe means it sounds worth being explicit about it.

The problem is that the text says "two typographic **letters**", which is defined as Unicode general categories L and N. So when normal breaking does not break between S or P characters (such as "*"), then as far as I read it, a break between two "*" is still not allowed even with break-all.

If it's changed to "two typographic **characters**", I can read it as already allowed. Does this make sense, or do you think I read it incorrectly?

> As for allowing author control over the precise set of characters involved in these
> restrictions, the specification already has a note pointing out that future levels
> of the spec may need to introduce finer controls about this, which I think is
> appropriate.

Yeah, I agree, that's what fantasai and I intended. That's a separate topic.

/koji
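
To make the difference between the two readings concrete, here is a rough Python sketch, not taken from any browser or from the spec. It compares the "letters only" reading (break-all adds opportunities only between two characters of general category L or N) with the ASCII rules proposed in the quoted message. The function names are hypothetical, code points stand in for typographic units, and normal UAX#14 break opportunities are ignored, so this only illustrates the argument.

```python
import unicodedata

# ASCII sets copied from the lists in the quoted proposal.
NO_BREAK_BEFORE = set("!\"'),./:;?]}")
NO_BREAK_AFTER = set("\"$'(-[\\{")


def is_letter_unit(ch):
    """True if the general category is L* or N* (the 'letter' reading)."""
    return unicodedata.category(ch)[0] in ("L", "N")


def breaks_letters_only(text):
    """Indices i where a break before text[i] is allowed, assuming
    break-all only adds opportunities between two L/N characters."""
    return [
        i for i in range(1, len(text))
        if is_letter_unit(text[i - 1]) and is_letter_unit(text[i])
    ]


def breaks_proposed(text):
    """Indices i where a break before text[i] is allowed under the
    proposed 'not before' / 'not after' ASCII rules."""
    return [
        i for i in range(1, len(text))
        if text[i] not in NO_BREAK_BEFORE
        and text[i - 1] not in NO_BREAK_AFTER
    ]


for sample in ("AT&T", "*****", "(a)"):
    print(sample, breaks_letters_only(sample), breaks_proposed(sample))
```

Under these assumptions, "AT&T" and "*****" get a break opportunity at every position with the proposed rules, while the letters-only reading leaves "*****" unbreakable, which matches the mismatch described above.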
Received on Thursday, 15 October 2015 10:01:27 UTC