Re: [css-text-3] word-break: break-all from Florian Rivoal on 2015-10-15 (www-style@w3.org from October 2015)

From: Florian Rivoal <florian@rivoal.net>
Date: Thu, 15 Oct 2015 14:46:29 +0900
To: Koji Ishii <kojiishi@gmail.com>
Cc: "www-style@w3.org" <www-style@w3.org>, CJK discussion <public-i18n-cjk@w3.org>
Message-Id: <755D5F64-C86B-426B-A96E-C05B9A10E175@rivoal.net>

> 
> On 15 Oct 2015, at 01:33, Koji Ishii <kojiishi@gmail.com> wrote:
> 
> Several months ago, Blink changed the implementation of "word-break:
> break-all"[1] to match what the spec defines:
> 
>  may break between any two typographic letter units
> 
> This value is, as written in the spec, designed to be easy to
> implement without sacrificing CJK line break rules, since we believed
> its primary use is in CJK.
> 
> However, since our change, I hear from authors of Latin and other
> non-CJK scripts, such as Arabic, that it does not work as expected,
> and that Blink is the only browser that is broken. The examples I've
> been given expect breaks anywhere in "AT&T" or "*****", and
> Trident/Gecko/WebKit all break these strings.
> 
> So I'd like to propose to change the spec so that it can serve both
> CJK and non-CJK usages, and is more interoperable with existing
> implementations.
> 
> I checked the behavior for ASCII code points here[2], but in short:
> 
> Trident/Edge: Breaks almost anywhere except before closing
> parenthesis, period, etc. "&" and "*" in the examples above can break
> before and after.
> Gecko/WebKit: Breaks anywhere.
> 
> Since what Gecko/WebKit does is quite unfortunate for CJK, I'm
> thinking of being similar to what Trident/Edge does.
> 
> As far as I can see from ASCII code range, the rules are:
> 
> * Not break before !"'),./:;?]}
> * Not break after "$'(-[\{
> 
> So by translating them to UAX#14 Line Breaking classes, rules would be:
> 
> * Not break before EX, QU, CP, IS, SY
> * Not break after QU, PR, OP, HY
> 
> I think I'll need to check side-effects and Trident/Edge behavior in
> a little more detail, but would appreciate opinions/feedback if any.
> 
> [1] https://drafts.csswg.org/css-text-3/#valdef-word-break-break-all
> [2] http://kojiishi.github.io/playgrounds/line-break-matrix/?word-break=break-all
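
As an illustration only, the ASCII rules quoted above could be sketched as
follows. This assumes nothing beyond the two character lists in the quoted
message; the function name break_opportunities is hypothetical, and the
sketch is not any browser's actual implementation.

```python
# Character lists copied from the quoted message (ASCII range only).
NO_BREAK_BEFORE = set("!\"'),./:;?]}")
NO_BREAK_AFTER = set("\"$'(-[\\{")


def break_opportunities(text):
    """Return each index i where a break is allowed between
    text[i - 1] and text[i] under the two quoted rules.

    Hypothetical helper for illustration; it ignores spaces,
    non-ASCII characters, and everything else in UAX#14.
    """
    return [
        i
        for i in range(1, len(text))
        if text[i] not in NO_BREAK_BEFORE
        and text[i - 1] not in NO_BREAK_AFTER
    ]


if __name__ == "__main__":
    for sample in ("AT&T", "*****", "(a)", "a.b"):
        print(repr(sample), break_opportunities(sample))
```

Under these two lists, "AT&T" and "*****" can break at every position,
while "(a)" cannot break at all.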

I agree with your concern for this area, as breaking at the right place in CJK
and other languages that do not separate words with space is important.

However, I do not think it is possible to agree on a normative and exhaustive list
of characters before or after which breaks are forbidden or allowed.

For one, the list is certainly different across languages, and even within the
same language (at least for Japanese), different publishing traditions insist on
house rules with slightly different sets. (See Wikipedia[1] for info on these
two facts.)

This was taken into account when writing the specification, and it currently does not
call for the poor behavior for CJK you have observed in Gecko/WebKit, or for Latin text
in Blink.

The definition of break-all says: "lines may break between any two typographic letter
units **except where forbidden by the line-break property**"

"line-break:auto" says: "The UA determines the set of line-breaking restrictions to use".

I'd welcome adding a clarification to that sentence by saying something along the lines of:
"The restrictions should vary depending on the language, and are particularly important for
languages which do not use spaces as word separators, to avoid opening or closing punctuation
being placed at the end or beginning of the line, respectively".

The second half of this sentence could also be a note.

I think this is already allowed and implied by the current text, but the fact that
we are in the situation you describe suggests it is worth being explicit about it.

As for allowing author control over the precise set of characters involved in these
restrictions, the specification already has a note pointing out that future levels
of the spec may need to introduce finer controls over this, which I think is
appropriate.

 - Florian

[1] https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages

Received on Thursday, 15 October 2015 05:47:03 UTC