Re: [css-text-3] word-break: break-all from Koji Ishii on 2015-10-15 (public-i18n-cjk@w3.org from October to December 2015)

From: Koji Ishii <kojiishi@gmail.com>
Date: Thu, 15 Oct 2015 19:00:38 +0900
To: Florian Rivoal <florian@rivoal.net>
Cc: "www-style@w3.org" <www-style@w3.org>, CJK discussion <public-i18n-cjk@w3.org>
Message-ID: <CAN9ydbVTZORVbpVB5Nced3gydjEw+OK51vp99mFenNeBhBhL1Q@mail.gmail.com>
Thank you for your reply.

On Thu, Oct 15, 2015 at 2:46 PM, Florian Rivoal <florian@rivoal.net> wrote:
>>
>> On 15 Oct 2015, at 01:33, Koji Ishii <kojiishi@gmail.com> wrote:
>>
>> Several months ago, Blink changed the implementation of "word-break:
>> break-all"[1] to match what the spec defines:
>>
>>  may break between any two typographic letter units
>>
>> This value is, as written in the spec, designed to be easy to
>> implement without sacrificing CJK line break rules, since we believed
>> its primary use is in CJK.
>>
>> However, since our change, I have heard from authors of Latin and
>> other non-CJK scripts such as Arabic that it does not work as
>> expected, and that Blink is the only browser that is broken. The
>> examples I've received expect a break to be possible anywhere in
>> "AT&T" or "*****", and Trident/Gecko/WebKit all break these strings.
>>
>> So I'd like to propose to change the spec so that it can serve both
>> CJK and non-CJK usages, and is more interoperable with existing
>> implementations.
>>
>> I checked the behavior for ASCII code points here[2], but in short:
>>
>> Trident/Edge: Breaks almost anywhere except before closing
>> parenthesis, period, etc. "&" and "*" in the examples above can break
>> before and after.
>> Gecko/WebKit: Breaks anywhere.
>>
>> Since what Gecko/WebKit does is quite unfortunate for CJK, I'm
>> thinking of something similar to what Trident/Edge does.
>>
>> As far as I can see from ASCII code range, the rules are:
>>
>> * Not break before !"'),./:;?]}
>> * Not break after "$'(-[\{
>>
>> So by translating them to UAX#14 Line Breaking classes, rules would be:
>>
>> * Not break before EX, QU, CP, IS, SY
>> * Not break after QU, PR, OP, HY
>>
>> I think I'll need to check side effects and Trident/Edge behavior in
>> a little more detail, but would appreciate opinions/feedback if any.
>>
>> [1] https://drafts.csswg.org/css-text-3/#valdef-word-break-break-all
>> [2] http://kojiishi.github.io/playgrounds/line-break-matrix/?word-break=break-all
>
> I agree with your concern for this area, as breaking at the right place in CJK
> and other languages that do not separate words with space is important.
>
> However, I do not think it is possible to agree on a normative and exhaustive list
> of characters before or after which breaks are forbidden or allowed.
>
> For one, the list is certainly different across languages, and even within the
> same language (at least for Japanese), different publishing traditions insist on
> house rules with slightly different sets. (See wikipedia[1] for info on these
> two facts).

For languages, we generally honor UAX#14. As for styles, this only
applies when the author sets word-break: break-all, so it is quite
clear that the author expects the "minimum" style.
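
To make the "minimum" restrictions concrete, the ASCII rules quoted above
can be written as a small Python sketch. This is illustrative only: the
two character sets are copied from the quoted lists, while the function
names are hypothetical, and it neither models UAX#14 classes nor
reproduces any browser's actual implementation.

```python
from typing import List

# Character sets copied from the ASCII rules in the quoted message.
NO_BREAK_BEFORE = set("!\"'),./:;?]}")
NO_BREAK_AFTER = set("\"$'(-[\\{")


def may_break_between(prev: str, nxt: str) -> bool:
    """True if the sketched rules allow a break between prev and nxt."""
    return prev not in NO_BREAK_AFTER and nxt not in NO_BREAK_BEFORE


def break_opportunities(text: str) -> List[int]:
    """Indexes i such that a break is allowed just before text[i]."""
    return [
        i for i in range(1, len(text))
        if may_break_between(text[i - 1], text[i])
    ]
```

Under these rules "AT&T" and "*****" can break at every position, while
"(a)" cannot break at all.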

> This was taken into account when writing the specification, and it currently does not
> call for the poor behavior for CJK you have observed in Gecko/WebKit, or for Latin text
> in Blink.
>
> The definition of break-all says: "lines may break between any two typographic letter
> units **except where forbidden by the line-break property**"
>
> "line-break:auto" says: "The UA determines the set of line-breaking restrictions to use".
>
> I'd welcome adding a clarification to that sentence by saying something along the lines of:
> "The restrictions should vary depending on the language, and are particularly important for
> languages which do not use spaces as word separators to avoid opening or closing punctuation
> being placed at the end or beginning of the line".
>
> The second half of this sentence could also be a note.
>
> I think this is already allowed and implied by the current text, but the fact that
> we are in the situation you describe means it sounds worth being explicit about it.

The problem is that the text says "two typographic **letters**", which
are defined as Unicode general categories L and N. So where normal
breaking does not break between S or P characters (such as "*"), as far
as I can read, breaking between two "*" is not allowed even when
break-all is set.

If it's changed to "two typographic **characters**", I can read it
as already being allowed.

Does this make sense, or do you think I read it incorrectly?
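
The reading above can be checked with a minimal Python sketch. It assumes
that testing the Unicode general category of each code point is a
reasonable approximation of a "typographic letter unit"; the spec defines
the term on typographic character units, not code points, and the
function names here are hypothetical.

```python
import unicodedata
from typing import List


def is_letter_or_number(ch: str) -> bool:
    """True if the code point's general category is L* or N*."""
    return unicodedata.category(ch)[0] in ("L", "N")


def letter_unit_breaks(text: str) -> List[int]:
    """Indexes i where both text[i-1] and text[i] are L or N.

    Under the "two typographic letter units" wording, only these
    positions gain a break opportunity from break-all.
    """
    return [
        i for i in range(1, len(text))
        if is_letter_or_number(text[i - 1]) and is_letter_or_number(text[i])
    ]
```

Both "*" and "&" have general category Po, so under this reading "*****"
gains no break opportunities and "AT&T" gains one only between "A" and
"T".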

> As for allowing author control over the precise set of characters involved in these
> restrictions, the specification already has a note pointing out that future levels
> of the spec may need to introduce finer controls about this, which I think is
> appropriate.

Yeah, I agree, that's what fantasai and I intended. That's a separate topic.

/koji
Received on Thursday, 15 October 2015 10:01:29 UTC