Re: [css3-text] New Working Draft

From: Philippe Verdy <verdy_p@wanadoo.fr>
Date: Thu, 21 Apr 2011 04:58:14 +0200
Message-ID: <BANLkTinybJSHeBepTYjs=MnCF7CPMpe_Qw@mail.gmail.com>
To: fantasai <fantasai.lists@inkedblade.net>
Cc: CE Whitehead <cewcathar@hotmail.com>, public-i18n-core@w3.org, public-i18n-indic@w3.org, public-i18n-cjk@w3.org, unicode@unicode.org
I disagree, because it breaks the inherent nature of the script. Joins
in Arabic are mandatory, and create "super grapheme clusters".

When you say that « it does not consider morphemic, syllabic, or other
boundaries », this is already wrong because it already considers the
default grapheme cluster boundaries. Note that the default grapheme
boundaries were designed only to be locale neutral. But here we are
speaking about localization where the language and its script will
matter, including in its fundamental properties. Joining types in
Arabic are key parts of the script.

But nothing in the preceding part of the specification addresses
joining types, and everything is left to the upper levels, where the
search for language-correct boundaries will fail. After those levels,
there should still be a level related to the script itself
(independently of the language), before trying the last-chance
"emergency" breaks. This intermediate level can still be prioritized,
just as in the previous steps.
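
To make the proposed ordering concrete, here is a minimal sketch of a
tiered break selection. Everything in it is an assumption made for
illustration: the function name choose_break, the tier names, and the
simplification that widths are counted in characters rather than in
rendered advances. It is not taken from the css3-text draft.

```python
def choose_break(tiers, limit):
    """Pick a break position at or before `limit`.

    `tiers` is a list of (name, positions) pairs in priority order,
    e.g. language-level breaks first, then script-level breaks.
    Positions are character offsets (a simplification; a real layout
    engine would measure rendered widths).
    """
    for name, positions in tiers:
        fitting = [p for p in positions if 0 < p <= limit]
        if fitting:
            # Take the latest break that still fits on the line.
            return name, max(fitting)
    # No tier offered a usable break: fall back to an arbitrary one.
    return "emergency", limit


tiers = [
    ("language", []),      # no hyphenation dictionary matched
    ("script", [3, 7]),    # e.g. positions that break no cursive join
]
print(choose_break(tiers, 5))
```

The script-level tier is consulted only when the language-level tier
yields nothing, and the arbitrary break remains the last resort.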

Otherwise, chances are very high that even the expected joining types
will not be rendered with the expected shape, and that other elements
in the now-broken join, i.e. characters that are not starters of
default grapheme clusters, will be rendered incorrectly.

It won't be worse even if it is not strictly a morphemic or syllabic
break. And in most cases, it will produce at least a correct syllabic
break, even without any morphemic or syllabic analysis (because that
step is optional and much more complex to implement). The joining
analysis for Arabic is at least very simple to compute (and fully
standardized for the Arabic script, without any linguistic
knowledge).
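
As a rough illustration of how simple that analysis is, here is a
sketch that lists the positions inside a word where a break would not
cut a cursive join. The JOINING_TYPE table is a small hand-picked
subset written out for this example (the values follow the Unicode
ArabicShaping.txt data file), and the function names are hypothetical.
Python's unicodedata module does not expose joining types, so a real
implementation would load the full data file instead.

```python
import unicodedata

# Hand-picked subset for illustration only.
# D = dual-joining, R = right-joining (joins only to the preceding letter).
JOINING_TYPE = {
    "\u0627": "R",  # ALEF
    "\u0628": "D",  # BEH
    "\u062A": "D",  # TEH
    "\u062F": "R",  # DAL
    "\u0631": "R",  # REH
    "\u0633": "D",  # SEEN
    "\u0643": "D",  # KAF
    "\u0644": "D",  # LAM
    "\u0645": "D",  # MEEM
    "\u0646": "D",  # NOON
    "\u0647": "D",  # HEH
    "\u0648": "R",  # WAW
    "\u064A": "D",  # YEH
}


def joining_type(ch):
    # Combining marks are transparent (T) for joining purposes.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return "T"
    # Anything not in the table is treated as non-joining (U).
    return JOINING_TYPE.get(ch, "U")


def disjoint_break_positions(word):
    """Offsets i such that breaking before word[i] cuts no join."""
    positions = []
    for i in range(1, len(word)):
        after = joining_type(word[i])
        if after == "T":
            continue  # never separate a mark from its base letter
        # Skip back over transparent marks to the previous letter.
        j = i - 1
        while j > 0 and joining_type(word[j]) == "T":
            j -= 1
        before = joining_type(word[j])
        # In logical order, a join exists when the first letter can
        # join forward (D, L, C) and the second can join backward
        # (D, R, C).
        joined = before in ("D", "L", "C") and after in ("D", "R", "C")
        if not joined:
            positions.append(i)
    return positions


# KAF TEH ALEF BEH: only the boundary after ALEF is disjoint.
print(disjoint_break_positions("\u0643\u062A\u0627\u0628"))
```

No dictionary or language tag is needed: the result depends only on
the per-character joining types.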

And yes, even in that case you could still insert the hyphenation
symbol to show that the word was effectively broken (it is common
practice to insert it, even in the Latin script and even if this is
not the preferred syllabic or morphemic break position, which can only
be inferred from language-specific rules and a lookup dictionary for
handling the many exception cases).

The hyphenation symbol is generally very narrow, and if needed, it
can still overflow a bit into the margin. I've never seen any practical
case where it could not be inserted, even in the narrowest columns of
a table, the only exception being when rendering with monospaced
fonts, with minimal column separation not larger than a thin space
(there should always be some minimal gap between columns of text, and
a small compression (kerning, glyph stretching) is still possible when
those characters already contain some inner advance gaps on both
sides, at least for the hyphenation symbol itself).

Note that overflow in the padding area does not cause this hyphen to
be completely invisible, even if the overflow is set to hidden.

The only case where it would not appear is when rendering on a
monospaced grid of a text terminal, where column separation is only
marked by distinct colors, or distinct style attributes (bold, italic,
blinking, underline/overline/overstrike decorations...), and the
column is reduced to only one "character" (more precisely a single
glyph for the complete default grapheme cluster).

The choice of the hyphenation symbol is also a property of the script.
In many East and South-East Asian scripts, there's not even any symbol
for that, because breaks can occur between all grapheme clusters.

Note: in Indic scripts, the danda and double-danda punctuation marks
should be treated like the commas and stops in your spec and preferably
not left alone on the next line, even if they fall within the margin
(you showed cases for East-Asian scripts only: Han, Hiragana, Katakana,
Hangul, Bopomofo, Yi, Mongolian...)

But the same rule could apply equally to other "narrow" punctuation
marks used in Indic or European scripts, such as the colon, semicolon,
exclamation mark, or single quotes (those that do not follow a
non-breaking space). The margin available at the end of the line can
typically accommodate these punctuation marks in emergency situations,
even if this makes the margin slightly unaligned.

Philippe.

2011/4/21 fantasai <fantasai.lists@inkedblade.net>:
> On 04/20/2011 04:47 PM, Philippe Verdy wrote:
>>
>> [css3-text]:
>>
>> "7.2. Emergency Wrapping: the ‘word-wrap’ property
>> [...]
>> break-word
>>   An unbreakable "word" may be broken at an arbitrary point if there
>> are no otherwise-acceptable break points in the line. Shaping
>> characters are still shaped as if the word were not broken, and
>> grapheme clusters must together stay as one unit.[...]"
>>
>> Here I also suggest that contextually shaped characters should not
>> just keep their normal shaping, but the joining types should be taken
>> into account, to avoid breaking between joined character pairs, with a
>> higher precedence for disjoined characters.
>
> Actually, the fact that the join is broken has the advantage of making
> it more clear that this is an improper wrap. It is /better/ to break
> there than at a disjoint boundary.
>
> The purpose of "word-wrap: break-word" is to handle emergency cases,
> where there are no other breakpoints. It does not insert hyphens. It
> does not consider morphemic, syllabic, or other boundaries. It just
> breaks somewhere arbitrary to avoid overflow.
>
> So I disagree with your suggestion and believe the spec is correct
> as it stands.
>
> ~fantasai
>
Received on Thursday, 21 April 2011 03:00:57 GMT