- From: Philippe Verdy <verdy_p@wanadoo.fr>
- Date: Thu, 21 Apr 2011 04:58:14 +0200
- To: fantasai <fantasai.lists@inkedblade.net>
- Cc: CE Whitehead <cewcathar@hotmail.com>, public-i18n-core@w3.org, public-i18n-indic@w3.org, public-i18n-cjk@w3.org, unicode@unicode.org
I disagree, because it breaks the inherent nature of the script. Joins in Arabic are mandatory, and they create "super grapheme clusters". When you say that « it does not consider morphemic, syllabic, or other boundaries », this is already wrong, because it already considers the default grapheme cluster boundaries. Note that the default grapheme cluster boundaries were designed only to be locale-neutral. But here we are speaking about localization, where the language and its script matter, including their fundamental properties.

Joining types are a key part of the Arabic script. But nothing in the earlier parts of the specification mentions them, and everything is left to the upper levels, where trying to find language-correct boundaries will fail. After those levels, there should still be a level related to the script itself (independently of the language), before trying the last-chance "emergency" breaks. This intermediate level can still be prioritized, just as the previous steps are.

Otherwise, chances are very high that the joining characters will not even be rendered with the expected shapes, and that other elements in the now-broken join, i.e. characters that are not starters of default grapheme clusters, will be rendered incorrectly. Breaking at a disjoined boundary won't be worse, even if it is not strictly a morphemic or syllabic break. And in most cases it will produce at least a correct syllabic break, even without any morphemic or syllabic analysis (that step is optional and much more complex to implement). The joining analysis for Arabic is at least very simple to compute (and fully standardized for the Arabic script, without requiring any linguistic knowledge).
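To illustrate how simple such a joining analysis can be, here is a minimal Python sketch. It is only an assumption of how an implementation might look, not anything taken from the specification: the function names are hypothetical, and the RIGHT_JOINING table is a small hand-written subset of the Joining_Type data that Unicode publishes in ArabicShaping.txt. It lists the positions in a word where the preceding letter does not join to the following one, which are the break points I suggest preferring over breaks inside a join.

```python
import unicodedata

# Illustrative subset only (assumption): letters with Joining_Type=R, which
# join to the preceding letter but never to the following one. A real
# implementation would load the full data from ArabicShaping.txt.
RIGHT_JOINING = set("\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648")
NON_JOINING = {"\u0621"}  # ARABIC LETTER HAMZA joins on neither side


def is_transparent(ch):
    # Combining marks (e.g. harakat) do not interrupt a join.
    return unicodedata.category(ch) in ("Mn", "Me")


def is_arabic_letter(ch):
    # Basic Arabic letters only; tatweel (category Lm) is excluded here.
    return "\u0620" <= ch <= "\u064A" and unicodedata.category(ch) == "Lo"


def joins_forward(ch):
    return is_arabic_letter(ch) and ch not in RIGHT_JOINING and ch not in NON_JOINING


def joins_backward(ch):
    return is_arabic_letter(ch) and ch not in NON_JOINING


def disjoint_break_points(word):
    """Return indices i such that breaking before word[i] splits no join."""
    points = []
    prev_base = None
    for i, ch in enumerate(word):
        if is_transparent(ch):
            continue  # never break between a base letter and its marks
        if prev_base is not None and not (joins_forward(prev_base) and joins_backward(ch)):
            points.append(i)
        prev_base = ch
    return points


# kaf, teh, alef, beh: the only disjoined boundary is after the alef.
print(disjoint_break_points("\u0643\u062A\u0627\u0628"))  # [3]
```

In this sketch any character outside the basic Arabic letter range is treated as non-joining, so every boundary next to it is reported. An emergency wrap could try the reported positions first and fall back to an arbitrary grapheme cluster boundary only when the list is empty.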
And yes, even in that case you could still insert the hyphenation symbol to show that the word was effectively broken (it is common practice to insert it, even in the Latin script and even if this is not the preferred syllabic or morphemic break position, which can only be inferred from language-specific rules and a lookup dictionary for handling the many exceptions). The hyphenation symbol is generally very narrow and, if needed, it can still overflow a bit into the margin. I've never seen any practical case where it could not be inserted, even in the narrowest columns of a table, the only exception being rendering with monospaced fonts with a minimal column separation no larger than a thin space (there should always be some minimal gap between columns of text, and a small compression (kerning, glyph stretching) is still possible when those characters already contain some inner advance gaps on both sides, at least for the hyphenation symbol itself). Note that overflow into the padding area does not make this hyphen completely invisible, even if the overflow is set to hidden.

The only case where it would not appear is when rendering on the monospaced grid of a text terminal, where column separation is only marked by distinct colors or distinct style attributes (bold, italic, blinking, underline/overline/overstrike decorations...), and the column is reduced to only one "character" (more precisely, a single glyph for the complete default grapheme cluster). The choice of the hyphenation symbol is also a property of the script. In many East and South-East Asian scripts, there is not even any symbol for that, because breaks can occur between all grapheme clusters.
Note: in Indic scripts, the danda and double danda punctuation marks should be treated like the commas and stops in your spec, and preferably not left alone on the next line, even if they fall within the margin (you showed cases for East Asian scripts only: Han, Hiragana, Katakana, Hangul, Bopomofo, Yi, Mongolian...). But the same rule could just as well apply to other "narrow" punctuation marks used in Indic or European scripts, such as the colon, semicolon, exclamation mark, or single quotes that do not follow a non-breaking space. The available margin at the end of a line can typically fit these punctuation marks in emergency situations, even if this makes the margin slightly unaligned.

Philippe.

2011/4/21 fantasai <fantasai.lists@inkedblade.net>:
> On 04/20/2011 04:47 PM, Philippe Verdy wrote:
>>
>> [css3-text]:
>>
>> "7.2. Emergency Wrapping: the ‘word-wrap’ property
>> [...]
>> break-word
>> An unbreakable "word" may be broken at an arbitrary point if there
>> are no otherwise-acceptable break points in the line. Shaping
>> characters are still shaped as if the word were not broken, and
>> grapheme clusters must together stay as one unit.[...]"
>>
>> Here I also suggest that contextually shaped characters should not
>> just keep their normal shaping, but the joining types should be taken
>> into account, to avoid breaking between joined character pairs, with a
>> higher precedence for disjoined characters.
>
> Actually, the fact that the join is broken has the advantage of making
> it more clear that this is an improper wrap. It is /better/ to break
> there than at a disjoint boundary.
>
> The purpose of "word-wrap: break-word" is to handle emergency cases,
> where there are no other breakpoints. It does not insert hyphens. It
> does not consider morphemic, syllabic, or other boundaries. It just
> breaks somewhere arbitrary to avoid overflow.
>
> So I disagree with your suggestion and believe the spec is correct
> as it stands.
>
> ~fantasai
>
Received on Thursday, 21 April 2011 03:03:12 UTC