[css3-text] text-justify needs to be fundamentally rethought from John Daggett on 2012-12-04 (www-style@w3.org from December 2012)

From: John Daggett <jdaggett@mozilla.com>
Date: Mon, 3 Dec 2012 22:40:15 -0800 (PST)
To: www-style list <www-style@w3.org>
Message-ID: <1058445905.3423478.1354603215595.JavaMail.root@mozilla.com>

The CSS3 Text spec defines a property 'text-justify', used to
determine the style of justification when 'text-align: justify' is
used.  The spec proposes no normative justification algorithm, only
a rough categorization of scripts into groups, plus property values
that assign different "priorities" to expansion opportunities based
on script.

Line breaking and justification behavior in user agents today is
dictated by the script and language.  Western text uses a spacing
model where line breaks and expansion occur at word spaces.  For
Japanese, line breaks occur everywhere within CJK script runs except
where forbidden by explicit rules.  For Thai, line breaking occurs at
syllables, so a dictionary-based approach is needed to determine
word/syllable boundaries.
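
To make the contrast between the spacing model and the CJK model
concrete, here is a minimal Python sketch.  It is an illustration
only, not any user agent's algorithm: the names break_opportunities,
is_cjk and NO_LINE_START are hypothetical, the code point ranges cover
only kana and the basic CJK ideograph block, and the forbidden-start
set is a tiny sample of the real rules.  Thai is left out because it
would need a dictionary.

```python
# Illustrative subset of characters that must not start a line
# (an assumption for this sketch; real rule sets are much larger).
NO_LINE_START = set("、。」）")


def is_cjk(ch: str) -> bool:
    """Rough test: hiragana, katakana, or a basic CJK ideograph."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def break_opportunities(text: str) -> list[int]:
    """Return every index i where a line break is allowed before text[i]."""
    result = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev == " " and cur != " ":
            # Spacing model: break only after a word space.
            result.append(i)
        elif (is_cjk(prev) or is_cjk(cur)) and cur != " " and cur not in NO_LINE_START:
            # CJK model: break anywhere unless an explicit rule forbids it.
            result.append(i)
    return result


print(break_opportunities("hello world"))  # [6]
print(break_opportunities("日本語。"))      # [1, 2]
```

The Latin string yields a single opportunity after the word space,
while the Japanese string yields one between every pair of characters
except before the closing full stop.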

The use of ad-hoc script categories is highly problematic in my
opinion.  The categories are defined non-normatively in Appendix E and
are nowhere near exhaustive enough to tell an implementor what to do
in all cases.

As John Hudson stated last year [1]:

  As it stands, the proposed classification criteria seems
  confused and to be based on an idiosyncratic analysis that
  ends up forcing closely related writing systems into
  different categories; there may be good reason for these
  divisions based on line-breaking needs, but for anyone
  familiar with more typical script analysis the use of
  familiar terms in strange ways is confusing, as are the
  implied groupings. For instance, under the categorisation
  criteria, Devanagari and Bengali would be considered
  'connected scripts', while Gujarati and Oriya would be
  'discrete scripts', despite the fact that all four
  scripts are closely related, have historically been
  analysed as local variants of the same writing system, and
  share important features that are ignored by the proposed
  classification criteria.

  The term cursive is problematic because virtually any
  writing system can and has been written in a cursive form,
  even nominal 'block scripts'. There are plentiful examples
  of cursive Latin script, and in many instances these are
  analysable as being at the same time cursive and discrete,
  since the letters within words retain their discrete
  isolated shapes and are linked by joining strokes that are
  not part of the letter. This in contrast to Arabic, in
  which the joining strokes are part of the letters,
  replacing other strokes that occur in the isolated forms.
  So the distinction between Latin and Arabic is that the
  latter is morphographical, while both may be written in
  cursive styles. [This also raises the issue of the degree
  to which nominal script-level decisions about
  line-breaking and justification can be safely applied to
  particular styles and particular fonts. If a justification
  model permits inter-character spacing adjustment of
  'discrete' scripts, what is the effect on cursive font
  styles?]
  
At the San Diego F2F I noted that the use cases for 'text-justify'
values are very unclear in the spec [2].  An example showing line
breaking with different values was added, but it raises more questions
than it answers.  When is an author going to use one type of
justification versus another?  What differences in justification will
they expect to see?  If this is primarily based on the language
conventions for a particular script, why does this need to be specified
via a property value?  User agents already implement language-specific
behavior, so why does a property value need to be set in addition?

I think property values for 'text-justify' need to address
justification behavior explicitly rather than inferring that via
script category prioritization.  Here again, John Hudson put it best
[1]:

  Again I come back to my previous point: if what the spec
  is trying to address is line-breaking and justification
  behaviour, coming at it from nominal script categorisation
  seems like a basic confusion of categories. We can get
  hung up on all sorts of concepts within grammatology, when
  really we don't need to if we instead start by defining
  line-breaking and justification behaviour types, and then
  look at how these map to individual scripts (with
  appropriate caveats or exceptions re. language, locale,
  style). That makes much more sense to me than starting by
  trying to categorise scripts according to unclear and
  non-discrete criteria and then trying to map these to
  line-breaking and justification behaviours. Start with the
  function.

Since this property as currently defined seems experimental, I would
suggest defining only the behavior for which there's a clear use case
and leaving the others to be defined later:

  text-justify: auto | distribute

where 'auto' is the user agent default behavior and 'distribute' means
expand inter-letter and inter-word spacing equally.
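
As a rough illustration of what "equally" could mean here, the Python
sketch below spreads a line's leftover width over every gap between
adjacent characters.  Everything in it is an assumption made for the
example: the function name extra_space_per_gap, the uniform character
width, and the "inter-word" mode, which merely stands in for what
'auto' might do for Western text and is not a proposed value.

```python
def extra_space_per_gap(text: str, char_width: float,
                        line_width: float, mode: str) -> list[float]:
    """Return the extra space added after each character of `text`.

    "distribute": every gap between adjacent characters grows by the
                  same amount (inter-letter and inter-word alike).
    "inter-word": only the gap following a space character grows
                  (hypothetical stand-in for a Western default).
    """
    extra = [0.0] * len(text)
    slack = line_width - len(text) * char_width
    if slack <= 0 or len(text) < 2:
        return extra  # nothing to distribute

    # Gap i sits after character i; the last character has no gap.
    if mode == "distribute":
        gaps = list(range(len(text) - 1))
    elif mode == "inter-word":
        gaps = [i for i in range(len(text) - 1) if text[i] == " "]
    else:
        raise ValueError(f"unknown mode: {mode}")

    if gaps:
        share = slack / len(gaps)
        for i in gaps:
            extra[i] = share
    return extra


# Five characters of width 10 on a line of width 58 leave 8 units of slack.
print(extra_space_per_gap("ab cd", 10.0, 58.0, "distribute"))  # [2.0, 2.0, 2.0, 2.0, 0.0]
print(extra_space_per_gap("ab cd", 10.0, 58.0, "inter-word"))  # [0.0, 0.0, 8.0, 0.0, 0.0]
```

With 'distribute' the slack is shared by all four gaps; in the
inter-word stand-in the single word space absorbs all of it.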

Regards,

John Daggett

[1] John Hudson on the problems of using script categorization for justification  
http://lists.w3.org/Archives/Public/www-style/2011Apr/0525.html
http://lists.w3.org/Archives/Public/www-style/2011Apr/0518.html
http://lists.w3.org/Archives/Public/www-style/2011Apr/0524.html
http://lists.w3.org/Archives/Public/www-style/2011Apr/0526.html

[2] http://lists.w3.org/Archives/Public/www-style/2012Aug/0897.html
Received on Tuesday, 4 December 2012 06:40:51 UTC