RE: CSS3 Text - Edit suggestions

The rationale for moving hyphenation from a script-specific context to an open application, where it may or may not apply, is that we do not know all of the places where hyphenation might apply. Rather than requiring us to be experts in all scripts, generic wording that allows UAs to implement hyphenation and word breaking to the best of their ability (and hopefully improve over time) will help the specification be more robust.
I believe that a lot of effort has been put into helping UAX 14 be implementable. I would have a higher level of confidence in making the line breaking dependent on UA implementation than calling out that breaking generally occurs "at" punctuation. I don't know how "at punctuation" should be implemented and therefore see ambiguity introduced. I believe we should avoid introducing this ambiguity and encourage the use of standards whose purpose it is to provide the right kind of data.
> In these systems a line can break anywhere
> <em>except</em> between certain character combinations.
Is the plan to list all of the combinations? Or, is there a normative document that can be referenced?
The issue with Tibetan justification is that groups like FLOSS have read the working draft document and are then trying to figure out how to implement it. That is unfortunate because it is not a useful expenditure of the volunteers' time. If one considers wood blocks and wants to emulate them, then it may be beneficial. However, if you leave the information in the spec there will be many people who think that is the norm. Let's discuss this at the F2F meeting.
I second the idea that we talk about hyphenation at the F2F meeting at the end of March.


From: on behalf of fantasai
Sent: Mon 2/19/2007 4:30 PM
To:; 'WWW International'
Subject: Re: CSS3 Text - Edit suggestions

Paul Nelson (ATC) wrote:
> Following are suggestions for changes to wording to Text draft:
> 1.       Section 4. Line Breaking and Word Boundaries

That's a lot of wording changes. I'm going to go through this section
by section to see if I understand the rationale for your changes, ok?

I'm guessing that one issue was that the general flow and organization of
the paragraphs needed improvement.

> In the absence of hyphenation, a line break by and large occurs at
> explicit word boundaries.

Are you moving the "in the absence of hyphenation" out from a script-specific
context so that it applies to all scripts?

It doesn't apply to Chinese/Japanese and similar scripts, because they only
restrict breaking based on punctuation and syllables, not words. Korean
breaks either on syllables and punctuation, like Japanese, or on spaces,
which, IIRC, delimit not words but some other grammatical construct. Thai
and similar scripts break only at word boundaries, but those boundaries
aren't explicit...

> In many writing systems, word boundaries are defined by spaces or punctuation,

I had written
  "In many writing systems, words are always separated by spaces or punctuation."
Is the change simply editorial, to tie in the next phrase about glue, or was
there some other motivation as well?

> with specific rules as to which
> characters glue to the following characters and which characters glue to
> the preceding characters.

Hm, if we get rid of the notion of a line break opportunity being "at"
a punctuation character, then we don't need to introduce this idea of glue...
The rules about breaking before/after a given punctuation mark should be
evident in UAX14. I'll try to reword these sentences, and see if it works.

> More information on determining word boundaries can be found in UTR-24.

UTR 24 is Script Names. Did you mean UAX 14 Line Breaking Properties?
Word boundaries (which are different from line breaking boundaries)
are defined in UAX 29 Text Boundaries.

> UAs should follow that specification to determine line and word boundaries.

Every time UAX 14 comes up, some member of the WG notes that taking UAX
14 literally doesn't work well. Therefore I've been careful to reference
it, but leave that reference non-normative so that implementors can apply
their own judgement to the information it contains.

> In a number of scripts white space is not used to separate words. For
> example, in Chinese, Japanese and Yi, space characters are not used
> between words and, as a general rule, it is permissible to break lines
> between any ideographic characters or syllabic clusters. Scripts like
> Thai, Khmer, and Lao also do not have space characters between words.
> However, unlike Chinese, Japanese and Yi, it is not generally accepted
> to break between syllabic clusters. As a result, a lexical resource is
> required to correctly define word boundaries.

Ok, here's what I've got so far:

   | For most scripts in the absence of hyphenation, a line break only
   | occurs at word boundaries. Many writing systems use spaces or
   | punctuation to explicitly separate words, and line break opportunities
   | can be identified by these characters. Scripts like Thai, Lao, and
   | Khmer, however, do not use spaces or punctuation to separate words.
   | Although the zero width space (U+200B) can be used as an explicit word
   | delimiter in these scripts, this practice is not common. As a result,
   | a lexical resource is needed to correctly identify break points in such
   | texts.
   | In several other writing systems (including Chinese, Japanese, Yi,
   | and often also Korean) line break opportunities are based on syllable
   | boundaries, not words. In these systems a line can break anywhere
   | <em>except</em> between certain character combinations. Additionally
   | the level of strictness in these restrictions can vary with the
   | typesetting style.
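
To make the two break styles in the draft text above concrete, here is a
minimal Python sketch. It is only an illustration: the function names and
the small punctuation sets are assumptions made up for this example, not
data taken from UAX 14, and a real UA would need far more complete rules.
Thai, Lao, and Khmer are deliberately not covered, since they would need
the lexical resource mentioned above.

```python
# Hypothetical, heavily simplified model of two line-breaking styles.
# The character sets are illustrative assumptions, not UAX 14 data.

NO_BREAK_BEFORE = set("、。，．）」』！？")  # closing punctuation
NO_BREAK_AFTER = set("（「『")              # opening punctuation


def space_break_opportunities(text):
    """Indices where a new line may start in space-delimited text."""
    return [i + 1 for i, ch in enumerate(text[:-1])
            if ch == " " and text[i + 1] != " "]


def cjk_break_opportunities(text):
    """Indices i where a break is allowed between text[i-1] and text[i].

    A break is allowed anywhere except before closing punctuation
    or after opening punctuation.
    """
    return [i for i in range(1, len(text))
            if text[i] not in NO_BREAK_BEFORE
            and text[i - 1] not in NO_BREAK_AFTER]
```

For example, space_break_opportunities("a bc d") yields the positions of
"b" and "d", while cjk_break_opportunities refuses to break directly after
an opening bracket or directly before a full stop.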


> 2.       Recommend hyphenation properties from XSL-FO be adopted to add
> additional hyphenation properties that were not heretofore defined.

I'm going to remove hyphenation from this draft and leave it as an issue. As
I said before, this needs some big discussion and brainstorming at the F2F.

I certainly don't want to simply copy XSL's hyphenation properties--their
syntax is horrible. (They have names like 'hyphenation-push-character-count',
which might make sense to a hyphenation engine implementor, but doesn't make
any sense to me and probably wouldn't make any sense to a typical CSS author.)

> 3.       Section 5.1 text-wrap property, unrestricted value definition,
> change last sentence to: "Character shaping is performed on each side of
> the break as if the break had not occurred."

Done. That's much better wording. :)

> 4.       text-align property, left value definition. I believe that the
> text for vertical is not correct, but should be defined as "In vertical
> text, 'left' is interpreted with respect to the left side of the line if
> the baseline was rotated to horizontal."
> 5.       text-align property. right value definition. I believe that the
> text for vertical is not correct, but should be defined as "In vertical
> text, 'right' is interpreted with respect to the right side of the line
> if the baseline was rotated to horizontal."

Hmm, I see what you mean. I'm not sure how that would interact with
any glyph-orientation stuff we add in later, so I've put
   In vertical text, 'left' aligns to the edge of the line box that
   would be the start edge for left-to-right text.
   In vertical text, 'right' aligns to the edge of the line box that
   would be the end edge for left-to-right text.

Does that do what you want?

> 6.       text-justify property - Why not go ahead and remove the tibetan
> value in this draft instead of making a note that it will be removed?

Because the draft does collect useful information about the traditional
justification method (including the fact that inter-word is preferred in
modern typesetting), and I'd like to see that published somewhere rather
than lost in the hidden recesses of the W3C's cvs system.

> 7.       Change after "The exact justification algorithm is
> UA-dependent;" to read "however, CSS provides some general guidelines
> that should be followed when a justification method other than 'auto' is
> specified." This wording fit better with the idea that the UA chooses
> how to do justification.


> 9.       ANSWER: Tamil justifies in the same way as do clustered.
> Actually the scripts under "connected" justify in the same way as
> "clustered". I believe that "connected" as a strategy should be removed.

I've never seen a connected script justified as clustered. Same with Tamil:
I've only seen it using inter-word. Admittedly, though, I don't have much
exposure to Indic publications. Do you have some examples?

Examples of Indic scripts justifying as inter-word:
   (The justification is very obvious under the green heading. It
   only expands spaces, unless there are no spaces in which case
   it falls back to spacing grapheme clusters, just like in English.)
   (The justification is very obvious in the second paragraph, but
   unlike the disconnected script above, there seems to be no
   attempt at spacing between grapheme clusters.)
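
The two behaviours described in these notes can be sketched roughly as
follows. This is a hypothetical Python illustration, not an algorithm from
the draft: the function name and the allow_cluster_fallback flag are
invented here, and individual code points stand in for grapheme clusters,
which is a simplification that would be wrong for real Indic text.

```python
def justification_gaps(line, extra, allow_cluster_fallback=True):
    """Distribute `extra` width over a line.

    Returns (index, added_space) pairs, where index is the position of
    the character after which space is added. Spaces are expanded first;
    if the line has no spaces, space is added between characters only
    when allow_cluster_fallback is True.
    NOTE: code points are used as a stand-in for grapheme clusters.
    """
    spaces = [i for i, ch in enumerate(line) if ch == " "]
    if spaces:
        share = extra / len(spaces)
        return [(i, share) for i in spaces]
    gaps = len(line) - 1
    if not allow_cluster_fallback or gaps <= 0:
        return []  # nothing to expand; the line stays as it is
    share = extra / gaps
    return [(i, share) for i in range(gaps)]
```

Passing allow_cluster_fallback=False models the second case, where a line
without spaces is simply left unexpanded.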

> 10.   Rather than using "flex points" I believe we should use "expansion
> opportunities".

I like that. Is it ok to call them "expansion opportunities", though,
when they can both expand and contract? (Steve, any comments?)

> [NOTE: I can provide more text on this based on documentation that I
> have written on this topic.]

Now I'm curious. What did you have in mind?

> 11.   word-spacing property. This needs to be thought through better.
> There is only one <spacing-limit> value, yet the property is supposed to
> "specify the minimum, maximum and optimal spacing between words."

<spacing-limit> is currently defined as three values:
   # [ normal | <length>  | <percentage> ] {1,3}

(In the previous draft I'd specified it as only one, and applied
the {1,3} multiplier in the individual property definitions. Do
you think I should change it back for clarity?)
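
For illustration only, a value matching that grammar could be checked with
a small Python sketch like the one below. The function name, the regular
expression, and the list of length units are assumptions made for this
example; the sketch only validates one to three components and does not
say which component is the minimum, maximum, or optimum.

```python
import re

# One component: 'normal', a bare zero, or a number with a unit or '%'.
# The unit list is an illustrative assumption, not the full CSS set.
_COMPONENT = re.compile(
    r"^(?:normal|0|[+-]?\d*\.?\d+(?:em|ex|px|pt|pc|in|cm|mm|%))$"
)


def parse_spacing_limit(value):
    """Split a value into 1 to 3 components, validating each one."""
    parts = value.split()
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected between one and three values")
    for part in parts:
        if not _COMPONENT.match(part):
            raise ValueError("not normal, a length, or a percentage: " + part)
    return parts
```

So parse_spacing_limit("0.25em 0 50%") returns the three components as
strings, while a fourth component or an unknown keyword raises ValueError.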

> The NO BREAK SPACE (U+00A0) should not be considered to be a word separating
> character. This is used to glue parts of words together when a white
> space between characters is needed, as used with Persian, Uighur and
> others.

NO BREAK SPACE is also used in European typography to replace a
space character when line breaking needs to be suppressed. It
decomposes to <noBreak> SPACE (U+0020).
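
The decomposition can be checked against the Unicode Character Database,
for instance with Python's standard unicodedata module. This is just a
quick check; the variable name NBSP is chosen for the example.

```python
import unicodedata

NBSP = "\u00a0"

# Compatibility decomposition as recorded in the Unicode Character Database.
decomposition = unicodedata.decomposition(NBSP)

# Compatibility normalization (NFKC) drops the <noBreak> tag and leaves
# an ordinary SPACE (U+0020).
normalized = unicodedata.normalize("NFKC", NBSP)

print(unicodedata.name(NBSP))
print(decomposition)
print(normalized == " ")
```

Note that the non-breaking behaviour is lost under compatibility
normalization, which is one reason the two characters cannot simply be
treated as interchangeable.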

I thought Persian used ZWNJ, not NBSP. In what circumstances would
Persian use NBSP but not want it the same width as a regular space
(regardless of its lexical semantics: we're talking presentation
only here)?

> Furthermore, this property should not apply to Tibetan.

Do you mean it should not apply to the tsek mark, or did you mean
something different?

One problem with removing the tsek mark from that list would be that
one loses control over limiting its justification stretchiness: it
would neither fall under word-spacing's control nor under
letter-spacing's control (because Tibetan justifies at the tsek,
not at every grapheme cluster boundary like Thai). Also, if we remove
it from the list of characters affected by word-spacing, then it needs
to be special-cased in the justification section.

> 12.   letter-spacing property. Same as with word-spacing property. I
> would contend that letter-spacing, if applied, only has one value as
> specified, but that one value does not represent either minimum, maximum
> or optimal spacing. Letter spacing should be applied to the text before
> justification is applied and should be considered to be the minimal
> amount of letter spacing.

That would be incompatible with CSS1 and CSS2, though. They both
require that a non-'normal' value for letter-spacing suppresses any
expansion/contraction due to justification.

Note that sophisticated justification systems like those in Adobe InDesign
require controls for minimum, maximum, and optimal spacing.

> 13.   7.2 text should be changed to read: "UAs may apply letter-spacing
> to cursive scripts. In this case, UAs should extend the space between
> graphemes as appropriate, rather than leaving a gap.

Mmm, extending space to me means leaving a gap. Did you mistype that
sentence, or am I missing something?

> [NOTE: The text as written is not appropriate as UAs may be relying on
> OS level services for text/font rendering. Thus, telling the UA they
> must not apply letter-spacing is not realistic because the UA has no way
> to know some of those types of situations. I believe this is informative
> and wording should be given as such.]

Ok, I see the problem. But we should make it either a SHOULD or a MAY,
so that the behavior is normatively defined and UAs are encouraged to
avoid breaking cursive connections if they can. I've put SHOULD for now:

  | If the UA cannot expand a cursive script without breaking
  | the cursive connections, it should not apply letter-spacing between
  | grapheme clusters of that script at all.

> 14.   The text-indent property should be able to take a negative amount
> for an outdent effect. If the need is to remain within the bounding box
> an alternative could be to have the first line start at the edge of the
> box and the following start edges would be inset the amount of the
> outdent to give the same appearance.

The text-indent property has always been able to take a negative amount
for an outdent effect, but it has also, unfortunately, always been defined
to pull the text out of the bounding box. We can't change that now, hence
the 'hanging' keyword. (There was a lot of discussion on how to do this
in 2001/2002, see
and .)

> 15.   Link for UAX29 points to TR24 instead of TR29

Thanks, fixed.

> General comments:
> 1.       We need to have some discussion on hyphenation impact on
>          word-wrap properties.

If you mean 'word-wrap' specifically, then hyphenation takes precedence
over 'break-word'. (The 'break-word' behavior only allows breaks if
"an otherwise-unbreakable string is too long".)

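As a rough sketch of that precedence (a hypothetical Python illustration,
not the algorithm from the draft), a word that is too long for the line
could be split as follows. The function name and the hyphen_points input
are assumptions for this example, and widths are measured in characters
for simplicity.

```python
def break_long_word(word, hyphen_points, width):
    """Split `word` into pieces of at most `width` characters.

    hyphen_points: indices at which hyphenation is allowed (hypothetical
    input, e.g. from a hyphenation dictionary). A hyphenation break is
    preferred; an arbitrary 'break-word' style break is used only when
    no hyphenation point fits on the line.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    pieces = []
    start = 0
    while len(word) - start > width:
        # Hyphenation points that still leave room for the hyphen itself.
        candidates = [p for p in hyphen_points
                      if start < p <= start + width - 1]
        if candidates:
            cut = max(candidates)
            pieces.append(word[start:cut] + "-")
        else:
            cut = start + width  # emergency break, no hyphen inserted
            pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces
```

With hyphenation points available, the emergency break is never reached;
with an empty hyphen_points list, the word is simply chopped at the line
width.
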
> 2.       Great job. Slowly making progress!

Thanks! And thanks for your comments!


Received on Monday, 19 February 2007 11:54:34 UTC