Re: Feedback on hyphenation properties

On Aug 5, 2010, at 3:02 AM, Håkon Wium Lie wrote:

> Also sprach Simon Fraser:
>> We are not keen on "hyphens" as a property name. This doesn't match
>> other CSS property names which are mostly descriptive. We suggest
>> "hyphenation" or "hyphenate" instead. Most word processing and
>> desktop publishing programs usually refer to this behavior as "hyphenation".
> The CSS WG had a long discussion on this in 2007:

Thanks for that link. I suspected that this must have been discussed, but
I guess it wasn't on this list.

> As you can see, the property used to be called 'hyphenate' but was
> changed to make it different from XSL. I think the new 'hyphens' works
> well -- it's shorter and easier to type.

I don't know much about XSL, but is there a reason to keep the CSS property
names different from terms that XSL uses? Does this cause actual problems in real
use, or is it just to avoid developer confusion?

>> One thing to bear in mind is that if we want a shorthand property
>> in future, we may wish to reserve "hyphenation" or "hyphenate" for the
>> shorthand, and use "hyphenation-mode"/"hyphenate-mode" for the longhand.
>> Another consideration is whether hyphenation should be controlled by
>> a new value for the "word-break" property.
>> The property names "hyphenate-before" and "hyphenate-after" don't convey
>> their purpose very well. The naive reader may assume that they are used
>> to specify characters before/after which splitting is allowed.
>> They are really "keep at least N characters before/after the
>> hyphen", which suggest they should have "min" in their names.
>> Unfortunately no succinct alternatives spring to mind.
> I agree that it's hard to understand what these properties mean unless
> you know what knobs are normally offered for hyphenation. XSL calls them:
>  'hyphenation-push-character-count'
>  'hyphenation-remain-character-count'
> Few people will ever use these knobs, though, and the current names
> fit nicely into the 'hyphenate' family.

Another reason the current names are confusing is that it's easy
to interpret them in reverse. An author seeing "hyphenate-before" may
assume it means that it's OK to split the word N characters *before the
end* (and likewise, for "hyphenate-after", N characters *after the beginning*).
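
To make the intended reading concrete, here is a sketch using the draft
property names. The values, and the "hyphens: auto" setting, are only
assumptions for illustration:

```css
p {
  hyphens: auto;        /* assumed value that turns hyphenation on */
  hyphenate-before: 3;  /* keep at least 3 characters before the hyphen */
  hyphenate-after: 2;   /* keep at least 2 characters after the hyphen */
}
```

With these values a break like "hy-phenation" would be disallowed (only two
characters before the hyphen), while "hyphen-ation" would be fine as far as
these two limits are concerned.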

>> Do we really need both "hyphenate-before" and "hyphenate-after" properties,
>> or would a single "hyphenation-min-fragment-length" property suffice?
> I think we do. The same argument was put forward for 'widows' and
> 'orphans' (another set of properties that are hard to get for the
> naive reader :-), but having both is important in some cases. For
> example, I use them extensively here:

OK, sounds good.

>> "hyphenate-lines" also doesn't convey its purpose very well. It's about
>> the maximum number of consecutive hyphenated lines. It's also odd to
>> have a "no-limit" value, rather than choosing a property name which
>> makes sense with a value of "none".
>> Finally "hyphenate-character" is odd in that the value takes a string,
>> not just a single character.
> I don't believe we have a character type. For the languages I know, it
> makes sense to only use one character. But I'm sure someone will find
> something creative to do with strings.
> Is your request based on constraints in the underlying text
> composition engine?

No, simply an apparent contradiction between the property name and
the allowed values.

>> Hyphenation resources
>> ---------------------
>> We think the "hyphenate-resource" property is problematic for two reasons.
>> First, the dictionary format is unspecified and there is no "type" parameter
>> for the resource, so there's no information the UA can use to determine
>> the format. This is especially problematic if the UA relies on some
>> underlying infrastructure for word breaking, and needs to pass the resource
>> down to this infrastructure.
>> Secondly, simply supplying a list of resources to be checked in order
>> is problematic, because it may result in inappropriate hyphenation.
>> If no hyphenation opportunities are found for a given word in a given
>> language by consulting the first resource, then the algorithm suggests
>> checking the second resource, which may return a hyphenation opportunity.
>> However, it may do so for the wrong language.
> The language issue can be avoided by setting different resources for
> different languages, no?
>  :lang(fr) { hyphenate-resource: url(foo), url(bar) }
>  :lang(en) { hyphenate-resource: url(foobar), url(barfoo) }

Sure, if this is understood by the author. But, as we know, authors don't
really understand these things, and may just string a whole bunch of
dictionaries from different languages together.
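
For example, something like this (the file names are made up):

```css
/* No :lang() selector, so every word is checked against each
   dictionary in turn, whatever its language. */
body {
  hyphenate-resource: url(english.dic), url(french.dic), url(german.dic);
}
```

A word with no hyphenation opportunity in the first dictionary could then
pick up a break from the French or German data instead.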

> The reason for having a comma-separated list is to allow different
> hyphenation resource formats to be supplied.
> The only such format I know is the format used by TeX and OpenOffice:

I don't see any description of the format of the hyphenation dictionary
there. I think if the CSS spec references a dictionary type, we need
to at least specify what the format is, and ideally link to a normative
reference on said format.

It's also unlikely that we'll support this format in WebKit on Mac, since
we rely on an underlying framework for hyphenation, and it has its
own dictionaries supplied with the OS.

>> Finally we think that doing language-sensitive hyphenation is hard
>> because most web content does not have the appropriate "lang" attributes.
>> We'd like to suggest a property that permits language-sensitive hyphenation,
>> namely "hyphenation-locale" (or "hyphenate-locale"), that an author can use
>> to inform the UA about what locale should be used for hyphenation:
>> hyphenation-locale: auto | string
>> where the string is a locale identifier.
>> If not auto, the value would override the language derived from any present
>> "lang" attributes.
> This would remove an incentive to start using the 'lang' attribute. If
> we want to encode such information in CSS (I'm not sure we do)

One reason we see a need for this is that we have to hyphenate content
that lacks "lang" attributes, but for which there is out-of-band data about
the language (e.g. EPUB). Another reason is that we may be able to deduce
something about the language from analysis of the content, and thus need
to propagate the results of that analysis to the hyphenation system somehow.
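
As a rough sketch of how the proposed property might be used (the property
is only our suggestion; the selector and the locale string here are just
examples):

```css
/* Default: derive the language from "lang" attributes as usual. */
body {
  hyphenation-locale: auto;
}

/* Content known to be French from out-of-band metadata, even though
   the markup carries no "lang" attribute. */
.chapter {
  hyphenation-locale: "fr";
}
```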

> it may
> be better to offer a property that can also be used outside of
> hyphenation, no? E.g.:
>  body { locale: 'en' }

This is a reasonable suggestion. Knowledge about the language is also
used for collation (e.g. for "find" algorithms), and for font substitution, so it seems
reasonable to have a property independent of hyphenation for those things as well.


Received on Thursday, 5 August 2010 19:00:13 UTC