Re: Hyphenation

[Aristeu, please fix the settings of your Outlook so that it sends
plain text only, not text and (pseudo-)HTML, less than 80 chars
per line and please use ASCII characters only if possible, e.g. no "smart
quotes" in Windows-specific encoding.]

On Mon, 18 Jan 1999, Aristeu E B da Silva wrote:
[reformatted for readability]

> It is clear at ‘HTML 4.0 Specification’, item ‘9.3.3 Hyphenation’
> and because ‘CSS2 Specification’ doesn’t defines any "Hyphenated"
> attribute, that hyphenation is an author’s concern

I would say that hyphenation, being a presentational thing, is
_basically_ a user agent's concern, but authors may need or
wish to make their contributions. In principle, giving adequate
information about the natural language(s) used, via LANG attributes,
is the most essential thing to do. Although user agents currently
ignore those attributes, they are certainly the way to go.
Additional information from the author might be needed in
special cases, e.g. as hyphenation hints or prohibitions. A user
agent can hardly be expected to analyze e.g. whether "record" is
being used as a noun (to be hyphenated rec-ord) or as a verb
(to be hyphenated re-cord).

Fundamentally, hyphenation, if applied, needs to be done according to
language-specific rules, possibly applying some exceptions indicated
in the document itself in some notation. High-quality software might
apply quite complicated methods which give different weights to
possible hyphenation points (preferring e.g. a division of a compound
word at the compound boundary).
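
As a rough illustration of such weighting, here is a minimal Python sketch.
Everything in it is an assumption made for the example: the BREAKS table,
its weights, and the best_break function are hypothetical, and a real
implementation would derive division points from language-specific rules
rather than from a hand-written list.

```python
# Illustrative data only: break positions are character offsets into the
# word, and a lower weight marks a more preferred division point.
BREAKS = {
    "en": {"bookkeeper": [(4, 1), (8, 3)]},  # book-keeper over bookkeep-er
}


def best_break(word, lang, room):
    """Return the offset at which to divide `word`, or None.

    `room` is the number of characters left on the line, including the
    hyphen that a division adds.
    """
    candidates = [
        (weight, -pos, pos)
        for pos, weight in BREAKS.get(lang, {}).get(word.lower(), [])
        if pos + 1 <= room
    ]
    if not candidates:
        return None
    # Lowest weight wins; among equal weights the later break is taken.
    return min(candidates)[2]
```

With ten characters of room the compound boundary (offset 4) is chosen even
though the later break would also fit. For a language with no entry in the
table the function returns None and the word is left undivided.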

I don't think item 9.3.3 should be read as suggesting that authors
should generally include soft hyphens to indicate possible hyphenation
points. Rather, that they _may_ do so and user agents _may_ use them
in hyphenation.

> It looks fine to me, because -- as an author -- I can tell, not only,
> what should and what should not be hyphenated, but also, tell how this
> hyphenation should be performed. No matter the user agent’s language or
> hypothetic hyphenation algorithm, which should not exist at all.

The user agent's language, in the sense that its _user interface_
may use some natural language (in menus, error messages, help files,
etc) should of course have nothing to do with hyphenation. What
matters is the language used in the _document_.

The idea of prehyphenation -- running the document through a utility
which determines the possible hyphenation points in the text and
includes some hyphenation hints, before putting the document onto
the Web -- would significantly increase document size and transfer time,
especially if soft hyphens are used as entities (and not as raw octets).
Admittedly it would simplify the user agent's task.
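
To put rough numbers on the size argument, the following Python sketch
compares a word without hints, with raw soft hyphen octets, and with &shy;
entities. It assumes an ISO 8859-1 encoded document; the function name and
the sample word are made up for the illustration.

```python
SHY = "\u00ad"  # SOFT HYPHEN, code position 173 in ISO 8859-1


def prehyphenated_sizes(text, hint_count):
    """Return (plain, raw, entity) sizes in octets for ISO 8859-1 text
    carrying `hint_count` hyphenation hints."""
    plain = len(text.encode("latin-1"))
    raw = plain + hint_count * len(SHY.encode("latin-1"))  # 1 octet each
    entity = plain + hint_count * len("&shy;")             # 5 octets each
    return plain, raw, entity


print(prehyphenated_sizes("hyphenation", 3))  # (11, 14, 26)
```

So for a word with three hints, raw octets add about a quarter to its size,
while entity references more than double it.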

In practical terms, the soft hyphen is not supported by browsers,
and using it would result in serious problems in current browsers.
Moreover, I don't think the soft hyphen would be a good solution at all.
It seems obvious to me that using it as a hyphenation hint would not
comply with the _definition_ of soft hyphen in ISO 8859 standards.

> I would do that by inserting the ‘soft hyphen’ character entity
> (decimal 173) everywhere in my paragraphs where I want to allow
> hyphenation to be performed, and not doing so where I want not. I
> already have a software of my own which is able to do it in Portuguese.
> But, unfortunately, both IE4 and Netscape 4.5 does exactly what
> should not be done with decimal 173, that is, show them as plain
> hyphens. By doing so they’re preventing us on using hyphenated texts

I don't expect this to change. From the ISO 8859 viewpoint, a soft hyphen
anywhere but at the end of a line is an anomaly, so any processing
in other contexts can be classified as error recovery. Not displaying
it at all might be more reasonable error recovery, but this would
imply ignoring actual data in a document.

Naturally, an HTML definition might assign special meanings to characters
(just as space characters have special semantics, not to mention
characters like <>&). It could define that a soft hyphen, or a normal
hyphen, or the letter h is to be treated as a hyphenation hint,
not as normal data character. But no HTML specification has _really_
defined a special meaning for the soft hyphen. The older specs were
written so that the soft hyphen _as defined by character set
standards_ was assumed to have the semantics of a "discretionary hyphen".
The HTML 4.0 specification tries to be more explicit, but the current
formulation imposes requirements on "those browsers that interpret soft
hyphens" without requiring that browsers must "interpret" them. So a
conforming browser should go on displaying soft hyphens as hyphens.
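
For comparison, the behaviour that HTML 4.0 asks of browsers which _do_
interpret soft hyphens (show a hyphen only where the line is actually
broken, show nothing otherwise) could be sketched in Python roughly as
follows. The function name and the fixed-width character model are
assumptions made for the illustration; no existing browser is being
described.

```python
SHY = "\u00ad"  # SOFT HYPHEN, code position 173


def break_word(word, room):
    """Split `word` at the last soft hyphen that fits in `room` columns.

    Returns (head, tail). `head` ends with a visible hyphen when a break
    was made; soft hyphens that are not used are dropped from both parts.
    """
    parts = word.split(SHY)
    for i in range(len(parts) - 1, 0, -1):
        head = "".join(parts[:i]) + "-"
        if len(head) <= room:
            return head, "".join(parts[i:])
    # No division fits: the whole word moves on, with no hyphen shown.
    return "", "".join(parts)
```

For example, break_word("hy\u00adphen\u00ada\u00adtion", 7) gives
("hyphen-", "ation"), and with only two columns of room the word comes back
undivided as "hyphenation".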

> In my opinion, hyphenation is not only an important lay-out feature,
> but it’s also ‘an cultural issue’, in Portuguese it’s ‘strange’ when the
> body text isn’t justified and hyphenated.

I agree on the importance of hyphenation. Justifying text is a different
thing -- text justified on both sides very often looks odd when
the window is narrow -- but naturally _if_ text is justified it
should normally be hyphenated to get a decent result, especially
when very long words may occur.

Hyphenation needs to be programmed into browsers. Given the fact that
popular browsers are mammoths which do miscellaneous things with
very little to do with what a Web browser should really do, it would
be just decent to include some basic hyphenation rules into them.

For authors' hyphenation hints or prohibitions, there is really no
single _character_ which could logically be assigned to the job.
It's more like a job for tags. For prohibitions, most browsers
seem to support the <nobr> tag. It should probably be promoted
to a real element, defined as text-level markup in a future HTML
specification. An obvious solution would be to introduce an empty
element, say <hy>, for the purpose, so one could write
rec<hy>ord. Unlike &shy;, this would degrade gracefully on browsers
which do not support it. Moreover, the element might take
an attribute with a numeric value indicating the level of
acceptability of word division at that point, ranging from
a value indicating a most preferred point (such as between
the constituents of a compound word) to a value which suggests
that hyphenation should be applied only if absolutely necessary.

Alternatively, hyphenation hints (and perhaps hyphenation
prohibitions too) could be regarded as purely presentational,
to be handled in style sheets. But I'd say it would be less
practical to write <span id="someid">record</span> and then
a CSS rule for that particular occurrence of the word "record".
And since hyphenation is sometimes related to the _meaning_ of
a word in a natural language, hyphenation hints can be regarded
as part of the structure of a document in a sense. (The same applies
to pronunciation hints/information. One might say that _ideally_
an author should be able to specify, in HTML markup, the _meaning_
of a word like "record", by referring to a dictionary entry in
a specific format, useful both for hyphenation and pronunciation,
as well as automatic analysis of the document for translation or
other purpose; and a user agent might make it an implicit link,
so that the user may request a definition of the word from
the dictionary.)

> What is W3C’s position about it, will this approach be changed
> or should we wait the User Agents to change? Will they?
> Aristeu Escobar Branco da Silva
> São Paulo, Brasil.

Yucca, or

Received on Thursday, 21 January 1999 03:13:18 UTC