Re: Soft hyphen from Martin J. Duerst on 1997-05-13 (www-html@w3.org from May 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Tue, 13 May 1997 12:19:14 +0200 (MET DST)
To: Otto Stolz <Otto.Stolz@uni-konstanz.de>
cc: Multiple Recipients of <unicode@Unicode.ORG>, www-html <www-html@w3.org>, ISO 10646 mailing list <iso10646@listproc.hcf.jhu.edu>
Message-ID: <Pine.SUN.3.96.970513120716.245c-100000@enoshima>

On Tue, 13 May 1997, Otto Stolz wrote:

> On May 12, 10:25, Mark Davis <mark_davis@taligent.com> wrote:
> > You can insert a zero width no-break space, if you want to prevent a
> > word-break at a particular point.
> 
> This is not feasible. You cannot anticipate which weird points an
> arbitrary browser (or some other rendering software) might deem legal
> hyphenating points. To be on the safe side, you would have to insert
> those Z-WNBSPs between any two adjacent letters, thus almost doubling
> the length of your text. 
> 
> Hence, the only feasible solution is:
> - for the sender: mark all preferred hyphenating points,
> - for any browsing, or rendering, software: do not hyphenate within
>   a word but at hyphenating points marked so by the sender.

There are other possibilities. For example, you can language-tag
your text (the discussion is, at least originally, about HTML)
and hope for the receiver to know about hyphenation. You then
only insert a SHY in places where the receiver can't possibly
know (e.g. re-cord vs. rec-ord, or compound words as they
occur in German and Nordic languages and so on). You can also
add hyphenation points in otherwise very long words to help
receivers that don't have a hyphenation engine for the respective
language. The benefit of hyphenation points increases rather
quickly with the length of a word.
A common convention for marking a word that should not be
hyphenated is to prefix it with a SHY.
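
To make the sender/receiver split concrete, here is a minimal sketch
of a receiver that has no hyphenation engine and therefore breaks a
word only at the SHYs the sender put in. The function names and the
one-column-per-character width model are assumptions made for
illustration, not anything taken from HTML or ISO 8859-1.

```python
SHY = "\u00ad"  # SOFT HYPHEN (U+00AD)

def mark(parts):
    """Sender side: join the chosen word parts with soft hyphens."""
    return SHY.join(parts)

def break_word(word, room):
    """Receiver side: split `word` at the last marked point that fits.

    `room` is the number of columns left on the line (assumption: one
    column per character). Returns (head, tail); head is "" when no
    marked point fits, so the whole word moves to the next line.
    """
    parts = word.split(SHY)
    best = 0
    width = 0
    for i, part in enumerate(parts[:-1], start=1):
        width += len(part)
        # +1 for the visible hyphen; a SHY at the very start is skipped
        if width > 0 and width + 1 <= room:
            best = i
    if best == 0:
        return "", word
    head = "".join(parts[:best]) + "-"   # unused SHYs in the head vanish
    tail = SHY.join(parts[best:])        # remaining points stay marked
    return head, tail

noun = mark(["rec", "ord"])
verb = mark(["re", "cord"])
print(break_word(noun, 4))  # ('rec-', 'ord')
print(break_word(verb, 4))  # ('re-', 'cord')
```

Because this sketch never guesses a break of its own, a word that
carries only a leading SHY comes back unbroken.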

> To mark the points of possible line-breaks:
> - for languages that do hyphenation, SHY (U+00AD) seems the only
>   character suitable to mark hyphenation points (in spite of the
>   obfuscating wording in ISO 8859-1);

Indeed. The conclusion from the official description seems to be
that a SHY was only intended to be inserted at the end of the line
when the line break actually occurs. Because it was never supposed
to appear inside a line, using it to denote a potential word
break if it appears inside a line is only an extension of its
use, and not directly against that wording in ISO 8859-1. And
it's of course the most reasonable and usable extension.

> - for languages that don't use spaces as word-boundaries, a Z-WSp
>   (U+200B) seems suitable to mark the word boundaries.
> Opinions?

Yes. But you only need it for languages that indeed need to know
word boundaries to do line breaking (such as Thai). You don't
need it in cases such as Chinese and Japanese, where you can
break the line between virtually all characters, and the
exceptions can easily be determined.
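
The difference can be sketched as two tiny line fillers, one that
breaks only at sender-supplied Z-WSps and one that breaks between any
two characters. Both function names are made up for this example,
widths are counted as one column per code point, and the second
function deliberately ignores the exceptions mentioned above (such as
characters that must not start a line).

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE (U+200B)

def wrap_at_zwsp(text, width):
    """Greedy line filling that breaks only where a ZWSP was inserted."""
    lines, current = [], ""
    for segment in text.split(ZWSP):
        if current and len(current) + len(segment) > width:
            lines.append(current)
            current = segment
        else:
            current += segment
    if current:
        lines.append(current)
    return lines

def wrap_anywhere(text, width):
    """Break between any two characters (exceptions not handled)."""
    return [text[i:i + width] for i in range(0, len(text), width)]

# Thai: two words, boundary marked by the sender with a ZWSP
print(wrap_at_zwsp("ภาษา" + ZWSP + "ไทย", 5))   # ['ภาษา', 'ไทย']
# Japanese: no marks needed, any position is a candidate
print(wrap_anywhere("日本語の文章", 4))           # ['日本語の', '文章']
```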

Regards,	Martin.

Received on Tuesday, 13 May 1997 06:19:46 UTC