Re: Soft hyphen (Re: Cougar comments) (fwd)

Another message forwarded from the Unicode list.

Regards,	Martin.



From: kenw@sybase.com (Kenneth Whistler)
Message-Id: <9705121907.AA04212@birdie.sybase.com>
To: unicode@unicode.org
Subject: Re: Soft hyphen (Re: Cougar comments)
Cc: mduerst@ifi.unizh.ch, kenw@sybase.com, jkorpela@cc.hut.fi, 
    davidp@earthlink.net
X-Sun-Charset: ISO-8859-1


Regarding the issue of the interpretation of the ISO/IEC 8859-1
0xAD SOFT HYPHEN (and its equivalent in Unicode/10646, U+00AD
SOFT HYPHEN), I must say that I believe Martin has got it
correct. The intent was to encode a discretionary hyphen which would
be imaged as a hyphen at a line break, but not elsewhere in a
line of text.

The relevant text from ISO/IEC 8859-1 being debated is:

> > 	A graphic character that is imaged by a graphic symbol identical
> > 	with, or similar to, that representing HYPHEN, for use when
> > 	a line break has been established within a word.

(cited as Section 6.3.3; currently Section 5.3.3 in the DIS of
ISO/IEC 8859-1 under ballot now, worded exactly the same way.)

Before launching into deconstruction of that text out of context,
one should first consider the following facts:

1. ISO/IEC 8859-1 (or any other SC2 character encoding standard) does
not dictate to implementers how they should implement the standard;
instead it sets forth the requirements for conformance to the
standard. I can verify a claim of conformance against the criteria
for conformance in the standard. But if you choose to implement
something differently within those constraints, it is up to you.
(For example, I could implement a text editor that chose to insert
an extra space after every character, or that refused to insert a "w"
into text, replacing it with "uu"--stupid ideas, of course, but
conformant to ISO 8859-1 if the use of encoded character values followed
the standard.)

2. The usage of characters in ISO/IEC 8859-1 (or other SC2 character
encoding standards) arose from a preexisting context of usage.
U+00AD SOFT HYPHEN came out of a context of usage for
XCCS (Xerox Corporate Character Set) 357B/043B discretionary hyphen
and for IBM CDRA SP320000 syllable hyphen. Both of those characters
were distinguished from hyphen (or hyphen/minus) by those preexisting
corporate character standards, were intended for encoding of the
now-you-see-it-now-you-don't hyphen only at line breaks, and informed
the character repertoire chosen for the establishment of ISO/IEC 8859-1,
"Latin 1".

David Perrell commented, in support of Jukka Korpela's contentions:

> Call me indecent, but I disagree. A "soft" hyphen is a visible
> character that is inserted by a text formatter after a line break
> within a word has been established. In other words, when a text
> formatter determines that a word will be broken and the second part
> will begin a new line, the formatter inserts a soft hyphen after the
> first part of the word rather than a "hard" hyphen. If the text is
> later reformatted, the soft hyphen may be easily removed when it no
> longer falls on a line break, whereas the "hard" hyphen is left in the
> text regardless of its position.

This is exactly an example where differences of implementation set in.
A text formatter can behave that way, sure. But many word processors
can, and typically do, choose a different implementation. In addition to
automatic insertion of a hyphen *glyph* at the end of a line broken in
the middle of a word, they may support the user insertion of discretionary
hyphens in words. The discretionary hyphen is typically suppressed in
rendering (unless special edit modes are turned on to display it), but
is retained in text, and most importantly, is visible to the hyphenation
algorithm which uses it to find preferred hyphenation points which may
override what would be introduced by the automatic algorithm. When a
break occurs at the discretionary hyphen, the same hyphen *glyph* is
displayed that would occur for a hard hyphen or an automatic line break.
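
The discretionary-hyphen behavior described above can be sketched in
Python roughly as follows. This is only an illustration under stated
assumptions, not how any particular word processor does it: the names
`visible` and `wrap` are hypothetical, the line breaker is a simple
greedy one, there is no automatic hyphenation algorithm, and "-" stands
in for the hyphen glyph. The stored U+00AD is suppressed everywhere
except where a break is actually taken at it.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN (0xAD in ISO/IEC 8859-1)

def visible(text):
    """Text as rendered away from a line break: soft hyphens suppressed."""
    return text.replace(SOFT_HYPHEN, "")

def wrap(text, width):
    """Greedy wrap that breaks inside a word only at a stored soft hyphen."""
    lines, line = [], ""
    for word in text.split():
        while True:
            sep = " " if line else ""
            if len(line) + len(sep) + len(visible(word)) <= width:
                line += sep + visible(word)
                break
            # The word does not fit. Look for the last soft hyphen whose
            # prefix, plus a hyphen glyph, still fits on the current line.
            parts = word.split(SOFT_HYPHEN)
            cut = 0
            for i in range(1, len(parts)):
                head = "".join(parts[:i]) + "-"
                if len(line) + len(sep) + len(head) <= width:
                    cut = i
            if cut:
                # Break at the discretionary hyphen and image it as "-".
                lines.append(line + sep + "".join(parts[:cut]) + "-")
                line = ""
                word = SOFT_HYPHEN.join(parts[cut:])
            elif line:
                # No usable break point: move the whole word to a new line.
                lines.append(line)
                line = ""
            else:
                # Overlong word with no usable break: emit it unbroken.
                lines.append(visible(word))
                break
    if line:
        lines.append(line)
    return lines
```

For example, `wrap("Here is a soft hy\u00adphen.", 18)` breaks at the
stored soft hyphen and shows a hyphen glyph there, while a width of 40
leaves the word whole and shows no hyphen at all. The character stays in
the stored text in both cases; only the rendering differs.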

Storage in text of that discretionary hyphen is the normal and most
general usage of a U+00AD SOFT HYPHEN (or its kin in other character
encoding standards). For such implementations, the statement in 8859-1
regarding imaging of SOFT HYPHEN should be construed as "*When* an
implementation determines that it will image a SOFT HYPHEN, it ...
is imaged by a graphic symbol identical with, or similar to, that
representing HYPHEN..."

> Here is a soft hy­phen.

O.k.

> Does your mail reader support 8859?

8859-1, yes.

> Did your mail reader ignore the hyphen because it doesn't fall at
> the end of the line?

Of course not. My mail reader is a braindead renderer that also
supplies such constructions as "^O" for uninterpretable characters
it receives in the mailstream. It doesn't rehyphenate or reformat
lines either, so I don't expect it to do anything other than attempt
to display a visible glyph for each character it encounters in 8859-1
data. Hence the visible glyph.

Finally, before taking the 8859-1 text about display of 0xAD as
gospel truth, one should also consider the consequences of making
use of discretionary hyphens in scripts other than Latin. The
phrase "or similar to" should be construed fairly broadly there,
since the introduction of a line break does not always automatically
imply the insertion of a hyphen *glyph* at the end of each line
so broken. The Unicode Standard itself has something to say on
this. See page 6-5:

"U+00AD SOFT HYPHEN indicates a hyphenation point, where a line-
break is preferred when a word is to be hyphenated. Depending on
the script, the visible rendering of this character when a line break
occurs may differ (for example, in some scripts it is rendered as
a *hyphen* -, while in others it may be invisible)."

For 8859-1 taken only in its own context, where the 8859-1 characters
are sufficient only for representation of a subset of the Latin
script, the assertion that the SOFT HYPHEN images as a hyphen is basically
correct. But 8859-1 is now firmly embedded in the context of
ISO/IEC 10646-1, and the usage of SOFT HYPHEN must now be considered
in the universal character set for multiple scripts, most of them
not Latin-based.

--Ken Whistler

Received on Monday, 12 May 1997 16:20:23 UTC