- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Mon, 12 May 1997 21:27:56 +0200 (MET DST)
- To: ISO 10646 mailing list <iso10646@listproc.hcf.jhu.edu>, www-html <www-html@w3.org>
Another message forwarded from the Unicode list.

Regards, Martin.

From: kenw@sybase.com (Kenneth Whistler)
Message-Id: <9705121907.AA04212@birdie.sybase.com>
To: unicode@unicode.org
Subject: Re: Soft hyphen (Re: Cougar comments)
Cc: mduerst@ifi.unizh.ch, kenw@sybase.com, jkorpela@cc.hut.fi, davidp@earthlink.net
X-Sun-Charset: ISO-8859-1

Regarding the issue of the interpretation of the ISO/IEC 8859-1 0xAD SOFT HYPHEN (and its equivalent in Unicode/10646, U+00AD SOFT HYPHEN), I must say that I believe Martin has got it correct. The intent was to encode a discretionary hyphen which would be imaged as a hyphen at a line break, but not elsewhere in a line of text.

The relevant text from ISO/IEC 8859-1 being debated is:

> > A graphic character that is imaged by a graphic symbol identical
> > with, or similar to, that representing HYPHEN, for use when
> > a line break has been established within a word.

(cited as Section 6.3.3; currently Section 5.3.3 in the DIS of ISO/IEC 8859-1 under ballot now, worded exactly the same way.)

Before launching into deconstruction of that text out of context, one should first consider the following facts:

1. ISO/IEC 8859-1 (or any other SC2 character encoding standard) does not dictate to implementers how they should implement the standard; instead it sets forth the requirements for conformance to the standard. I can verify a claim of conformance against the criteria for conformance in the standard. But if you choose to implement something differently within those constraints, it is up to you. (For example, I could implement a text editor that chose to insert an extra space after every character, or that refused to insert a "w" into text, replacing it with "uu"--stupid ideas, of course, but conformant to ISO 8859-1 if the use of encoded character values followed the standard.)

2. The usage of characters in ISO/IEC 8859-1 (or other SC2 character encoding standards) arose from a preexisting context of usage.
U+00AD SOFT HYPHEN came out of a context of usage for XCCS (Xerox Corporate Character Set) 357B/043B discretionary hyphen and for IBM CDRA SP320000 syllable hyphen. Both of those characters were distinguished from hyphen (or hyphen/minus) by those preexisting corporate character standards, were intended for encoding of the now-you-see-it-now-you-don't hyphen only at line breaks, and informed the character repertoire chosen for the establishment of ISO/IEC 8859-1, "Latin 1".

David Perrell commented, in support of Jukka Korpela's contentions:

> Call me indecent, but I disagree. A "soft" hyphen is a visible
> character that is inserted by a text formatter after a line break
> within a word has been established. In other words, when a text
> formatter determines that a word will be broken and the second part
> will begin a new line, the formatter inserts a soft hyphen after the
> first part of the word rather than a "hard" hyphen. If the text is
> later reformatted, the soft hyphen may be easily removed when it no
> longer falls on a line break, whereas the "hard" hyphen is left in the
> text regardless of its position.

This is exactly an example where differences of implementation set in. A text formatter can behave that way, sure. But many word processors can choose a different implementation, and typically do. In addition to automatic insertion of a hyphen *glyph* at the end of a line broken in the middle of a word, they may support the user insertion of discretionary hyphens in words. The discretionary hyphen is typically suppressed in rendering (unless special edit modes are turned on to display it), but is retained in text, and most importantly, is visible to the hyphenation algorithm, which uses it to find preferred hyphenation points that may override what would be introduced by the automatic algorithm. When a break occurs at the discretionary hyphen, the same hyphen *glyph* is displayed that would occur for a hard hyphen or an automatic line break.
Storage in text of that discretionary hyphen is the normal and most general usage of a U+00AD SOFT HYPHEN (or its kin in other character encoding standards). For such implementations, the statement in 8859-1 regarding imaging of SOFT HYPHEN should be construed as "*When* an implementation determines that it will image a SOFT HYPHEN, it ... is imaged by a graphic symbol identical with, or similar to, that representing HYPHEN..."

> Here is a soft hyphen.

O.k.

> Does your mail reader support 8859?

8859-1, yes.

> Did your mail reader ignore the hyphen because it doesn't fall at
> the end of the line?

Of course not. My mail reader is a braindead renderer that also supplies such constructions as "^O" for uninterpretable characters it receives in the mailstream. It doesn't rehyphenate or reformat lines either, so I don't expect it to do other than attempt to display a visible glyph for each character it encounters in 8859-1 data. Hence the visible glyph.

Finally, before taking the 8859-1 text about display of 0xAD as gospel truth, one should also consider the consequences of making use of discretionary hyphens in scripts other than Latin. The phrase "or similar to" should be construed fairly broadly there, since the introduction of a line break does not always automatically imply the insertion of a hyphen *glyph* at the end of each line so broken.

The Unicode Standard itself has something to say on this. See page 6-5:

"U+00AD SOFT HYPHEN indicates a hyphenation point, where a line-break is preferred when a word is to be hyphenated. Depending on the script, the visible rendering of this character when a line break occurs may differ (for example, in some scripts it is rendered as a *hyphen* -, while in others it may be invisible)."

For 8859-1 taken only in its own context, where the 8859-1 characters are sufficient only for representation of a subset of the Latin script, the assertion that the SOFT HYPHEN images as a hyphen is basically correct.
But 8859-1 is now firmly embedded in the context of ISO/IEC 10646-1, and the usage of SOFT HYPHEN must now be considered in the universal character set for multiple scripts, most of them not Latin-based.

--Ken Whistler
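
The two behaviors contrasted in the message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular word processor or mail reader is implemented: the names wrap, render_naive and break_glyph are hypothetical, line filling is a simple greedy one, and there is no automatic hyphenation dictionary. The break_glyph parameter stands in for the script-dependent rendering the Unicode Standard mentions (a hyphen in some scripts, nothing visible in others).

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, the same code point as ISO/IEC 8859-1 0xAD


def render_naive(text):
    """Model of a renderer that never reformats lines: it simply shows
    a visible hyphen glyph for every SOFT HYPHEN it encounters."""
    return text.replace(SOFT_HYPHEN, "-")


def wrap(text, width, break_glyph="-"):
    """Greedy line filling that treats SOFT HYPHEN as a discretionary
    break point: it stays in the stored text, is suppressed inside a
    line, and is imaged as break_glyph only where a line is broken."""
    lines = []
    current = ""
    for word in text.split():
        # Fragments of the word between its discretionary break points.
        pending = [part for part in word.split(SOFT_HYPHEN) if part]
        while pending:
            prefix = current + " " if current else ""
            whole = "".join(pending)  # soft hyphens suppressed
            if len(prefix + whole) <= width:
                current = prefix + whole
                break
            # Longest leading run of fragments that fits with the glyph.
            fitted = 0
            for k in range(len(pending) - 1, 0, -1):
                head = "".join(pending[:k]) + break_glyph
                if len(prefix + head) <= width:
                    fitted = k
                    break
            if fitted:
                lines.append(prefix + "".join(pending[:fitted]) + break_glyph)
                current = ""
                pending = pending[fitted:]
            elif current:
                # Nothing fits after the current text: start a new line.
                lines.append(current)
                current = ""
            elif len(pending) > 1:
                # Even the first fragment is too long: emit it overlong.
                lines.append(pending[0] + break_glyph)
                pending = pending[1:]
            else:
                # A single unbreakable fragment wider than the line.
                current = pending[0]
                pending = []
    if current:
        lines.append(current)
    return lines


sample = "The discre\u00adtion\u00adary hyphen stays in\u00advis\u00adible"
for line in wrap(sample, 14):
    print(line)
```

With a wide enough line, wrap(sample, 80) yields the text with no hyphen shown at all, while render_naive(sample) shows every stored soft hyphen, which corresponds to the mail reader behavior described above. Passing break_glyph="" models a script in which the break is not marked by any visible glyph.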
Received on Monday, 12 May 1997 16:20:23 UTC