Re: m-dashes from Foteos Macrides on 1998-02-09 (w3c-wai-ig@w3.org from January to March 1998)

From: Foteos Macrides <MACRIDES@SCI.WFBR.EDU>
Date: Mon, 09 Feb 1998 14:11:00 -0500 (EST)
To: DPawson@rnib.org.uk
Cc: w3c-wai-ig@w3.org
Message-id: <01ITDN12B4HE00CDIW@SCI.WFBR.EDU>

"Pawson, David" <DPawson@rnib.org.uk> wrote:
>> to follow up on what Charles said:
>> 
>> > Please refer me to exactly what needs to be corrected in the
>> > next version of Internet Explorer.  Thanks,
>> 
>> There is also an issue with the programs that originate, HTML, as
>> opposed to interpreting in.  That is to say, don't represent an
>> &mdash; as &#151, etc, but use the SGML entity names or ISO
>> character numbers for them.
>> 
>> Al Gilman
>	[Pawson, David]  
>
>	Surely the simple need is for IE4.xx and netscape to implement
>ISO latin 1?  [I.e. be capable of displaying the correct glyphs for
>each entity in the set].  In deference to those out there who only live
>in MSDOS, perhaps this should be a switchable option?
>
>	My logic says that the html generator programs will follow the
>browsers fairly rapidly. I.e. the html editor software programs. Is
>that reasonable?
>
>	My only real concern is that the single ISO latin 1 is only one
>of a number needed for true internationalisation. A Unicode shift would
>give a real move forward, permitting a wider use of the other entity sets.

	The codepages 850 for DOS and 1252 for Windows adequately (IMHO)
encompass the Latin 1 (iso-8859-1) character set.  The problem has two
aspects:

	(1) The HTML editor software programs are not respecting that the
	    values of numeric character references are for the HTML Document
	    Character Set, which is iso-10646 (essentially, Unicode) as of
	    HTML 4.0, and iso-8859-1 (a subset of iso-10646) in previous HTML
	    specs, and are generating numeric character references in the
	    range reserved for control characters -- disallowed for HTML --
	    but corresponding to intended characters such as fancy dashes
	    and quotation marks in the Windows codepage;

	(2) that the browsers are treating these character references
	    in that range as references to the corresponding characters
	    in the Windows codepage, rather than as disallowed values
	    in the HTML Document Character Set.

Both aspects of the problem need to be addressed simultaneously, or you
are likely to find yourself in the position of hoping that the tail can
wag the dog.
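
	As a rough illustration of aspect (1), here is a minimal Python
sketch of how an authoring tool or checker might flag such references.
The function names are hypothetical, and the ranges assume the usual
reading of the HTML SGML declaration: values 0-31 (other than tab, line
feed and carriage return) and 127-159 are not usable characters.

```python
import re

# Matches decimal (&#151;) and hex (&#x97;) numeric character references.
# The trailing semicolon is optional because it is often omitted in practice.
NCR = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?")

# Control characters that are permitted: tab, line feed, carriage return.
ALLOWED_CONTROLS = {9, 10, 13}


def is_disallowed(value):
    """Return True if value lies in a control range (assumed disallowed)."""
    if value in ALLOWED_CONTROLS:
        return False
    return value < 32 or 127 <= value <= 159


def find_invalid_references(html):
    """Return (reference_text, value) pairs for each disallowed reference."""
    found = []
    for match in NCR.finditer(html):
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if is_disallowed(value):
            found.append((match.group(0), value))
    return found


if __name__ == "__main__":
    sample = "a &#151; b &#x2014; c &#146 d &#233;"
    for text, value in find_invalid_references(sample):
        print(text, value)
```

Only &#151; and &#146 are reported for the sample; &#x2014; and &#233;
are valid references and pass through unflagged.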

	The HTML 4.0 specs now include named character references for
all of the characters which are presently being handled via invalid
numeric character references (except smiling face :).  Here is a list
of the invalid numeric character references being encountered on today's
Web, and their correct numeric (in hex notation) and named references:


        Conversions of invalid numeric (Microsoft codepage)
        character references to valid Unicode numeric or named
        character references (names as in HTML 4.0).

INVALID     Numeric   Named             Character
-------     -------- -------   -----------------------------------------
&#1;    ->  &#x263a; (none)    WHITE SMILING FACE
&#130;  ->  &#x201a; &sbquo;   SINGLE LOW-9 QUOTATION MARK
&#132;  ->  &#x201e; &bdquo;   DOUBLE LOW-9 QUOTATION MARK
&#133;  ->  &#x2026; &hellip;  HORIZONTAL ELLIPSIS
&#134;  ->  &#x2020; &dagger;  DAGGER
&#135;  ->  &#x2021; &Dagger;  DOUBLE DAGGER
&#137;  ->  &#x2030; &permil;  PER MILLE SIGN
&#139;  ->  &#x2039; &lsaquo;  SINGLE LEFT-POINTING ANGLE QUOTATION MARK
&#145;  ->  &#x2018; &lsquo;   LEFT SINGLE QUOTATION MARK
&#146;  ->  &#x2019; &rsquo;   RIGHT SINGLE QUOTATION MARK
&#147;  ->  &#x201c; &ldquo;   LEFT DOUBLE QUOTATION MARK
&#148;  ->  &#x201d; &rdquo;   RIGHT DOUBLE QUOTATION MARK
&#149;  ->  &#x2022; &bull;    BULLET
&#150;  ->  &#x2013; &ndash;   EN DASH
&#151;  ->  &#x2014; &mdash;   EM DASH
&#152;  ->  &#x02dc; &tilde;   SMALL TILDE
&#153;  ->  &#x2122; &trade;   TRADE MARK SIGN
&#155;  ->  &#x203a; &rsaquo;  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
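
	The table above can be applied mechanically.  What follows is a
minimal Python sketch, not a description of how Lynx implements it: the
names CONVERSIONS and fix_references are hypothetical, the mapping holds
only the rows listed above, and only decimal references are handled
(the semicolon is treated as optional, since it is often omitted).

```python
import re

# Invalid value -> (Unicode code point, HTML 4.0 entity name or None),
# transcribed from the table above.
CONVERSIONS = {
    1:   (0x263A, None),
    130: (0x201A, "sbquo"),
    132: (0x201E, "bdquo"),
    133: (0x2026, "hellip"),
    134: (0x2020, "dagger"),
    135: (0x2021, "Dagger"),
    137: (0x2030, "permil"),
    139: (0x2039, "lsaquo"),
    145: (0x2018, "lsquo"),
    146: (0x2019, "rsquo"),
    147: (0x201C, "ldquo"),
    148: (0x201D, "rdquo"),
    149: (0x2022, "bull"),
    150: (0x2013, "ndash"),
    151: (0x2014, "mdash"),
    152: (0x02DC, "tilde"),
    153: (0x2122, "trade"),
    155: (0x203A, "rsaquo"),
}

# Decimal numeric character references, with or without the semicolon.
DECIMAL_NCR = re.compile(r"&#([0-9]+);?")


def fix_references(html, prefer_named=True):
    """Rewrite the invalid references listed in CONVERSIONS.

    References not in the table are left exactly as they were found.
    """
    def replace(match):
        value = int(match.group(1))
        if value not in CONVERSIONS:
            return match.group(0)
        code_point, name = CONVERSIONS[value]
        if prefer_named and name:
            return "&%s;" % name
        return "&#x%04x;" % code_point

    return DECIMAL_NCR.sub(replace, html)


if __name__ == "__main__":
    print(fix_references("&#145;quoted&#146 &#151; &#1; &#233;"))
```

Passing prefer_named=False emits the hex numeric form instead, which
may be the safer choice for browsers that predate the HTML 4.0 names.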


	As I noted in a previous message, for accessibility reasons
Lynx 2.7.2 is performing the above conversions of invalid numeric
character references, but that's a catch-22.  We're now seeing
messages like this from people who rely on the empirical behavior
of browsers instead of understanding and complying with the standards
for interoperability:

	"Well gee, &#145; and &#146; get me the quotation marks in Lynx,
	 and its developers are fussy about standards, so that must
	 be OK."

Sigh... &#x263a;

	Note that a number of smiley and frowny characters are available
in iso-10646, and I hope the named entities for them are added to the HTML
specs soon, because the present situation on the Web is a serious threat
to world peace. :)

				Fote

=========================================================================
 Foteos Macrides            Worcester Foundation for Biomedical Research
 MACRIDES@SCI.WFBR.EDU         222 Maple Avenue, Shrewsbury, MA 01545
=========================================================================
Received on Monday, 9 February 1998 14:15:06 UTC