Re: HTML - i18n / NCR & charsets from Misha Wolf on 1996-11-27 (www-international@w3.org from October to December 1996)

From: Misha Wolf <MISHA.WOLF@reuters.com>
Date: Wed, 27 Nov 1996 12:39:28 -0500 (EST)
To: www-html <www-html@w3.org>, www-international <www-international@w3.org>, Unicode <unicode@unicode.org>
Message-Id: <6028391227111996/A33076/RE6/11ABDB271B00*@MHS>

We have three representations of a character in HTML:
(a)  raw octets
(b)  numeric character references
(c)  entity names.

Numeric character references are, of course, supposed to refer to Unicode/
ISO 10646.

The charset, whether specified via HTTP or HTML or a menu, should affect 
the interpretation of (a).  It should *not* affect the interpretation of 
(b) or (c).  The major browsers were broken in this regard and are being 
gradually fixed.
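
To illustrate the rule above, here is a minimal sketch in Python. It is not how any particular browser is implemented; the function name `render` and the sample octets are assumptions made for the example. The same octet stream is decoded under two different charsets: the raw octet changes meaning, while the numeric character reference and the entity name do not.

```python
import html

def render(octets: bytes, charset: str) -> str:
    """Hypothetical two-step rendering: charset first, references second."""
    # Step 1: the charset governs only how raw octets become characters.
    text = octets.decode(charset)
    # Step 2: numeric character references and entity names are resolved
    # against Unicode/ISO 10646, independent of the charset used in step 1.
    return html.unescape(text)

# One raw octet (0xC1), one numeric character reference, one entity name.
page = b"\xc1 &#1041; &eacute;"

as_koi8 = render(page, "koi8_r")    # 0xC1 is read as U+0430
as_cp1251 = render(page, "cp1251")  # 0xC1 is read as U+0411

# Only the first character differs between the two results.
print(as_koi8)
print(as_cp1251)
```

In both results `&#1041;` yields U+0411 and `&eacute;` yields U+00E9; only the character produced by the raw octet depends on the charset.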

An example of a "cheesy little editor" that created lots of polluted Web 
pages was FrontPage 1.0.  Though Microsoft sold it as suitable only for 
Code Page 1252, lots of people used it on other Code Pages.  FP 1.0 simply 
exports stuff as if it were CP 1252, hence a Russian Web page ends up full 
of Latin 1 entity names!  FP 2.0 (aka 97) has, I believe, fixed this.
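
The failure mode described above can be reproduced with a short Python sketch. This is an assumed reconstruction of the behaviour, not FrontPage's actual code; the function name `export_as_cp1252` and the sample word are hypothetical, and Code Page 1251 is assumed as the author's real code page.

```python
from html.entities import codepoint2name

def export_as_cp1252(octets: bytes) -> str:
    """Hypothetical exporter that assumes every octet is CP 1252."""
    # Misinterpretation: the octets are treated as CP 1252 whatever they are.
    # (Python raises an error for the few octets CP 1252 leaves undefined.)
    text = octets.decode("cp1252")
    out = []
    for ch in text:
        cp = ord(ch)
        # Non-ASCII characters are written out as entity names where one exists.
        if cp > 127 and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            out.append(ch)
    return "".join(out)

# A Russian word as stored on a machine using Code Page 1251 (assumed).
russian_octets = "Привет".encode("cp1251")

polluted = export_as_cp1252(russian_octets)
print(polluted)
```

Every Cyrillic letter comes out as a Latin 1 entity name, so the page no longer contains any Russian text, whatever charset the reader selects.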

The various Internet Assistants did the same foul thing.  I hope they've 
been fixed.

The pages created using these tools will presumably (?) get fixed when 
their authors pass them through the new versions of the tools.  Can anyone 
confirm/deny this?
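
Whether the new tools do this is the open question above. For what it is worth, a repair is possible in principle when the original code page is known. The sketch below is an assumption about how such a fix could work, not a description of any shipping tool; `repair` and its parameter are hypothetical names.

```python
import html

def repair(polluted: str, original_code_page: str) -> str:
    """Hypothetical fix-up for text wrongly exported as CP 1252 entities."""
    # Resolve the entity names back to the Latin 1 characters they denote.
    latin = html.unescape(polluted)
    # Recover the original octets by re-encoding as CP 1252.
    # This fails if the text contains characters outside CP 1252.
    octets = latin.encode("cp1252")
    # Reinterpret those octets in the code page the author actually used.
    return octets.decode(original_code_page)

fixed = repair("&Iuml;&eth;&egrave;&acirc;&aring;&ograve;", "cp1251")
print(fixed)
```

Note that this would corrupt a page whose entity names were intended, so it cannot safely be applied blindly.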

Misha

Received on Wednesday, 27 November 1996 08:02:16 UTC