Re: Problem in publishing multilingual HTML document on web in UTF-8 encoding from David Woolley on 2006-06-05 (www-html@w3.org from June 2006)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Mon, 5 Jun 2006 23:08:41 +0100 (BST)
To: www-html@w3.org
Message-Id: <200606052208.k55M8ft01187@djwhome.demon.co.uk>

> What has been in IE has been there for years...when the computing world
> was based on code pages and system locales instead of Unicode. Actually,
> that has only been some 5-7 years ago. 

HTML wasn't.  Its internal code page was ISO 8859/1 and that was also
the default code page for HTTP.  The problems were:

1) browsers actually treated both as being the recipient's platform's
   code page, so you got totally bogus entities, like &#x9a;, because
   browsers actually used Windows-1252.

2) ISO 8859/1 is USA and Western European chauvinistic, so there was
   no way for people in the rest of the world to create valid web pages -
   even specifying gb2312 in the HTTP header didn't remove the fact that
   you couldn't represent Chinese in the HTML internal character set
   (one result was that people actually used two numeric entities to
   represent one character!).
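
Both problems can be illustrated with a short Python sketch.  This is
only an illustration under my own assumptions: the variable names and
the sample character are mine, and it shows what the bytes and code
points look like, not what any particular browser did internally.

```python
# Point 1: &#x9a; refers to code point U+009A, which in ISO 8859/1 (and
# ISO 10646) is a C1 control character.  Authors who wrote it were really
# thinking of byte 0x9A in Windows-1252, which is a printable letter.
raw = bytes([0x9A])
as_latin1 = raw.decode("latin-1")   # U+009A, a control character
as_cp1252 = raw.decode("cp1252")    # U+0161, LATIN SMALL LETTER S WITH CARON
print(f"ISO 8859/1:   U+{ord(as_latin1):04X}")
print(f"Windows-1252: U+{ord(as_cp1252):04X}")

# Point 2: one Chinese character is two bytes in gb2312.  Writing each
# byte as its own numeric entity gives two references to Latin-1
# characters, not one reference to the Chinese character.
sample = "中"                        # sample character, chosen for illustration
gb_bytes = sample.encode("gb2312")
two_entities = "".join(f"&#{b};" for b in gb_bytes)
proper_entity = f"&#{ord(sample)};"  # the ISO 10646 code point, valid in HTML 4
print(two_entities)                  # &#214;&#208;
print(proper_entity)                 # &#20013;
```

Read according to the standard, the two-entity form is the two Latin-1
characters with those code points, which is why such pages were never
valid even when they displayed as intended.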

HTML 4 extends the internal character set to ISO 10646 and makes
specifying the transfer character set a SHOULD (or is it a MUST), but
browsers still have to cope with legacy pages.  Character sets, though,
are rather technical for ordinary users.  Using UTF-8 for everything
would avoid the choice, but it bloats pages, although it is the default
for true XHTML (not the Appendix C stuff that started this thread).
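
To put rough numbers on the bloat, here is a small Python sketch that
compares encoded sizes.  The sample strings and the choice of legacy
encoding for each are my own assumptions, and markup, which is plain
ASCII and costs the same either way, is left out.

```python
# Compare the byte length of the same text in a legacy encoding and in UTF-8.
# Each entry is (sample text, legacy encoding assumed for that language).
samples = {
    "English": ("Hello, world", "latin-1"),
    "French": ("déjà vu", "latin-1"),
    "Chinese": ("中文网页", "gb2312"),
}

def encoded_size(text, encoding):
    """Return the number of bytes needed to encode text."""
    return len(text.encode(encoding))

sizes = {}
for language, (text, legacy) in samples.items():
    sizes[language] = (encoded_size(text, legacy), encoded_size(text, "utf-8"))
    print(language, legacy, sizes[language][0], "utf-8", sizes[language][1])
```

ASCII text is the same size in both, accented Latin letters go from one
byte to two, and the gb2312 characters go from two bytes to three.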

So, the original situation was that there was an explicit default, but
it was inadequate, and the current situation is that character set
should always be specified.
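
As a rough sketch of what "specified" means at the HTTP level, the
following Python checks whether a Content-Type header value carries a
charset parameter.  The function name declared_charset is hypothetical,
and it leans on the standard library's email header parser, which
handles this parameter syntax.

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset named in a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when there is no charset parameter.
    return msg.get_content_charset()

explicit = declared_charset("text/html; charset=UTF-8")
legacy = declared_charset("text/html; charset=gb2312")
missing = declared_charset("text/html")
print(explicit, legacy, missing)
```

The last case is the one in dispute in this thread: nothing is
declared, so the recipient has to guess.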

> Based on users needing to view pages and an ability to control the
> quality of pages that a page author may generate, the best solution for
> customers is help them view the page...even if the author or tool did
> not put in the character set used.

I thought this was supposed to be one of the main reasons why the vast
majority of HTML is bad.  Authors author for the intended result on 
the current version of IE, not to the standards.

> used (hopefully defaulting to UTF-8) and then to educate authors who are
> generating content to check that their pages are written correctly.

All attempts to educate people to even use validator.w3.org have 
essentially failed.  It is generally only amateurs who produce valid
HTML.

Received on Monday, 5 June 2006 22:13:02 UTC