Re: non-sgml characters from David Woolley on 2002-07-16 (w3c-wai-ig@w3.org from July to September 2002)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Tue, 16 Jul 2002 22:44:47 +0100 (BST)
To: w3c-wai-ig@w3.org
Message-Id: <200207162144.g6GLil301753@djwhome.demon.co.uk>

> Any tool that allows us to convert big5 formatted text to UTF-8 text? 

Effectively any modern browser does this (except that the result is
probably UTF-16, rather than UTF-8).  The main thing to remember is to do
what the standards have required since HTTP 1.0, but which is very often
forgotten: identify the character set with the page.
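
If you want the conversion done outside a browser, a minimal sketch in
Python follows.  It assumes the input really is Big5 and relies on the
standard library's "big5" codec; the function name big5_to_utf8 is only
illustrative.

```python
def big5_to_utf8(data: bytes) -> bytes:
    """Decode Big5 bytes to text, then re-encode that text as UTF-8."""
    # decode() raises UnicodeDecodeError if the bytes are not valid Big5,
    # which is a useful signal that the declared character set is wrong.
    return data.decode("big5").encode("utf-8")


# Build a small Big5 sample in memory instead of reading a file.
sample = "中文".encode("big5")
converted = big5_to_utf8(sample)
print(converted.decode("utf-8"))
```

The same two steps apply to a whole file: read it in binary mode, pass
the bytes through the function, and write the result in binary mode.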

For all browsers, you can do this using the charset parameter in the
real HTTP Content-Type header.  The default for text/html is iso-8859-1;
however, current best practice, enforced by the W3C's validator, is
never to let it default.
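
As a rough illustration of setting the charset parameter on the real
header, here is a small WSGI application sketch in Python.  The name
app and the sample page body are assumptions made for the example; the
point is only that the header names the encoding the bytes are actually
in.

```python
def app(environ, start_response):
    """Minimal WSGI app that declares the page's character set explicitly."""
    # The body is encoded as Big5, so the header must say charset=big5.
    body = "<p>中文</p>".encode("big5")
    headers = [
        ("Content-Type", "text/html; charset=big5"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application can be passed to
wsgiref.simple_server.make_server from the standard library.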

You can use the charset parameter for any text/ format.  For post-HTML 4.0
browsers, you can also use meta elements to include a copy of that header;
the real HTTP header takes precedence, if it specifies a character set.
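
The precedence rule can be sketched in Python as below.  This is a
simplified illustration, not how any particular browser implements it:
the names charset_param, MetaContentTypeFinder and effective_charset are
made up for the example, and the page is scanned as ASCII on the
assumption that the meta element appears in an ASCII-compatible part of
the document.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_param(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


class MetaContentTypeFinder(HTMLParser):
    """Remember the content of the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.content_type = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content_type is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            self.content_type = attr.get("content")


def effective_charset(http_content_type, html_bytes):
    """Prefer the real HTTP header; fall back to the meta element."""
    if http_content_type:
        charset = charset_param(http_content_type)
        if charset:
            return charset
    finder = MetaContentTypeFinder()
    finder.feed(html_bytes.decode("ascii", errors="replace"))
    if finder.content_type:
        return charset_param(finder.content_type)
    return None  # nothing declared; the caller has to decide what to do


page = b'<meta http-equiv="Content-Type" content="text/html; charset=big5">'
print(effective_charset("text/html; charset=utf-8", page))
print(effective_charset("text/html", page))
```

In the first call the HTTP header names a character set, so the meta
element is ignored; in the second the header has no charset parameter,
so the meta element's value is used.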

It has become common practice to treat no character set as meaning the
character set of the country in which the page was authored, but this
is wrong; it results in Japanese being displayed as gibberish European
accented characters, or in the browser having to use heuristics based on
character frequency to guess what was really meant.
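
The failure is easy to reproduce.  The Python sketch below assumes a
page authored in EUC-JP (chosen only as an example of a Japanese
encoding) that a reader decodes with the iso-8859-1 default because no
character set was declared.

```python
text = "日本語"

# What the author's software actually wrote: EUC-JP bytes.
raw = text.encode("euc_jp")

# What a reader sees if it falls back to iso-8859-1.  Every byte maps to
# some Latin-1 character, so no error is raised; the result is just wrong.
misread = raw.decode("iso-8859-1")

# What the reader sees when the character set is identified correctly.
correct = raw.decode("euc_jp")

print(repr(misread))
print(correct)
```

Because iso-8859-1 assigns a character to every byte value, the wrong
decoding never fails outright, which is why the problem shows up as
gibberish rather than as an error.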

It is possible that some very old browsers react inappropriately to
this.  These browsers were probably never intended for use outside
the US market, but may have been adapted by bolt-on software that
re-interprets the characters as CJK ones.

Received on Tuesday, 16 July 2002 17:48:49 UTC