Re: Japanese encoding? from Erik van der Poel on 2000-02-20 (www-international@w3.org from January to March 2000)

From: Erik van der Poel <erik@netscape.com>
Date: Sun, 20 Feb 2000 11:09:46 -0800
To: Shoshannah Forbes <xslf@xslf.com>
CC: www-international@w3.org, drorit@nana.co.il
Message-ID: <38B03BFA.570A8C8@netscape.com>

Shoshannah Forbes wrote:
> 
> I was asked about Japanese web page encoding- where can I find
> information about what encodings are available, what are the
> differences etc?

Olin gave a link to a very old document by Ken Lunde. There is a much
newer document from the same author, a book called CJKV Information
Processing:

  http://www.oreilly.com/catalog/cjkvinfo/

However, that may be too much info for you. Here is a brief summary:
There are 3 main character encodings for Japanese: Shift-JIS, EUC-JP and
ISO-2022-JP. Shift-JIS is the encoding used in Japanese Windows and
MacOS systems. Unix systems have traditionally used EUC-JP, but recent
Unixes support Shift-JIS too. ISO-2022-JP was originally designed to be
used on the Net (email and netnews).
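
To make the difference between the three encodings concrete, here is a
minimal Python sketch that encodes one string in each of them. The codec
names (shift_jis, euc_jp, iso2022_jp) are Python's own labels, and the
sample word and the helper name encode_all are arbitrary choices made
for this illustration.

```python
# Encode the same Japanese text in the three encodings discussed above.
SAMPLE = "\u65e5\u672c\u8a9e"  # the word "nihongo" (Japanese), chosen arbitrarily

CODECS = ("shift_jis", "euc_jp", "iso2022_jp")  # Python codec names


def encode_all(text):
    """Return a dict mapping each codec name to the encoded bytes."""
    return {name: text.encode(name) for name in CODECS}


if __name__ == "__main__":
    for name, data in encode_all(SAMPLE).items():
        # Show the byte length and a hex dump for each encoding.
        print(name, len(data), "bytes:", data.hex())
```

Shift-JIS and EUC-JP both use bytes with the high bit set, while
ISO-2022-JP stays within 7 bits and switches character sets with escape
sequences, which is why it suits email and netnews.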

On the Web, it doesn't really matter which one you use. The popular
browsers (IE and Nav) support "Japanese auto-detect", and this is the
default setting in the localized versions of the browsers. If you have
lots of Japanese text in your document, the auto-detect will probably
work reliably. Still, it is always better to label your documents
correctly, especially if the document contains little Japanese text,
which can cause the auto-detect to fail.

The best way is to use the HTTP Content-Type header:

  Content-Type: text/html; charset=shift_jis

The name "shift_jis" was implemented more recently than "x-sjis", which
was used from Navigator 1.1 onwards. So if you want to guarantee that it
works even in old browsers, "x-sjis" is better, but very few people use
such old browsers. Similarly, the old name for EUC-JP is "x-euc-jp",
while the new name is "euc-jp". The name "iso-2022-jp" has been in use
since the beginning (Nav 1.1, early 1995).
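
As a rough sketch of sending that header from a server, the Python below
builds the Content-Type value and uses it in a tiny WSGI application.
The helper name content_type, the LEGACY_LABELS table and the page body
are all hypothetical; the only facts taken from the text above are the
old and new charset labels.

```python
# Hypothetical helper for building the Content-Type header value.
# Maps the newer charset labels to the older "x-" labels named above.
LEGACY_LABELS = {"shift_jis": "x-sjis", "euc-jp": "x-euc-jp"}


def content_type(charset, legacy=False):
    """Return a text/html Content-Type value carrying the charset label."""
    label = charset.lower()
    if legacy:
        # iso-2022-jp has no older label, so it falls through unchanged.
        label = LEGACY_LABELS.get(label, label)
    return "text/html; charset=" + label


def app(environ, start_response):
    """Minimal WSGI app that serves one Shift-JIS page with a labeled charset."""
    body = "<p>\u65e5\u672c\u8a9e</p>".encode("shift_jis")
    headers = [
        ("Content-Type", content_type("shift_jis")),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The important point is that the label in the header and the codec used
to produce the bytes have to agree; the legacy flag only changes the
label, not the bytes.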

If you are not able to alter the HTTP Content-Type header, that would be
very unfortunate, and I suggest that you try very hard to alter it. Let
me know if you have difficulty with this. I want to stamp out such
problems on the Net.

If you really can't add the HTTP charset, you can use the HTML META
charset fallback:

  <meta http-equiv="Content-Type" content="text/html; charset=euc-jp">
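
For illustration, this is roughly how a client could pick up that META
fallback before it knows the encoding. It is a simplified sketch, not
how any particular browser does it: the function name
sniff_meta_charset, the regular expression and the 1024-byte limit are
assumptions. It relies on the META tag itself being plain ASCII, so it
can be searched as raw bytes.

```python
import re

# Looks for charset=<label> inside a <meta ...> tag, in raw bytes.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=1024):
    """Return the lowercased charset label from a META tag, or None.

    Only the first `limit` bytes are searched (an arbitrary cutoff).
    """
    match = META_CHARSET.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


if __name__ == "__main__":
    page = (b'<html><head><meta http-equiv="Content-Type" '
            b'content="text/html; charset=euc-jp"></head></html>')
    print(sniff_meta_charset(page))
```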

Alternatively, if you believe that the master of Web content, Yahoo, has
perfected this, you might like to take a look at their method. Try
typing www.yahoo.co.jp into my HTTP/HTML source viewer:

  http://webtools.mozilla.org/web-sniffer/

You will notice that they use the EUC-JP code "\xFD\xFE" near the
beginning of their Japanese documents. This causes the browsers'
auto-detect routines to realize that the page is in EUC-JP, since that
byte sequence does not occur in Shift-JIS or ISO-2022-JP. They may
have decided on this approach because some Netscape versions had bugs in
the META charset handling that caused documents to be drawn twice even
if they were drawn correctly the first time. Most Japanese users have
their charset menu set to Japanese Auto-Detect (or Universal Auto-Detect
in MSIE5), so Yahoo Japan works for them.
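
To see why a marker like "\xFD\xFE" settles the question, here is a
deliberately simplified Python sketch that checks only whether a byte
string fits the basic byte layout of each encoding. It is not the
detection algorithm of any browser: the function names are made up, the
EUC-JP check ignores the 0x8E and 0x8F prefixes, and the Shift-JIS check
assumes the plain lead/trail byte ranges without vendor-specific
single-byte extras.

```python
# Simplified structural checks, NOT a real detector: byte ranges only.

def fits_iso_2022_jp(data):
    """ISO-2022-JP is 7-bit, so any byte >= 0x80 rules it out."""
    return all(b < 0x80 for b in data)


def fits_euc_jp(data):
    """Two-byte EUC-JP only: lead and trail bytes both in 0xA1-0xFE."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            i += 2
        else:
            return False
    return True


def fits_shift_jis(data):
    """Lead 0x81-0x9F or 0xE0-0xFC, trail 0x40-0x7E or 0x80-0xFC."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80 or 0xA1 <= b <= 0xDF:  # ASCII or half-width katakana
            i += 1
        elif ((0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC)
              and i + 1 < len(data)
              and (0x40 <= data[i + 1] <= 0x7E or 0x80 <= data[i + 1] <= 0xFC)):
            i += 2
        else:
            return False
    return True


def candidates(data):
    """List the encodings whose byte layout the data does not violate."""
    checks = (
        ("iso-2022-jp", fits_iso_2022_jp),
        ("shift_jis", fits_shift_jis),
        ("euc-jp", fits_euc_jp),
    )
    return [name for name, check in checks if check(data)]


if __name__ == "__main__":
    print(candidates(b"\xa4\xa2"))  # fits both Shift-JIS and EUC-JP layouts
    print(candidates(b"\xfd\xfe"))  # fits only the EUC-JP layout
```

Under these assumptions a short run such as "\xA4\xA2" is ambiguous (one
EUC-JP character, or two half-width katakana in Shift-JIS), whereas
0xFD is not a Shift-JIS lead byte at all, so the marker leaves EUC-JP as
the only candidate.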

There are 3 main character sets represented in the encodings: ASCII, JIS
X 0201 and JIS X 0208. All of these are adequately covered by Shift-JIS,
EUC-JP and ISO-2022-JP. However, if you need access to the supplementary
set of Japanese characters, JIS X 0212, then you need EUC-JP (or
ISO-2022-JP, though strictly speaking JIS X 0212 is not part of the
ISO-2022-JP definition).
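
A quick way to check this coverage is to try encoding a JIS X 0212
character with each codec. The sketch below uses Python's codecs, whose
behaviour may not match every browser or platform; the sample kanji
(U+4E02, assumed here to be in JIS X 0212 but not in JIS X 0208) and the
helper name can_encode are choices made for this example. The
iso2022_jp_1 codec is Python's name for the ISO-2022-JP-1 variant that
adds JIS X 0212.

```python
# Assumed sample: a kanji in JIS X 0212 that is absent from JIS X 0208.
KANJI_0212 = "\u4e02"


def can_encode(text, codec):
    """Return True if `text` can be encoded with the given Python codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for codec in ("shift_jis", "euc_jp", "iso2022_jp", "iso2022_jp_1"):
        print(codec, can_encode(KANJI_0212, codec))
    # In EUC-JP, JIS X 0212 characters take three bytes with a 0x8F prefix.
    print(KANJI_0212.encode("euc_jp").hex())
```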

Of course, the other major encodings for such extended Japanese use are
UTF-8 and the other Unicode-based charsets. However, there are known
bugs in Netscape's handling of UTF-8 for Japanese, so this is not
advisable. Hopefully, Mozilla will correct this problem, and users will
upgrade to better browsers in the future.

Erik
Received on Sunday, 20 February 2000 14:12:13 UTC