- From: Erik van der Poel <erik@netscape.com>
- Date: Sun, 20 Feb 2000 11:09:46 -0800
- To: Shoshannah Forbes <xslf@xslf.com>
- CC: www-international@w3.org, drorit@nana.co.il
Shoshannah Forbes wrote:
>
> I was asked about Japanese web page encoding- where can I find
> information about what encodings are available, what are the
> differences etc?

Olin gave a link to a very old document by Ken Lunde. There is a much newer document from the same author, a book called CJKV Information Processing:

  http://www.oreilly.com/catalog/cjkvinfo/

However, that may be too much info for you. Here is a brief summary:

There are 3 main character encodings for Japanese: Shift-JIS, EUC-JP and ISO-2022-JP. Shift-JIS is the encoding used in Japanese Windows and MacOS systems. Unix systems have traditionally used EUC-JP, but recent Unixes support Shift-JIS too. ISO-2022-JP was originally designed to be used on the Net (email and netnews).

On the Web, it doesn't really matter which one you use. The popular browsers (IE and Nav) support "Japanese auto-detect", and this is the default setting in the localized versions of the browsers. If you have lots of Japanese text in your document, the auto-detect will probably work reliably, but it is always better to label your documents correctly, especially if you have little Japanese text in the document, causing the auto-detect to fail.

The best way is to use the HTTP Content-Type header:

  Content-Type: text/html; charset=shift_jis

The name "shift_jis" was implemented more recently than "x-sjis", which was used from Navigator 1.1 onwards. So if you want to guarantee that it works even in old browsers, "x-sjis" is better, but very few people use such old browsers. Similarly, the old name for EUC-JP is "x-euc-jp", while the new name is "euc-jp". The name "iso-2022-jp" has been in use since the beginning (Nav 1.1, early 1995).

If you are not able to alter the HTTP Content-Type header, that would be very unfortunate, and I suggest that you try very hard to alter it. Let me know if you have difficulty with this. I want to stamp out such problems on the Net.

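To make the labeling advice above concrete, here is a minimal Python sketch. It encodes one short Japanese string in each of the three encodings and pairs each result with the matching Content-Type header. The names SAMPLE, CHARSETS, content_type_header and encode_all are made up for this illustration; the codec names are the ones Python's standard library uses, which differ slightly from the charset labels sent over HTTP.

```python
# Illustrative sketch only; the names below are hypothetical helpers.
SAMPLE = "日本語"  # "Japanese language", three kanji from JIS X 0208

# Python codec name -> charset label as written in the Content-Type header.
CHARSETS = {
    "shift_jis": "shift_jis",
    "euc_jp": "euc-jp",
    "iso2022_jp": "iso-2022-jp",
}


def content_type_header(charset: str) -> str:
    """Build the HTTP header line that labels an HTML document's charset."""
    return f"Content-Type: text/html; charset={charset}"


def encode_all(text: str) -> dict:
    """Encode the same text in all three encodings, keyed by charset label."""
    return {label: text.encode(codec) for codec, label in CHARSETS.items()}


if __name__ == "__main__":
    for label, data in encode_all(SAMPLE).items():
        # The same three characters become different byte sequences,
        # which is why the label (or a good guess) is needed to decode them.
        print(content_type_header(label))
        print("  bytes:", data.hex())
```

The byte sequences differ for the same text, so a browser that is given the bytes without a label has to guess which of the three it is looking at.
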
If you really can't add the HTTP charset, you can use the HTML META charset fallback:

  <meta http-equiv="Content-Type" content="text/html; charset=euc-jp">

Alternatively, if you believe that the master of Web content, Yahoo, has perfected this, you might like to take a look at their method. Try typing www.yahoo.co.jp into my HTTP/HTML source viewer:

  http://webtools.mozilla.org/web-sniffer/

You will notice that they use the EUC-JP code "\xFD\xFE" near the beginning of their Japanese documents. This causes the browsers' auto-detect routines to realize that the page is in EUC-JP, since that character code does not occur in Shift-JIS and ISO-2022-JP. They may have decided on this approach because some Netscape versions had bugs in the META charset handling that caused documents to be drawn twice even if it was correct the first time. Most Japanese users have their charset menu set to Japanese Auto-Detect (or Universal Auto-Detect in MSIE5), so Yahoo Japan works for them.

There are 3 main character sets represented in the encodings: ASCII, JIS X 0201 and JIS X 0208. All of these are adequately covered by Shift-JIS, EUC-JP and ISO-2022-JP. However, if you need access to the supplementary set of Japanese characters, JIS X 0212, then you need EUC-JP (or ISO-2022-JP, though JIS X 0212 is not strictly allowed here).

Of course, the other major encodings for such extended Japanese use are UTF-8 and the other Unicode-based charsets. However, there are known bugs in Netscape's handling of UTF-8 for Japanese, so this is not advisable. Hopefully, Mozilla will correct this problem, and users will upgrade to better browsers in the future.

Erik
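
The "\xFD\xFE" point above can be checked with a rough structural test. The Python sketch below is an assumption-laden simplification, not how any browser's auto-detect is actually implemented: it only asks whether a byte sequence fits the byte ranges each encoding allows (Shift-JIS lead bytes are taken as 0x81-0x9F and 0xE0-0xFC, which includes vendor extension rows). All function names are hypothetical.

```python
# Simplified structural checks; real detectors also weigh character
# frequencies and handle escape sequences properly.

def possible_in_iso_2022_jp(data: bytes) -> bool:
    """ISO-2022-JP is a 7-bit encoding, so no byte may have the high bit set."""
    return all(b < 0x80 for b in data)


def possible_in_shift_jis(data: bytes) -> bool:
    """True if every byte fits Shift-JIS single-byte or lead/trail ranges."""
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F or 0xA1 <= b <= 0xDF:
            i += 1  # ASCII/JIS-Roman or half-width katakana
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            if i + 1 >= len(data):
                return False  # lead byte with no trail byte
            t = data[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC):
                return False
            i += 2
        else:
            return False  # 0x80, 0xA0, 0xFD-0xFF never start a character
    return True


def possible_in_euc_jp(data: bytes) -> bool:
    """True if every byte fits the EUC-JP byte ranges."""
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            need = []                                  # ASCII
        elif b == 0x8E:
            need = [(0xA1, 0xDF)]                      # half-width katakana
        elif b == 0x8F:
            need = [(0xA1, 0xFE), (0xA1, 0xFE)]        # JIS X 0212
        elif 0xA1 <= b <= 0xFE:
            need = [(0xA1, 0xFE)]                      # JIS X 0208
        else:
            return False
        rest = data[i + 1 : i + 1 + len(need)]
        if len(rest) != len(need):
            return False
        if any(not (lo <= x <= hi) for x, (lo, hi) in zip(rest, need)):
            return False
        i += 1 + len(need)
    return True


MARKER = b"\xfd\xfe"  # the byte pair described in the message above

if __name__ == "__main__":
    print("EUC-JP:     ", possible_in_euc_jp(MARKER))
    print("Shift-JIS:  ", possible_in_shift_jis(MARKER))
    print("ISO-2022-JP:", possible_in_iso_2022_jp(MARKER))
```

Under these assumptions the marker fits only EUC-JP. By contrast, a short EUC-JP sequence such as the four bytes C6 FC CB DC also passes the Shift-JIS check, which illustrates why auto-detect can fail on pages with little Japanese text.
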
Received on Sunday, 20 February 2000 14:12:13 UTC