- From: NARUSE, Yui <naruse@airemix.jp>
- Date: Sun, 27 Sep 2009 00:03:55 +0900
Anne van Kesteren wrote:
> On Sun, 30 Aug 2009 03:47:34 +0200, Ian Hickson <ian at hixie.ch> wrote:
>> I've backed off UTS22. I think we need the IANA list updated, though, to
>> include the aliases browsers support. I understand you are working on
>> this? I would like to remove the table in the HTML5 spec that defines
>> such mappings, once that is done.
>
> Part of the alias table is apparently incorrect. I will be working on
> registering the required aliases though, yes, once some more research is
> complete. This will however not solve at least the following two problems:
>
> * Some encodings need to be decoded (and encoded) using another
>   encoding. (The other table HTML5 contains.)
> * The standards for encodings do not always match the required
>   implementation of the encoding. Apparently just like with anything else
>   encoding standards do not match reality.
>
> (Initially it also seemed to be a problem to register encodings with an
> "x-" prefix, but I think we're past that now, though of course we can't
> be sure until it actually succeeds.)

As far as I know, all major Japanese encodings have this problem,
and some other encodings have it as well.

As you know, IE's Shift_JIS implementation is actually Windows-31J,
and the other major web browsers follow it.
http://www.microsoft.com/globaldev/reference/dbcs/932.htm

NOTE: In IANA Charsets, the 7-bit area is defined as JIS X0201:1997.
But actual Windows-31J/CP932 maps its 0x5C to U+005C, and Japanese
Windows fonts use a yen sign glyph for U+005C.
The same problem applies to tilde and overline.

You may know EUC-JP, another major Japanese encoding.
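As a rough illustration of the Shift_JIS versus Windows-31J gap
described above, here is a minimal Python sketch. It assumes that
Python's standard "shift_jis" and "cp932" codecs are acceptable
stand-ins for the JIS-based definition and for Windows-31J; they are
not what any browser uses, and the helper name try_decode is
hypothetical.

```python
from typing import Optional

# Sample byte sequences chosen for illustration only.
NEC_CIRCLED_ONE = b"\x87\x40"  # NEC special character (row 13) in CP932
WAVE_DASH_BYTES = b"\x81\x60"  # JIS X 0208 row 1, cell 33


def try_decode(data: bytes, codec: str) -> Optional[str]:
    """Decode data with the named codec; return None if it cannot be decoded."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for label, data in (("NEC row 13", NEC_CIRCLED_ONE),
                        ("wave dash", WAVE_DASH_BYTES)):
        for codec in ("shift_jis", "cp932"):
            result = try_decode(data, codec)
            shown = "undecodable" if result is None else f"U+{ord(result):04X}"
            print(f"{label:10} {codec:9} -> {shown}")
```

With these codecs the same bytes either fail to decode or map to
different code points, so a page labelled Shift_JIS that was authored
on Windows cannot be read reliably by a decoder that follows only the
registered definition.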
For EUC-JP, IANA Charsets defines the following:

  code set 0: US-ASCII (a single 7-bit byte set)
  code set 1: JIS X0208-1990 (a double 8-bit byte set)
              restricted to A0-FF in both bytes
  code set 2: Half Width Katakana (a single 7-bit byte set)
              requiring SS2 as the character prefix
  code set 3: JIS X0212-1990 (a double 7-bit byte set)
              restricted to A0-FF in both bytes
              requiring SS3 as the character prefix

But IE's EUC-JP implementation, called CP51932, is:
http://reddog.s35.xrea.com/wiki/cp51932.enc.html

  code set 0: US-ASCII (a single 7-bit byte set)
  code set 1: JIS X0208-1990 (a double 8-bit byte set),
              NEC special characters (Row 13),
              NEC selection of IBM extensions (Rows 89 to 92),
              and IBM extensions (Rows 115 to 119)
              restricted to A0-FF in both bytes
  code set 2: Half Width Katakana (a single 7-bit byte set)
              requiring SS2 as the character prefix
  code set 3: not supported

Mozilla's current implementation is a mixed encoding of CP51932 and
JIS X 0212. (In bug 5184 of Bugzilla-jp, they are moving to CP51932.)
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=4873 (in Japanese)
http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=5184 (in Japanese)

Chrome is the same as Mozilla.
http://code.google.com/p/chromium/issues/detail?id=3094

WebKit/Safari is of course almost the same as Chrome, but it does a
strange replacement.
https://bugs.webkit.org/show_bug.cgi?id=24906
http://code.google.com/p/chromium/issues/detail?id=9696
Chrome doesn't.

I think HTML5's EUC-JP should be CP51932.

ISO-2022-JP (CP50220/CP50221/CP50222) has the same problem.

IANA Charsets defines Big5, but it doesn't say what "Big5" actually is.
IE's Big5 is CP950.
Mozilla uses its own table: its decoding is a mix of the CP950,
Big5-2003 and UAO tables, and its encoding is CP950.
https://bugzilla.mozilla.org/show_bug.cgi?id=310299
http://moztw.org/docs/big5/

-- 
NARUSE, Yui <naruse at airemix.jp>
Received on Saturday, 26 September 2009 08:03:55 UTC