- From: Øistein E. Andersen <liszt@coq.no>
- Date: Tue, 14 Apr 2009 01:14:25 +0100
This e-mail is an attempt to give a relatively concise yet reasonably complete overview of non-Unicode character sets and encodings for `Chinese characters', excluding those which are not supported by at least one of the four browsers IE, Safari, Firefox and Opera (henceforth `all browsers'), and tentatively avoiding technical details which are out of scope for HTML5 unless they are important to gain a general understanding of the relevant issues.

To avoid unnecessary confusion, the following three concepts are kept distinct:

1) Character set: A collection of characters, typically defined as a matrix with 94 rows and 94 columns. (A character set with more than one matrix is said to have multiple planes.) The ones officially registered `for use with escape sequences' (typically in ISO-2022 encodings, see below) can be found at <http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm>.

2) Encoding: Defines how a given character (typically identified by its row and column numbers) from a given character set can be encoded as a sequence of bytes. All the encodings discussed below allow multiple character sets to be encoded. [ISO-2022 encodings use only 7-bit bytes and employ escape sequences to switch between different character sets. EUC encodings use bytes < 128 for ASCII (or something similar) and bytes >= 128 to encode other character sets.]

3) MIME charset string: This is the string used, e.g., in an HTTP Content-Type header to indicate the *encoding*. Many of these can be found at <http://www.iana.org/assignments/character-sets>.

Some information about browser support for specific character sets, encodings and MIME charset strings can be found at <http://coq.no/character-tables/mime/iso-2022/en>, <http://coq.no/character-tables/mime/euc/en> and <http://coq.no/character-tables/mime/locale-specific/en>.

The notation a < b means that a is a proper subset of b; a and b can be either character sets or encodings.
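To make the distinction between character set and encoding concrete, here is a minimal sketch (not taken from any of the specifications) of how one row/column position in a 94x94 character set becomes bytes in the two families described above. The function names are hypothetical; the offsets 0x20 and 0xA0 are the usual ones for the 7-bit (ISO-2022) and 8-bit (EUC) representations of the main character set.

```python
def _check(row: int, col: int) -> None:
    # Rows and columns of a 94x94 character set are numbered 1..94.
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")


def to_7bit(row: int, col: int) -> bytes:
    """Byte pair used in an ISO-2022 stream once the set has been selected."""
    _check(row, col)
    return bytes([0x20 + row, 0x20 + col])


def to_euc(row: int, col: int) -> bytes:
    """Byte pair used in an EUC encoding (both bytes >= 128)."""
    _check(row, col)
    return bytes([0xA0 + row, 0xA0 + col])


# GB 2312-80, row 16, column 1 (the first Chinese character in the set).
print(to_7bit(16, 1))                   # two printable ASCII bytes
print(to_euc(16, 1).hex())              # two bytes with the high bit set
print(to_euc(16, 1).decode("gb2312"))   # Python's "gb2312" codec is EUC-CN
```

The same row/column pair thus yields different bytes depending on the encoding, which is why a string such as GB2312 naming only the character set is ambiguous.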
******************************************
* What should HTML 5 say about all this? *
******************************************

This section gives a summary of superset encodings which are either universally supported or potentially needed for compatibility. (Anyone who is going to read the entire e-mail will probably prefer to read the sections *Chinese*, *Japanese* and *Korean* at this point and return to this section afterwards.)

Superset encodings (stricto sensu)
----------------------------------

HTML5 currently contains a table of encoding aliases, of which the following involve Chinese characters:

1) EUC-KR -> Windows-949
2) GB2312 -> GBK
3) GB_2312-80 -> GBK
4) KS_C_5601-1987 -> Windows-949
5) x-x-big5 -> Big5

EUC-KR < Windows-949, and all browsers do 1), so this is reasonable and probably needed.

GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80, which can be expressed not only in EUC-CN encoding, but also in ISO-2022-CN encoding and HZ encoding. GBK, on the other hand, is an encoding. EUC-CN < GBK. It would be more correct to remove 2) and 3) and instead add:

  EUC-CN -> GBK

Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and registered MIME charset strings include GB2312 and GB_2312-80 as distinct entries (but not EUC-CN), so a note to this effect might be appropriate. (Additionally, GBK is slightly ambiguous, so make sure not to reference an incomplete or outdated version without pointing out necessary amendments/additions.)

Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or `KS_C_5601-1987', which Ken Lunde characterises as `incorrect and dangerous' in his book /CJKV Information Processing/. It would be more correct to remove 4). Unlike EUC-CN, EUC-KR is a registered MIME charset string, but KS_C_5601-1987 has a distinct entry, so a note might again be appropriate.
As for 5), the MIME charset string x-x-big5 does indeed correspond to Big5 encoding (or rather an extension thereof) in all browsers but Opera. There are a large number of unregistered charset strings, however, and the other mappings in this table are between encodings. Unless x-x-big5 is actually supposed to refer to an encoding distinct from Big5, 5) should be removed. Instead (depending on the reference ultimately given for Big5), it may be necessary to note that at least certain ETen extensions should be regarded as part of Big5.

In addition, Shift_JIS < Windows-31J, and all browsers implement this mapping, so the following should be added:

  Shift_JIS -> Windows-31J

Further superset encodings (probably not needed)
------------------------------------------------

ISO-2022-CN < ISO-2022-CN-EXT

This is reasonable, but probably not necessary: Firefox does it, Safari does not, Opera does not implement the superset, IE does not even implement the subset. Distinguishing between them is pointless.

EUC-CN < GBK < GB18030

The first step is probably sufficient, and the second is potentially problematic if an incompatible extension to GBK were to be invented.

ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022-JP-2004

No browser attempts to distinguish between these, which would be completely meaningless. On the other hand, IE only implements ISO-2022-JP, and only Firefox implements ISO-2022-JP-2, so these may not actually be necessary.

Shift_JIS_X0213-2000 < Shift_JIS-2004

Safari arguably does this, and there is no need to make a distinction between them, but no browser seems to implement either in a meaningful way at the moment.

Superset *character sets* (universally recognised)
--------------------------------------------------

JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997

Whenever one of the subsets is referred to in any variety of ISO-2022-JP, the superset is used instead.
JIS X 0208-1990/1997 should be understood as including NEC and IBM extensions. This character set is part of all varieties of ISO-2022-JP, as well as EUC-JP and Shift-JIS.

KS X 1001:1992 < KS X 1001:1998 < KS X 1001:2002

Only three characters have been added in total. All but Safari have implemented the two characters added in 1998. This character set is part of ISO-2022-KR, EUC-KR and Johab.

Other additions to ISO-2022 encodings (potentially essential)
-------------------------------------------------------------

All varieties of ISO-2022-JP must include the Katakana character set, which was not officially added to the standard until ISO-2022-JP-3. The escape sequence for Swedish should be accepted as a synonym for JIS-Roman. (IE furthermore allows katakana to be selected using shift-in/out.) All these extensions were originally defined in the older JIS encoding, which predates ISO-2022-JP.

8-bit bytes in 7-bit encodings
------------------------------

IE interprets 8-bit bytes (i.e., octets with the high bit set) in 7-bit encodings as if they had occurred in an 8-bit encoding of the same language, viz:

  HZ-GB-2312 -> GBK
  ISO-2022-JP -> Shift-JIS
  ISO-2022-KR -> Windows-949

Other browsers (at least Safari and Opera) sometimes ignore the specified MIME charset string and try to detect/sniff the encoding instead, which is prone to error and no less `wrong'. I would suggest that other browsers support the mappings above, which should hopefully enable them to trust the MIME charset string.

***

The remainder of this e-mail gives further details about character sets (single underline) and encodings (double underline), divided into three sections according to the language for which they are intended (Chinese, Japanese and Korean).
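As a rough illustration of the summary section, the proposed superset mappings and the IE-style handling of 8-bit bytes could be sketched as a pair of lookup tables. Everything here is an assumption made for illustration: the function name resolve() is hypothetical, the Python codec names (cp949, gbk, cp932) stand in for Windows-949, GBK/Windows-936 and Windows-31J, and switching the whole decoder is a simplification of what IE does.

```python
# Label -> superset encoding, following the mappings proposed above.
SUPERSET = {
    "euc-kr": "cp949",      # EUC-KR < Windows-949
    "euc-cn": "gbk",        # EUC-CN < GBK
    "shift_jis": "cp932",   # Shift_JIS < Windows-31J
}

# IE-style fallback when bytes >= 0x80 turn up in a 7-bit encoding.
EIGHT_BIT_FALLBACK = {
    "hz-gb-2312": "gbk",
    "iso-2022-jp": "cp932",
    "iso-2022-kr": "cp949",
}


def resolve(label: str, data: bytes) -> str:
    """Return the name of the Python codec this sketch would decode with."""
    key = label.strip().lower()
    if key in EIGHT_BIT_FALLBACK and any(b >= 0x80 for b in data):
        # Simplification: the whole document is handed to the 8-bit codec.
        return EIGHT_BIT_FALLBACK[key]
    return SUPERSET.get(key, key)


well_formed = b"\x1b$B0!\x1b(B"   # row 16, column 1 via an escape sequence
mislabelled = b"\x88\x9f"         # the same character as Shift-JIS bytes

for data in (well_formed, mislabelled):
    codec = resolve("ISO-2022-JP", data)
    print(codec, data.decode(codec))
```

With such a table the declared MIME charset string is still trusted; only stray 8-bit bytes change which decoder is used.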
***********
* Chinese *
***********

Character sets for simplified Chinese characters
------------------------------------------------

GB2312-80 < GB 6345.1-86 < ISO-IR-165:1992
GB2312-80 < GB 8565.2-88 < ISO-IR-165:1992

(It follows that GB 6345.1-86 and GB 8565.2-88 have no conflicting assignments.)

Most browsers support only GB2312-80. Safari supports ISO-IR-165:1992 as well, but the two are kept distinct.

Character sets for traditional Chinese characters
-------------------------------------------------

CNS 11643-1992: Plane 1 and plane 2 defined in 1986. Plane 14 added in 1988. Plane 15 added in 1988. In 1992, plane 3 was defined as the first part of plane 14, the remainder of plane 14 was put into plane 4, many of the characters from plane 15 were added to planes 4--7, other characters were added to planes 4--7, and planes 14 and 15 were removed; the result was seven planes, 1--7.

HZ encoding for simplified Chinese
==================================

HZ-GB-2312 supports:
- ASCII
- GB2312-80

IE furthermore allows GB2312-80 encoded as in EUC-CN, as well as GBK extensions (8-bit).

ISO-2022 encoding for traditional and simplified Chinese
========================================================

ISO-2022-CN supports:
- ASCII
- GB2312-80
- CNS 11643-1992, planes 1 and 2

ISO-2022-CN-EXT supports in addition:
- ISO-IR-165
- CNS 11643-1992, planes 3--7
- (theoretically, further character sets, but which cannot be selected because escape sequences have not been allocated)

IE does not support ISO-2022 for Chinese. ISO-2022-CN-EXT is implemented in Safari (complete) and Firefox (missing ISO-IR-165).

ISO-2022-CN < ISO-2022-CN-EXT

Firefox treats ISO-2022-CN as ISO-2022-CN-EXT, whereas Safari does not. There does not seem to be any reason not to.
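For readers unfamiliar with HZ, the following sketch shows how the HZ encoding described above carries GB2312-80 in a 7-bit stream: the two-byte 7-bit form of each character is wrapped between the markers ~{ and ~}. The helper hz_wrap() is hypothetical and ignores details such as escaping a literal tilde in ASCII text; Python's built-in "hz" codec is used only to check the result.

```python
def hz_wrap(positions):
    """Wrap GB2312-80 (row, column) positions in HZ's ~{ ... ~} markers."""
    body = bytearray()
    for row, col in positions:
        body += bytes([0x20 + row, 0x20 + col])   # 7-bit form of the pair
    return b"~{" + bytes(body) + b"~}"


encoded = b"GB: " + hz_wrap([(16, 1)])
print(encoded)                    # contains only 7-bit bytes
print(encoded.decode("hz"))       # decoded with Python's HZ codec

# The same character in EUC-CN is the same pair with the high bits set.
print(encoded.decode("hz")[-1].encode("gb2312").hex())
```

The IE behaviour mentioned above amounts to also accepting the high-bit (EUC-CN or GBK) form of these pairs inside an HZ-labelled document.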
EUC encoding for simplified Chinese and extensions thereof
==========================================================

EUC-CN supports:
- ASCII
- GB2312-80

GBK adds in particular all Chinese characters in Unicode 1.1 not included in GB2312-80.

GB18030 adds all remaining Unicode characters.

EUC-CN < GBK < GB18030

Windows-936 is very similar to GBK and probably the only variant implemented in browsers. Windows-936 includes a few characters in addition to GBK; conversely, GBK apparently includes some characters not in Windows-936, at least not originally. GBK should probably refer to Windows-936, possibly with later additions (I have yet to see an official GBK specification).

All browsers (except Firefox) treat EUC-CN as GBK/Windows-936. Firefox instead treats EUC-CN as GB18030, keeping GBK/Windows-936 apart.

Only Safari supports Mac-specific additions to EUC-CN called MacOS-S; IE and Opera handle this as pure EUC-CN, which is a fairly good fall-back mechanism.

EUC encoding for traditional Chinese
====================================

EUC-TW supports:
- ASCII
- CNS 11643-1992, planes 1--7

It may previously have included:
- CNS 11643-1992, planes 14 and 15

DEC Hanyu provides a different (8-bit) encoding for:
- CNS 11643-1992, planes 2--4

All browsers support ASCII and CNS 11643-1992, plane 1 (albeit IE, Safari and Firefox each require a different MIME charset string!). Safari, Firefox and Opera support CNS 11643-1992, plane 2 encoding according to EUC-TW; IE instead supports it when encoded as DEC Hanyu. Opera supports plane 14; Firefox supports planes 3--7.

EUC-TW and DEC Hanyu are not conflicting, so it would be possible to support planes 2--4 (or at least plane 2) according to both standards. Plane 1 can already be encoded in two different ways according to EUC-TW (and Opera supports both), so this does not really add any problems. Similarly, supporting planes 14 and 15 as well as planes 2--7 is completely unproblematic.
However, the current degree of incompatibility between browsers would seem to suggest that EUC-TW is not a very popular encoding.

Big5 encoding for traditional Chinese
=====================================

Big5 is (roughly) an encoding that supports:
- ASCII
- CNS 11643-1992, planes 1 and 2

(Historically, Big5 predates CNS 11643-1992.)

Extensions include:
- ETen
- MacOS-T
- Hong Kong extensions
- Big5+
- Big5E
- Big5-2003
- Unicode-At-On

All browsers support some ETen extensions; only IE does not support them all. ETen and MacOS-T extensions are compatible, and IE supports both (given the MIME charset string referring to MacOS-T), but Safari does not and this is almost certainly not needed.

Hong Kong extensions are incompatible with ETen extensions, so a separate MIME charset string is needed to activate Hong Kong extensions.

Big5 < Big5+
Big5 < Big5E
Big5 < Big5-2003

However, these three extensions are all incompatible, and at least some of them are incompatible with other extensions. Big5+ and the later, smaller Big5E are not implemented in browsers, as far as I can tell. Firefox adds characters from Big5-2003 and (according to bug reports) Unicode-At-On. I have not found an authoritative Big5-2003 specification, but handling Big5 as Big5-2003 (adding at least ETen extensions if they are not part of Big5-2003 already) might be a good idea.

ETen encoding for traditional Chinese
=====================================

ETen is an encoding that supports:
- ASCII
- CNS 11643-1992, planes 1 and 2
- ETen extensions

Only IE supports this particular encoding.
************
* Japanese *
************

Character sets for Japanese characters
--------------------------------------

JIS X 0201 (Katakana)
JIS C 6226-1978
JIS X 0208-1983
JIS X 0208-1990/1997
JIS X 0212-1990
JIS X 0213-2000 Plane 1
JIS X 0213-2000 Plane 2
JIS X 0213-2004 Plane 1

JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997
  < JIS X 0213-2000 Plane 1 < JIS X 0213-2004 Plane 1

(There are a few incompatible changes, but those should officially be regarded as `corrections'.)

Characters from JIS X 0212-1990 were included in JIS X 0213-2000 Plane 1.

There is also a Japanese ASCII variant (JIS Roman) with yen and macron instead of backslash and tilde. However, IE makes no distinction between ASCII and JIS Roman, but uses a hybrid if either is needed. IE furthermore shows a yen symbol for \.

ISO-2022 encoding for Japanese
==============================

ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022-JP-2004

JIS is a precursor of ISO-2022-JP.

No browser distinguishes between any of these encodings.

The following lists the character sets that can be encoded in different variants of ISO-2022 according to the specifications.

ISO-2022-JP:
- ASCII
- JIS Roman
- JIS C 6226-1978
- JIS X 0208-1983

ISO-2022-JP-1 adds:
- JIS X 0212-1990

ISO-2022-JP-2 adds:
- GB 2312-80 (Chinese)
- KS X 1001 (Korean)
- ISO 8859-1 (Western-European)
- ISO 8859-7 (Monotonic Greek)

ISO-2022-JP-3 adds:
- Katakana
- JIS X 0213-2000 Plane 1
- JIS X 0213-2000 Plane 2

ISO-2022-JP-2004 adds:
- JIS X 0213-2004 Plane 1

In practice, the situation is rather different:

The escape sequences reserved for JIS C 6226-1978 and JIS X 0208-1983 instead select the superset JIS X 0208-1990/1997, whose own escape sequence is not recognised.

IE incorrectly selects JIS X 0208-1990/1997 also when the escape sequence for JIS X 0212-1990 is used, but the two are completely incompatible. I have no idea whether it is common to use the wrong escape sequence in this particular case.
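The gap between the registered meaning of an escape sequence and what browsers do with it can be sketched as a small lookup table. This is an illustration only: the function designations() is hypothetical, the `selected in practice' column is simply a reading of this e-mail rather than a tested browser table, and ESC ( H is assumed to be the Swedish escape sequence that is meant.

```python
# Escape sequence -> (set it is registered for, set selected in practice).
DESIGNATIONS = {
    b"\x1b(B": ("ASCII", "ASCII"),
    b"\x1b(J": ("JIS Roman", "JIS Roman"),
    b"\x1b(H": ("Swedish (ISO-IR-11)", "JIS Roman"),   # assumed sequence
    b"\x1b(I": ("JIS X 0201 Katakana", "JIS X 0201 Katakana"),
    b"\x1b$@": ("JIS C 6226-1978", "JIS X 0208-1990/1997"),
    b"\x1b$B": ("JIS X 0208-1983", "JIS X 0208-1990/1997"),
    b"\x1b$(D": ("JIS X 0212-1990", "JIS X 0212-1990"),
}


def designations(data: bytes):
    """List the (registered, in-practice) sets designated in a byte stream."""
    found = []
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            # Try longer sequences first so ESC $ ( D is matched whole.
            for seq in sorted(DESIGNATIONS, key=len, reverse=True):
                if data.startswith(seq, i):
                    found.append(DESIGNATIONS[seq])
                    i += len(seq)
                    break
            else:
                i += 1   # unknown escape sequence: skip the ESC byte only
        else:
            i += 1
    return found


sample = b"\x1b$@0!\x1b(H"   # 1978 escape, one character, Swedish escape
for registered, selected in designations(sample):
    print(registered, "->", selected)
```

A real decoder would of course also have to interpret the bytes between the escape sequences; the table only captures which character set each sequence ends up selecting.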
Only Firefox supports the non-Japanese character sets added in ISO-2022-JP-2.

No browser supports JIS X 0213 (in ISO-2022 encoding).

Only Safari does not include IBM extensions, in both NEC and to the extent possible IBM (non-Shift-JIS) positions.

IE furthermore interprets 8-bit characters as Shift-JIS and allows shift-in/shift-out control characters to indicate Katakana, as defined in the earlier JIS standard. Other browsers might want to add this. (Some other IE extensions are completely insane and almost certainly not needed for compatibility.)

The escape sequence reserved for 7-bit Swedish (which is not included in any ISO-2022-JP variant) must instead select JIS Roman.

EUC encoding for Japanese
=========================

EUC-JP supports:
- ASCII
- JIS X 0208-1990/1997
- Katakana
- JIS X 0212-1990

IE and Safari do not support JIS X 0212-1990.

IBM extensions in NEC and to the extent possible IBM (non-Shift-JIS) positions are universally supported (except for Safari, which does not support NEC positions).

Shift-JIS encoding for Japanese
===============================

Shift-JIS supports:
- ASCII
- Katakana
- JIS X 0208-1990/1997

All browsers furthermore support NEC symbols as well as IBM extensions in both NEC and IBM (Shift-JIS) positions. This is actually Windows-932:

Shift-JIS < Windows-932

There are also other extensions, incompatible with Windows-932:

Shift-JIS < Shift-JIS X0213 < Shift-JIS-2004

Shift-JIS X0213 adds:
- Shift_JISX0213-2000 plane 1
- Shift_JISX0213-2000 plane 2

Shift-JIS-2004 adds instead:
- Shift_JISX0213-2004 plane 1
- Shift_JISX0213-2000 plane 2 (same as previous encoding)

Safari supports the latter, but I have not yet found a MIME charset string which triggers it. (Surprisingly and somewhat stupidly, Shift_JIS_X0213-2000 triggers Windows-932 in Safari, whereas no other browser even supports this string.)
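The relation Shift-JIS < Windows-932 can be illustrated with Python's codecs, on the assumption that its "shift_jis" codec covers only the plain encoding and that "cp932" corresponds to Windows-932; the helper encodable() is hypothetical. The circled digit one is used as an example of an NEC symbol.

```python
def encodable(ch: str, codec: str):
    """Return the encoded bytes, or None if the codec lacks the character."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


# A kanji from JIS X 0208 and an NEC symbol that Windows-932 adds.
for ch in ("亜", "①"):
    plain = encodable(ch, "shift_jis")
    windows = encodable(ch, "cp932")
    print(ch, plain, windows)
```

A page labelled Shift_JIS but containing such NEC symbols only displays correctly if the label is treated as the superset, which is the case for the mapping from Shift_JIS to Windows-31J suggested earlier.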
**********
* Korean *
**********

Character sets for Korean characters
------------------------------------

KS X 1001:1992

Two characters were added in 1998, and another in 2002. Only Safari still does not support the additions from 1998.

Hangul syllables which are not included in precomposed form can be encoded as 8-byte sequences, 2 bytes for each of the following: specific `composition' code, initial consonant, medial vowel, final consonant. This is not supported unless noted otherwise below. (Not actually tested for Johab, for which it is irrelevant.)

IE uses an ASCII/KS-Roman hybrid with won instead of backslash (when compared to ASCII) and furthermore displays won for \.

ISO-2022 encoding for Korean
============================

ISO-2022-KR supports:
- ASCII
- KS X 1001:1992

Safari displays won instead of backslash (as IE does for all encodings).

IE treats 8-bit characters as Windows-949.

EUC encoding for Korean
=======================

EUC-KR supports:
- ASCII
- KS X 1001:1992

Firefox supports 8-byte Hangul encoding.

Only Safari does not support the Microsoft UHC extension (which adds all missing precomposed hangul). The combination is also known as Windows-949.

Only Safari supports the Mac-specific HangulTalk extensions.

EUC-KR < Windows-949
EUC-KR < HangulTalk

Johab encoding for Korean
=========================

Johab supports:
- ASCII
- KS X 1001:1992 (non-hangul)
- All possible hangul (including those in KS X 1001:1992)

This encoding contains the same characters as Windows-949, but arranged more systematically. Unfortunately, the encoding is not compatible with EUC-KR.

Opera does not support Johab. Safari does not render my test page at all.

-- 
Øistein E. Andersen
Received on Monday, 13 April 2009 17:14:25 UTC