- From: Ian Hickson <ian@hixie.ch>
- Date: Wed, 3 Jun 2009 22:19:05 +0000 (UTC)
On Tue, 14 Apr 2009, Øistein E. Andersen wrote:
>
> This e-mail is an attempt to give a relatively concise yet reasonably complete
> overview of non-Unicode character sets and encodings for `Chinese characters',
> excluding those which are not supported by at least one of the four browsers
> IE, Safari, Firefox and Opera (henceforth `all browsers'), and tentatively
> avoiding technical details which are out of scope for HTML5 unless they are
> important to gain a general understanding of the relevant issues.
>
> To avoid unnecessary confusion, the following three concepts are kept
> distinct:
>
> 1) Character set: A collection of characters, typically defined as a matrix
> with 94 rows and 94 columns. (A character set with more than one matrix is
> said to have multiple planes.) The ones officially registered `for use with
> escape sequences' (typically in ISO-2022 encodings, see below) can be found at
> <http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm>.
>
> 2) Encoding: Defines how a given character (typically defined by its row and
> column numbers) from a given character set can be encoded as a sequence of
> bytes. All the encodings discussed below allow multiple character sets to be
> encoded. [ISO-2022 encodings use only 7-bit bytes and employ escape sequences
> to switch between different character sets. EUC encodings use bytes < 128 for
> ASCII (or something similar) and bytes >= 128 to encode other character sets.]
>
> 3) MIME charset string: This is the string used, e.g., in an HTTP Content-Type
> header to indicate the *encoding*. Many of these can be found at
> <http://www.iana.org/assignments/character-sets>.
>
> Some information about browser support for specific character sets, encodings
> and MIME charset strings can be found at
> <http://coq.no/character-tables/mime/iso-2022/en>,
> <http://coq.no/character-tables/mime/euc/en> and
> <http://coq.no/character-tables/mime/locale-specific/en>.
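The distinction between a character set (a 94 x 94 matrix) and an encoding (a rule for turning row and column numbers into bytes) can be illustrated with a minimal Python sketch. The helper names `euc_bytes` and `iso2022_bytes` are hypothetical and not taken from any specification; the sketch assumes the usual conventions that the 8-bit EUC form adds 0xA0 to each coordinate and the 7-bit ISO-2022 form adds 0x20, and it relies on Python's own `gb2312` and `euc_jp` codecs, which need not match what any browser implements.

```python
def _check(row: int, col: int) -> None:
    # A 94 x 94 character set only has rows and columns 1..94.
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")


def euc_bytes(row: int, col: int) -> bytes:
    """8-bit (EUC-style) form: both bytes have the high bit set."""
    _check(row, col)
    return bytes([0xA0 + row, 0xA0 + col])


def iso2022_bytes(row: int, col: int) -> bytes:
    """7-bit (ISO-2022-style) form: both bytes stay below 128."""
    _check(row, col)
    return bytes([0x20 + row, 0x20 + col])


# The same position (row 16, column 1) names different characters in
# different character sets, even though the EUC byte pair is identical.
pair = euc_bytes(16, 1)
print(pair.hex(), pair.decode("gb2312"), pair.decode("euc_jp"))
```

The same byte pair is only meaningful once the character set is known, which is why the MIME charset string (or an escape sequence, in ISO-2022) has to identify it.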
>
> The notation a < b means that a is a proper subset of b; a and b can be either
> character sets or encodings.
>
>
> ******************************************
> * What should HTML 5 say about all this? *
> ******************************************
>
> This section gives a summary of superset encodings which are either
> universally supported or potentially needed for compatibility.
>
> (Anyone who is going to read the entire e-mail will probably prefer to read
> the sections *Chinese*, *Japanese* and *Korean* at this point and return to
> this section afterwards.)
>
>
> Superset encodings (stricto sensu)
> ----------------------------------
>
> HTML5 currently contains a table of encoding aliases, of which the following
> involve Chinese characters:
>
> 1) EUC-KR -> Windows-949
> 2) GB2312 -> GBK
> 3) GB_2312-80 -> GBK
> 4) KS_C_5601-1987 -> Windows-949
> 5) x-x-big5 -> Big5
>
> EUC-KR < Windows-949, and all browsers do 1), so this is reasonable and
> probably needed.
>
> GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80,
> which can be expressed not only in EUC-CN encoding, but also in ISO-2022-CN
> encoding and HZ encoding. GBK, on the other hand, is an encoding. EUC-CN <
> GBK. It would be more correct to remove 2) and 3) and instead add:
> EUC-CN -> GBK
>
> Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and registered
> MIME charset strings include GB2312 and GB_2312-80 as distinct entries
> (but not EUC-CN), so a note to this effect might be appropriate.
>
> (Additionally, GBK is slightly ambiguous, so make sure not to reference an
> incomplete or outdated version without pointing out necessary
> amendments/additions.)
>
> Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or
> `KS_C_5601-1987', which Ken Lunde characterises as `incorrect and dangerous'
> in his book /CJKV Information Processing/. It would be more correct to remove
> 4).
>
> Unlike EUC-CN, EUC-KR is a registered MIME charset string, but KS_C_5601-1987
> has a distinct entry, so a note might again be appropriate.
>
> As for 5), the MIME charset string x-x-big5 does indeed correspond to Big5
> encoding (or rather an extension thereof) in all browsers but Opera. There is
> a large number of unregistered charset strings, however, and the other
> mappings in this table are between encodings. Unless x-x-big5 is actually
> supposed to refer to an encoding distinct from Big5, 5) should be removed.
>
> Instead (depending on the reference ultimately given for Big5), it may be
> necessary to note that at least certain ETen extensions should be regarded as
> part of Big5.

I believe you misunderstand the purpose of this table. The idea is to give
a mapping of _labels_ to encodings, not encodings to encodings. I've
clarified the text to this effect.

> In addition, Shift_JIS < Windows-31J, and all browsers implement this mapping,
> so the following should be added:
> Shift_JIS -> Windows-31J

Added.

I haven't added the mappings described below, since they are not all
implemented uniformly. If specific mappings are important, I recommend
contacting the browser vendors and getting them to implement them. I would
like to have as few compatibility mappings as possible.

> Further superset encodings (probably not needed)
> ------------------------------------------------
>
> ISO-2022-CN < ISO-2022-CN-EXT
>
> This is reasonable, but probably not necessary: Firefox does it, Safari does
> not, Opera does not implement the superset, IE does not even implement the
> subset. Distinguishing between them is pointless.
>
>
> EUC-CN < GBK < GB18030
>
> The first step is probably sufficient, and the second is potentially
> problematic if an incompatible extension to GBK were to be invented.
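The chain EUC-CN < GBK < GB18030 can be checked informally with a short Python sketch. It assumes that Python's `gb2312`, `gbk` and `gb18030` codecs are reasonable stand-ins for EUC-CN, GBK and GB18030; they are not necessarily the tables any browser uses, and the helper names `decodes_identically` and `encodable` are hypothetical.

```python
def decodes_identically(data: bytes, narrow: str, wide: str) -> bool:
    """True if bytes valid in the narrower encoding mean the same in the wider one."""
    return data.decode(narrow) == data.decode(wide)


def encodable(text: str, codec: str) -> bool:
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Simplified Chinese text encoded as EUC-CN reads the same under the supersets,
# which is why a label such as GB2312 can safely be decoded as GBK.
sample = "中文".encode("gb2312")
print(decodes_identically(sample, "gb2312", "gbk"))
print(decodes_identically(sample, "gb2312", "gb18030"))

# A traditional character is outside GB 2312-80 but present in the supersets.
print(encodable("體", "gb2312"), encodable("體", "gbk"), encodable("體", "gb18030"))
```

Decoding with the superset is harmless for conforming EUC-CN content; it only changes the outcome for byte sequences that the subset leaves undefined.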
>
>
> ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022-JP-2004
>
> No browser attempts to distinguish between these, which would be completely
> meaningless. On the other hand, IE only implements ISO-2022-JP, and only
> Firefox implements ISO-2022-JP-2, so these may not actually be necessary.
>
>
> Shift_JIS_X0213-2000 < Shift_JIS-2004
>
> Safari arguably does this, and there is no need to make a distinction between
> them, but no browser seems to implement either in a meaningful way at the
> moment.
>
>
> Superset *character sets* (universally recognised)
> --------------------------------------------------
>
> JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997
>
> Whenever one of the subsets is referred to in any variety of ISO-2022-JP, the
> superset is used instead.
>
>
> JIS X 0208-1990/1997 should be understood as including NEC and IBM extensions.
> This character set is part of all varieties of ISO-2022-JP, as well as EUC-JP
> and Shift-JIS.
>
>
> KS X 1001:1992 < KS X 1001:1998 < KS X 1001:2002
>
> Only three characters have been added in total. All but Safari have
> implemented the two characters added in 1998. This character set is part of
> ISO-2022-KR, EUC-KR and Johab.
>
>
> Other additions to ISO-2022 encodings (potentially essential)
> -------------------------------------------------------------
>
> All varieties of ISO-2022-JP must include the Katakana character set which was
> not officially added to the standard until ISO-2022-JP-3.
>
> The escape sequence for Swedish should be accepted as a synonym for JIS-Roman.
>
> (IE furthermore allows katakana to be selected using shift-in/out.)
>
> All these extensions were originally defined in the older JIS encoding, which
> predates ISO-2022-JP.
>
>
> 8-bit bytes in 7-bit encodings
> ------------------------------
>
> IE interprets 8-bit bytes (i.e., octets with the high bit set) in 7-bit
> encodings as if they had occurred in an 8-bit encoding of the same language,
> viz:
>
> HZ-GB-2312 -> GBK
> ISO-2022-JP -> Shift-JIS
> ISO-2022-KR -> Windows-949
>
> Other browsers (at least Safari and Opera) sometimes ignore the specified MIME
> charset string and try to detect/sniff the encoding instead, which is prone to
> error and no less `wrong'.
>
> I would suggest that other browsers support the mappings above, which should
> hopefully enable them to trust the MIME charset string.
>
> ***
>
> The remainder of this e-mail gives further details about character sets
> (single underline) and encodings (double underline), divided into three
> sections according to the language for which they are intended (Chinese,
> Japanese and Korean).
>
>
> ***********
> * Chinese *
> ***********
>
> Character sets for simplified Chinese characters
> ------------------------------------------------
>
> GB2312-80 < GB 6345.1-86 < ISO-IR-165:1992
>
> GB2312-80 < GB 8565.2-88 < ISO-IR-165:1992
>
> (It follows that GB 6345.1-86 and GB 8565.2-88 have no conflicting
> assignments.)
>
> Most browsers support only GB2312-80. Safari supports ISO-IR-165:1992 as
> well, but the two are kept distinct.
>
>
> Character sets for traditional Chinese characters
> -------------------------------------------------
>
> CNS 11643-1992:
>   Plane 1 and plane 2 defined in 1986.
>   Plane 14 added in 1988.
>   Plane 15 added in 1988.
>   In 1992, plane 3 was defined as the first part of plane 14,
>   the remainder of plane 14 was put into plane 4, many of the
>   characters from plane 15 were added to planes 4--7, other
>   characters were added to planes 4--7, and planes 14 and 15 were
>   removed; the result was seven planes, 1--7.
>
>
> HZ encoding for simplified Chinese
> ==================================
>
> HZ-GB-2312 supports:
> - ASCII
> - GB2312-80
>
> IE furthermore allows GB2312-80 encoded as in EUC-CN, as well as GBK
> extensions (8-bit).
>
>
> ISO-2022 encoding for traditional and simplified Chinese
> ========================================================
>
> ISO-2022-CN supports:
> - ASCII
> - GB2312-80
> - CNS 11643-1992, planes 1 and 2
>
> ISO-2022-CN-EXT supports in addition:
> - ISO-IR-165
> - CNS 11643-1992, planes 3--7
> - (theoretically, further character sets, but which cannot be
>   selected because escape sequences have not been allocated)
>
> IE does not support ISO-2022 for Chinese.
> ISO-2022-CN-EXT is implemented in Safari (complete) and Firefox (missing
> ISO-IR-165).
>
> ISO-2022-CN < ISO-2022-CN-EXT
>
> Firefox treats ISO-2022-CN as ISO-2022-CN-EXT, whereas Safari does not. There
> does not seem to be any reason not to.
>
>
> EUC encoding for simplified Chinese and extensions thereof
> ==========================================================
>
> EUC-CN supports:
> - ASCII
> - GB2312-80
>
> GBK adds in particular all Chinese characters in Unicode 1.1 not included in
> GB2312-80.
>
> GB18030 adds all remaining Unicode characters.
>
> EUC-CN < GBK < GB18030
>
> Windows-936 is very similar to GBK and probably the only variant implemented
> in browsers. Windows-936 includes a few characters in addition to GBK;
> conversely, GBK apparently includes some characters not in Windows-936, at
> least not originally. GBK should probably refer to Windows-936, possibly with
> later additions (I have yet to see an official GBK specification).
>
> All browsers (except Firefox) treat EUC-CN as GBK/Windows-936.
>
> Firefox instead treats EUC-CN as GB18030, keeping GBK/Windows-936 apart.
>
> Only Safari supports Mac-specific additions to EUC-CN called MacOS-S; IE and
> Opera handle this as pure EUC-CN, which is a fairly good fall-back mechanism.
>
>
> EUC encoding for traditional Chinese
> ====================================
>
> EUC-TW supports:
> - ASCII
> - CNS 11643-1992, planes 1--7
>
> It may previously have included:
> - CNS 11643-1992, planes 14 and 15
>
> DEC Hanyu provides a different (8-bit) encoding for:
> - CNS 11643-1992, planes 2--4
>
> All browsers support ASCII and CNS 11643-1992, plane 1 (albeit IE, Safari and
> Firefox each require a different MIME charset string!).
>
> Safari, Firefox and Opera support CNS 11643-1992, plane 2 encoding according
> to EUC-TW; IE instead supports it when encoded as DEC Hanyu.
>
> Opera supports plane 14; Firefox supports planes 3--7.
>
> EUC-TW and DEC Hanyu are not conflicting, so it would be possible to support
> planes 2--4 (or at least plane 2) according to both standards. Plane 1 can
> already be encoded in two different ways according to EUC-TW (and Opera
> supports both), so this does not really add any problems. Similarly,
> supporting planes 14 and 15 as well as planes 2--7 is completely
> unproblematic. However, the current degree of incompatibility between
> browsers would seem to suggest that EUC-TW is not a very popular encoding.
>
>
> Big5 encoding for traditional Chinese
> =====================================
>
> Big5 is (roughly) an encoding that supports:
> - ASCII
> - CNS 11643-1992, planes 1 and 2
>
> (Historically, Big5 predates CNS 11643-1992)
>
> Extensions include:
> - ETen
> - MacOS-T
> - Hong Kong extensions
> - Big5+
> - Big5E
> - Big5-2003
> - Unicode-At-On
>
> All browsers support some ETen extensions; only IE does not support them all.
>
> ETen and MacOS-T extensions are compatible, and IE supports both (given the
> MIME charset string referring to MacOS-T), but Safari does not and this is
> almost certainly not needed.
>
> Hong Kong extensions are incompatible with ETen extensions, so a separate MIME
> charset string is needed to activate Hong Kong extensions.
>
> Big5 < Big5+
> Big5 < Big5E
> Big5 < Big5-2003
>
> However, these three extensions are all incompatible, and at least some of
> them are incompatible with other extensions.
>
> Big5+ and the later, smaller Big5E are not implemented in browsers, as far as
> I can tell.
>
> Firefox adds characters from Big5-2003 and (according to bug reports)
> Unicode-At-On. I have not found an authoritative Big5-2003 specification, but
> handling Big5 as Big5-2003 (adding at least ETen extensions if they are not
> part of Big5-2003 already) might be a good idea.
>
>
> ETen encoding for traditional Chinese
> =====================================
>
> ETen is an encoding that supports:
> - ASCII
> - CNS 11643-1992, planes 1 and 2
> - ETen extensions
>
> Only IE supports this particular encoding.
>
>
> ************
> * Japanese *
> ************
>
> Character sets for Japanese characters
> --------------------------------------
>
> JIS X 0201 (Katakana)
> JIS C 6226-1978
> JIS X 0208-1983
> JIS X 0208-1990/1997
> JIS X 0212-1990
> JIS X 0213-2000 Plane 1
> JIS X 0213-2000 Plane 2
> JIS X 0213-2004 Plane 1
>
> JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997 < JIS X 0213-2000
> Plane 1 < JIS X 0213-2004 Plane 1
>
> (There are a few incompatible changes, but those should officially be regarded
> as `corrections'.)
>
> Characters from JIS X 0212-1990 were included in JIS X 0213-2000 Plane 1.
>
> There is also a Japanese ASCII variant (JIS Roman) with yen and macron instead
> of backslash and tilde. However, IE makes no distinction between ASCII and
> JIS Roman, but uses a hybrid if either is needed.
>
> IE furthermore shows a yen symbol for \.
>
>
> ISO-2022 encoding for Japanese
> ==============================
>
> ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022-JP-2004
>
> JIS is a precursor for ISO-2022-JP.
>
> No browser distinguishes between any of these encodings.
>
> The following lists the character sets that can be encoded in different
> variants of ISO-2022 according to the specifications.
>
> ISO-2022-JP:
> - ASCII
> - JIS Roman
> - JIS C 6226-1978
> - JIS X 0208-1983
>
> ISO-2022-JP-1 adds:
> - JIS X 0212-1990
>
> ISO-2022-JP-2 adds:
> - GB 2312-80 (Chinese)
> - KS X 1001 (Korean)
> - ISO 8859-1 (Western-European)
> - ISO 8859-7 (Monotonic Greek)
>
> ISO-2022-JP-3 adds:
> - Katakana
> - JIS X 0213-2000 Plane 1
> - JIS X 0213-2000 Plane 2
>
> ISO-2022-JP-2004 adds:
> - JIS X 0213-2004 Plane 1
>
> In practice, the situation is rather different:
>
> The escape sequences reserved for JIS C 6226-1978 and JIS X 0208-1983 instead
> select the superset JIS X 0208-1990/1997, whose escape sequence is not
> recognised.
>
> IE incorrectly selects JIS X 0208-1990/1997 also when the escape sequence for
> JIS X 0212-1990 is used, but the two are completely incompatible. I have no
> idea whether it is common to use the wrong escape sequence in this particular
> case.
>
> Only Firefox supports the non-Japanese character sets added in ISO-2022-JP-2.
>
> No browser supports JIS X 0213 (in ISO-2022 encoding).
>
> Only Safari does not include IBM extensions, in both NEC and to the extent
> possible IBM (non-Shift-JIS) positions.
>
> IE furthermore interprets 8-bit characters as Shift-JIS and allows
> shift-in/shift-out control characters to indicate Katakana, as defined in the
> earlier JIS standard. Other browsers might want to add this. (Some other IE
> extensions are completely insane and almost certainly not needed for
> compatibility.)
>
> The escape sequence reserved for 7-bit Swedish (which is not included in any
> ISO-2022-JP variant) must instead select JIS Roman.
>
>
> EUC encoding for Japanese
> =========================
>
> EUC-JP supports:
> - ASCII
> - JIS X 0208-1990/1997
> - Katakana
> - JIS X 0212-1990
>
> IE and Safari do not support JIS X 0212-1990.
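The four character sets carried by EUC-JP can be told apart from the lead byte alone, as the following Python sketch shows. It assumes the conventional EUC-JP layout (a single byte below 128 for ASCII, the prefix 0x8E for Katakana, the prefix 0x8F for JIS X 0212-1990, and two high bytes for JIS X 0208) and uses Python's `euc_jp` codec, whose tables need not match those of any browser. The function name `euc_jp_code_set` is hypothetical.

```python
def euc_jp_code_set(char: str) -> str:
    """Name the character set that a single character uses in EUC-JP."""
    data = char.encode("euc_jp")
    if len(data) == 1:
        return "ASCII"
    if data[0] == 0x8E:   # single-shift 2 prefix
        return "Katakana (JIS X 0201)"
    if data[0] == 0x8F:   # single-shift 3 prefix
        return "JIS X 0212-1990"
    return "JIS X 0208"


for ch in ["A", "\uff71", "\u4e9c", "\u4e02"]:
    print(repr(ch), ch.encode("euc_jp").hex(), euc_jp_code_set(ch))
```

A decoder without JIS X 0212-1990 support, as reported above for IE and Safari, would have no mapping for the three-byte sequences that begin with 0x8F.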
>
> IBM extensions in NEC and to the extent possible IBM (non-Shift-JIS) positions
> are universally supported (except for Safari, which does not support NEC
> positions).
>
>
> Shift-JIS encoding for Japanese
> ===============================
>
> Shift-JIS supports:
> - ASCII
> - Katakana
> - JIS X 0208-1990/1997
>
> All browsers furthermore support NEC symbols as well as IBM extensions in
> both NEC and IBM (Shift-JIS) positions. This is actually Windows-932:
>
> Shift-JIS < Windows-932
>
> There are also other extensions, incompatible with Windows-932:
>
> Shift-JIS < Shift-JIS X0213 < Shift-JIS-2004
>
> Shift-JIS X0213 adds:
> - Shift_JISX0213-2000 plane 1
> - Shift_JISX0213-2000 plane 2
>
> Shift-JIS-2004 adds instead:
> - Shift_JISX0213-2004 plane 1
> - Shift_JISX0213-2000 plane 2 (same as previous encoding)
>
> Safari supports the latter, but I have not yet found a MIME charset string
> which triggers it. (Surprisingly and somewhat stupidly, Shift_JIS_X0213-2000
> triggers Windows-932 in Safari, whereas no other browser even supports this
> string.)
>
>
> **********
> * Korean *
> **********
>
> Character sets for Korean characters
> ------------------------------------
>
> KS X 1001:1992
>
> Two characters were added in 1998, and another in 2002. Only Safari still
> does not support the additions from 1998.
>
> Hangul syllables which are not included in precomposed form can be encoded as
> 8-byte sequences, 2 bytes for each of the following: specific
> `composition' code, initial consonant, medial vowel, final consonant. This is
> not supported unless noted otherwise below. (Not actually tested for Johab,
> for which it is irrelevant.)
>
> IE uses an ASCII/KS-Roman hybrid with won instead of backslash (when compared
> to ASCII) and furthermore displays won for \.
>
>
> ISO-2022 encoding for Korean
> ============================
>
> ISO-2022-KR supports:
> - ASCII
> - KS X 1001:1992
>
> Safari displays won instead of backslash (as IE does it for all encodings).
>
> IE treats 8-bit characters as Windows-949.
>
>
> EUC encoding for Korean
> =======================
>
> EUC-KR supports:
> - ASCII
> - KS X 1001:1992
>
> Firefox supports 8-byte Hangul encoding.
>
> Only Safari does not support the Microsoft UHC extension (which adds all
> missing precomposed hangul). The combination is also known as Windows-949.
>
> Only Safari supports the Mac-specific HangulTalk extensions.
>
> EUC-KR < Windows-949
> EUC-KR < HangulTalk
>
>
> Johab encoding for Korean
> =========================
>
> Johab supports:
> - ASCII
> - KS X 1001:1992 (non-hangul)
> - All possible hangul (including those in KS X 1001:1992)
>
> This encoding contains the same characters as Windows-949, but arranged more
> systematically. Unfortunately, the encoding is not compatible with EUC-KR.
>
> Opera does not support Johab. Safari does not render my test page at all.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 3 June 2009 15:19:05 UTC