[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

On Tue, 14 Apr 2009, Øistein E. Andersen wrote:
>
> This e-mail is an attempt to give a relatively concise yet reasonably complete
> overview of non-Unicode character sets and encodings for `Chinese characters',
> excluding those which are not supported by at least one of the four browsers
> IE, Safari, Firefox and Opera (henceforth `all browsers'), and tentatively
> avoiding technical details which are out of scope for HTML5 unless they are
> important to gain a general understanding of the relevant issues.
> 
> To avoid unnecessary confusion, the following three concepts are kept
> distinct:
> 
> 1) Character set: A collection of characters, typically defined as a matrix
> with 94 rows and 94 columns.  (A character set with more than one matrix is
> said to have multiple planes.)  The ones officially registered `for use with
> escape sequences' (typically in ISO-2022 encodings, see below) can be found at
> <http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm>.
> 
> 2) Encoding: Defines how a given character (typically defined by its row and
> column numbers) from a given character set can be encoded as a sequence of
> bytes.  All the encodings discussed below allow multiple character sets to be
> encoded.  [ISO-2022 encodings use only 7-bit bytes and employ escape sequences
> to switch between different character sets. EUC encodings use bytes < 128 for
> ASCII (or something similar) and bytes >= 128 to encode other character sets.]
> 
> 3) MIME charset string: This is the string used, e.g., in a HTTP Content-Type
> header to indicate the *encoding*.  Many of these can be found at
> <http://www.iana.org/assignments/character-sets>.
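
As a rough illustration of the difference between a character set position
and its encoding, here is a minimal Python sketch. It assumes the usual
94x94 convention, in which EUC adds 0xA0 to the row and column numbers and
the 7-bit (ISO-2022 style) form adds 0x20. The function names are
hypothetical and the offsets are not stated in the message itself.

```python
def _check(row, col):
    # Rows and columns of a 94x94 character set are numbered 1..94.
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")


def euc_bytes(row, col):
    """EUC form: two bytes >= 128 (assumed offset 0xA0)."""
    _check(row, col)
    return bytes([0xA0 + row, 0xA0 + col])


def seven_bit_bytes(row, col):
    """7-bit form, as used after an ISO-2022 escape (assumed offset 0x20)."""
    _check(row, col)
    return bytes([0x20 + row, 0x20 + col])


# Row 16, column 1 of GB 2312-80, in both forms.
print(euc_bytes(16, 1).hex(), seven_bit_bytes(16, 1).hex())
```

The same (row, column) pair thus yields different byte sequences depending
on the encoding, which is why the character set and the encoding are kept
apart above.
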
> 
> Some information about browser support for specific character sets, encodings
> and MIME charset strings can be found at
> <http://coq.no/character-tables/mime/iso-2022/en>,
> <http://coq.no/character-tables/mime/euc/en> and
> <http://coq.no/character-tables/mime/locale-specific/en>.
> 
> The notation a < b means that a is a proper subset of b; a and b can be either
> character sets or encodings.
> 
> 
> ******************************************
> * What should HTML 5 say about all this? *
> ******************************************
> 
> This section gives a summary of superset encodings which are either
> universally supported or potentially needed for compatibility.
> 
> (Anyone who is going to read the entire e-mail will probably prefer to read
> the sections *Chinese*, *Japanese* and *Korean* at this point and return to
> this section afterwards.)
> 
> 
> Superset encodings (stricto sensu)
> ----------------------------------
> 
> HTML5 currently contains a table of encoding aliases, of which the following
> involve Chinese characters:
> 
> 1) EUC-KR          ->  Windows-949
> 2) GB2312          ->  GBK
> 3) GB_2312-80      ->  GBK
> 4) KS_C_5601-1987  ->  Windows-949
> 5) x-x-big5        ->  Big5
> 
> EUC-KR < Windows-949, and all browsers do 1), so this is reasonable and
> probably needed.
> 
> GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80,
> which can be expressed not only in EUC-CN encoding, but also in ISO-2022-CN
> encoding and HZ encoding.  GBK, on the other hand, is an encoding.  EUC-CN <
> GBK.  It would be more correct to remove 2) and 3) and instead add:
>    EUC-CN      ->  GBK
> 
> Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and registered
> MIME charset strings include GB2312 and GB_2312-80 as distinct entries
> (but not EUC-CN), so a note to this effect might be appropriate.
> 
> (Additionally, GBK is slightly ambiguous, so make sure not to reference an
> incomplete or outdated version without pointing out necessary
> amendments/additions.)
> 
> Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or
> `KS_C_5601-1987', which Ken Lunde characterises as `incorrect and dangerous'
> in his book /CJKV Information Processing/.  It would be more correct to remove
> 4).
> 
> Unlike EUC-CN, EUC-KR is a registered MIME charset string, but KS_C_5601-1987
> has a distinct entry, so a note might again be appropriate.
> 
> As for 5), the MIME charset string x-x-big5 does indeed correspond to Big5
> encoding (or rather an extension thereof) in all browsers but Opera.  There is
> a large number of unregistered charset strings, however, and the other
> mappings in this table are between encodings.  Unless x-x-big5 is actually
> supposed to refer to an encoding distinct from Big5, 5) should be removed.
> 
> Instead (depending on the reference ultimately given for Big5), it may be
> necessary to note that at least certain ETen extensions should be regarded as
> part of Big5.

I believe you misunderstand the purpose of this table. The idea is to give 
a mapping of _labels_ to encodings, not encodings to encodings. I've 
clarified the text to this effect.



> In addition, Shift_JIS < Windows-31J, and all browsers implement this mapping,
> so the following should be added:
>    Shift_JIS       ->  Windows-31J

Added.
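
A minimal sketch of the label-to-encoding idea in Python, using only the
mappings discussed in this thread. The case-insensitive, whitespace-trimmed
matching and the name resolve_label are assumptions made for illustration,
not something the specification text is quoted as requiring here.

```python
# Labels (as they might appear in a Content-Type header) mapped to the
# encoding that should actually be used. Keys are stored in lower case.
LABEL_TO_ENCODING = {
    "euc-kr": "Windows-949",
    "gb2312": "GBK",
    "gb_2312-80": "GBK",
    "ks_c_5601-1987": "Windows-949",
    "x-x-big5": "Big5",
    "shift_jis": "Windows-31J",
}


def resolve_label(label):
    """Return the encoding for a label; unknown labels pass through."""
    cleaned = label.strip()
    return LABEL_TO_ENCODING.get(cleaned.lower(), cleaned)


print(resolve_label("EUC-KR"))
```

Because the keys are labels rather than encodings, nothing in such a table
claims that, say, GB_2312-80 is itself an encoding.
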


I haven't added the mappings described below, since they are not all 
implemented uniformly. If specific mappings are important, I recommend 
contacting the browser vendors and getting them to implement them. I would 
like to have as few compatibility mappings as possible.


> Further superset encodings (probably not needed)
> ------------------------------------------------
> 
> ISO-2022-CN < ISO-2022-CN-EXT
> 
> This is reasonable, but probably not necessary: Firefox does it, Safari does
> not, Opera does not implement the superset, IE does not even implement the
> subset.  Distinguishing between them is pointless.
> 
> 
> EUC-CN < GBK < GB18030
> 
> The first step is probably sufficient, and the second is potentially
> problematic if an incompatible extension to GBK were to be invented.
> 
> 
> ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022-JP-2004
> 
> No browser attempts to distinguish between these, which would be completely
> meaningless.  On the other hand, IE only implements ISO-2022-JP, and only
> Firefox implements ISO-2022-JP-2, so these may not actually be necessary.
> 
> 
> Shift_JIS_X0213-2000 < Shift_JIS-2004
> 
> Safari arguably does this, and there is no need to make a distinction between
> them, but no browser seems to implement either in a meaningful way at the
> moment.
> 
> 
> Superset *character sets* (universally recognised)
> --------------------------------------------------
> 
> JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997
> 
> Whenever one of the subsets is referred to in any variety of ISO-2022-JP, the
> superset is used instead.
> 
> 
> JIS X 0208-1990/1997 should be understood as including NEC and IBM extensions.
> This character set is part of all varieties of ISO-2022-JP, as well as EUC-JP
> and Shift-JIS.
> 
> 
> KS X 1001:1992 < KS X 1001:1998 < KS X 1001:2002
> 
> Only three characters have been added in total.  All but Safari have
> implemented the two characters added in 1998.  This character set is part of
> ISO-2022-KR, EUC-KR and Johab.
> 
> 
> Other additions to ISO-2022 encodings (potentially essential)
> -------------------------------------------------------------
> 
> All varieties of ISO-2022-JP must include the Katakana character set which was
> not officially added to the standard until ISO-2022-JP-3.
> 
> The escape sequence for Swedish should be accepted as a synonym for JIS-Roman.
> 
> (IE furthermore allows katakana to be selected using shift-in/shift-out.)
> 
> All these extensions were originally defined in the older JIS encoding, which
> predates ISO-2022-JP.
> 
> 
> 8-bit bytes in 7-bit encodings
> ------------------------------
> 
> IE interprets 8-bit bytes (i.e., octets with the high bit set) in 7-bit
> encodings as if they had occurred in an 8-bit encoding of the same language,
> viz:
> 
> 	HZ-GB-2312   ->   GBK
> 	ISO-2022-JP  ->   Shift-JIS
> 	ISO-2022-KR  ->   Windows-949
> 
> Other browsers (at least Safari and Opera) sometimes ignore the specified MIME
> charset string and try to detect/sniff the encoding instead, which is prone to
> error and no less `wrong'.
> 
> I would suggest that other browsers support the mappings above, which should
> hopefully enable them to trust the MIME charset string.
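
The three mappings above can be sketched as a lookup keyed on the declared
charset. This is deliberately simplified: it only reports which 8-bit
encoding stray high-bit bytes would be read as, and does not attempt a
mixed 7-bit/8-bit decoder. The function name and the lower-casing of the
charset string are assumptions.

```python
# Declared 7-bit charset -> 8-bit encoding used for bytes with the high
# bit set, following the IE behaviour described above.
EIGHT_BIT_FALLBACK = {
    "hz-gb-2312": "GBK",
    "iso-2022-jp": "Shift-JIS",
    "iso-2022-kr": "Windows-949",
}


def high_bit_interpretation(declared, data):
    """Return the fallback encoding name, or None if it does not apply."""
    fallback = EIGHT_BIT_FALLBACK.get(declared.strip().lower())
    if fallback is None:
        return None  # not one of the 7-bit encodings listed above
    if any(byte >= 0x80 for byte in data):
        return fallback
    return None  # pure 7-bit data: nothing to reinterpret


print(high_bit_interpretation("ISO-2022-KR", b"\xb0\xa1"))
```
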
> 
>                               ***
> 
> The remainder of this e-mail gives further details about character sets
> (single underline) and encodings (double underline), divided into three
> sections according to the language for which they are intended (Chinese,
> Japanese and Korean).
> 
> 
> ***********
> * Chinese *
> ***********
> 
> Character sets for simplified Chinese characters
> ------------------------------------------------
> 
> GB2312-80 < GB 6345.1-86 < ISO-IR-165:1992
> 
> GB2312-80 < GB 8565.2-88 < ISO-IR-165:1992
> 
> (It follows that GB 6345.1-86 and GB 8565.2-88 have no conflicting
> assignments.)
> 
> Most browsers support only GB2312-80.  Safari supports ISO-IR-165:1992 as
> well, but the two are kept distinct.
> 
> 
> Character sets for traditional Chinese characters
> -------------------------------------------------
> 
> CNS 11643-1992:
> Plane 1 and plane 2 defined in 1986.
> Plane 14 added in 1988.
> Plane 15 added in 1988.
> In 1992, plane 3 was defined as the first part of plane 14,
> the remainder of plane 14 was put into plane 4, many of the
> characters from plane 15 were added to planes 4--7, other
> characters were added to planes 4--7, and planes 14 and 15 were
> removed; the result was seven planes, 1--7.
> 
> 
> HZ encoding for simplified Chinese
> ==================================
> 
> HZ-GB-2312 supports:
> - ASCII
> - GB2312-80
> 
> IE furthermore allows GB2312-80 encoded as in EUC-CN, as well as GBK
> extensions (8-bit).
> 
> 
> ISO-2022 encoding for traditional and simplified Chinese
> ========================================================
> 
> ISO-2022-CN supports:
> - ASCII
> - GB2312-80
> - CNS 11643-1992, planes 1 and 2
> 
> ISO-2022-CN-EXT supports in addition:
> - ISO-IR-165
> - CNS 11643-1992, planes 3--7
> - (theoretically, further character sets, but which cannot be
>    selected because escape sequences have not been allocated)
> 
> IE does not support ISO-2022 for Chinese.
> ISO-2022-CN-EXT is implemented in Safari (complete) and Firefox (missing
> ISO-IR-165).
> 
> ISO-2022-CN < ISO-2022-CN-EXT
> 
> Firefox treats ISO-2022-CN as ISO-2022-CN-EXT, whereas Safari does not.  There
> does not seem to be any reason not to.
> 
> 
> EUC encoding for simplified Chinese and extensions thereof
> ==========================================================
> 
> EUC-CN supports:
> - ASCII
> - GB2312-80
> 
> GBK adds in particular all Chinese characters in Unicode 1.1 not included in
> GB2312-80.
> 
> GB18030 adds all remaining Unicode characters.
> 
> EUC-CN < GBK < GB18030
> 
> Windows-936 is very similar to GBK and probably the only variant implemented
> in browsers.  Windows-936 includes a few characters in addition to GBK;
> conversely, GBK apparently includes some characters not in Windows-936, at
> least not originally.  GBK should probably refer to Windows-936, possibly with
> later additions (I have yet to see an official GBK specification).
> 
> All browsers (except Firefox) treat EUC-CN as GBK/Windows-936.
> 
> Firefox instead treats EUC-CN as GB18030, keeping GBK/Windows-936 apart.
> 
> Only Safari supports Mac-specific additions to EUC-CN called MacOS-S; IE and
> Opera handle this as pure EUC-CN, which is a fairly good fall-back mechanism.
> 
> 
> EUC encoding for traditional Chinese
> ====================================
> 
> EUC-TW supports:
> - ASCII
> - CNS 11643-1992, planes 1--7
> 
> It may previously have included:
> - CNS 11643-1992, planes 14 and 15
> 
> DEC Hanyu provides a different (8-bit) encoding for:
> - CNS 11643-1992, planes 2--4
> 
> All browsers support ASCII and CNS 11643-1992, plane 1 (albeit IE, Safari and
> Firefox each require a different MIME charset string!).
> 
> Safari, Firefox and Opera support CNS 11643-1992, plane 2 encoding according
> to EUC-TW; IE instead supports it when encoded as DEC Hanyu.
> 
> Opera supports plane 14; Firefox supports planes 3--7.
> 
> EUC-TW and DEC Hanyu are not conflicting, so it would be possible to support
> planes 2--4 (or at least plane 2) according to both standards. Plane 1 can
> already be encoded in two different ways according to EUC-TW (and Opera
> supports both), so this does not really add any problems.  Similarly,
> supporting planes 14 and 15 as well as planes 2--7 is completely
> unproblematic.  However, the current degree of incompatibility between
> browsers would seem to suggest that EUC-TW is not a very popular encoding.
> 
> 
> Big5 encoding for traditional Chinese
> =====================================
> 
> Big5 is (roughly) an encoding that supports:
> - ASCII
> - CNS 11643-1992, planes 1 and 2
> 
> (Historically, Big5 predates CNS 11643-1992)
> 
> Extensions include:
> - ETen
> - MacOS-T
> - Hong Kong extensions
> - Big5+
> - Big5E
> - Big5-2003
> - Unicode-At-On
> 
> All browsers support some ETen extensions; only IE does not support them all.
> 
> ETen and MacOS-T extensions are compatible, and IE supports both (given the
> MIME charset string referring to MacOS-T), but Safari does not and this is
> almost certainly not needed.
> 
> Hong Kong extensions are incompatible with ETen extensions, so a separate MIME
> charset string is needed to activate Hong Kong extensions.
> 
> Big5 < Big5+
> Big5 < Big5E
> Big5 < Big5-2003
> 
> However, these three extensions are all incompatible, and at least some of
> them are incompatible with other extensions.
> 
> Big5+ and the later, smaller Big5E are not implemented in browsers, as far as
> I can tell.
> 
> Firefox adds characters from Big5-2003 and (according to bug reports)
> Unicode-At-On.  I have not found an authoritative Big5-2003 specification, but
> handling Big5 as Big5-2003 (adding at least ETen extensions if they are not
> part of Big5-2003 already) might be a good idea.
> 
> 
> ETen encoding for traditional Chinese
> =====================================
> 
> ETen is an encoding that supports:
> - ASCII
> - CNS 11643-1992, planes 1 and 2
> - ETen extensions
> 
> Only IE supports this particular encoding.
> 
> 
> ************
> * Japanese *
> ************
> 
> Character sets for Japanese characters
> --------------------------------------
> 
> JIS X 0201 (Katakana)
> JIS C 6226-1978
> JIS X 0208-1983
> JIS X 0208-1990/1997
> JIS X 0212-1990
> JIS X 0213-2000 Plane 1
> JIS X 0213-2000 Plane 2
> JIS X 0213-2004 Plane 1
> 
> JIS C 6226-1978 < JIS X 0208-1983 < JIS X 0208-1990/1997 < JIS X 0213-2000
> Plane 1 < JIS X 0213-2004 Plane 1
> 
> (There are a few incompatible changes, but those should officially be regarded
> as `corrections'.)
> 
> Characters from JIS X 0212-1990 were included in JIS X 0213-2000 Plane 1.
> 
> There is also a Japanese ASCII variant (JIS Roman) with yen and macron instead
> of backslash and tilde.  However, IE makes no distinction between ASCII and
> JIS Roman, but uses a hybrid if either is needed.
> 
> IE furthermore shows a yen symbol for &#x5C;.
> 
> 
> ISO-2022 encoding for Japanese
> ==============================
> 
> ISO-2022-JP < ISO-2022-JP-1 < ISO-2022-JP-2 < ISO-2022-JP-3 < ISO-2022-JP-2004
> 
> JIS is a precursor for ISO-2022-JP.
> 
> No browser distinguishes between any of these encodings.
> 
> The following lists the character sets that can be encoded in different
> variants of ISO-2022 according to the specifications.
> 
> ISO-2022-JP:
> - ASCII
> - JIS Roman
> - JIS C 6226-1978
> - JIS X 0208-1983
> 
> ISO-2022-JP-1 adds:
> - JIS X 0212-1990
> 
> ISO-2022-JP-2 adds:
> - GB 2312-80 (Chinese)
> - KS X 1001 (Korean)
> - ISO 8859-1 (Western-European)
> - ISO 8859-7 (Monotonic Greek)
> 
> ISO-2022-JP-3 adds:
> - Katakana
> - JIS X 0213-2000 Plane 1
> - JIS X 0213-2000 Plane 2
> 
> ISO-2022-JP-2004 adds:
> - JIS X 0213-2004 Plane 1
> 
> In practice, the situation is rather different:
> 
> The escape sequences reserved for JIS C 6226-1978 and JIS X 0208-1983 instead
> select the superset JIS X 0208-1990/1997, whose escape sequence is not
> recognised.
> 
> IE incorrectly selects JIS X 0208-1990/1997 also when the escape sequence for
> JIS X 0212-1990 is used, but the two are completely incompatible.  I have no
> idea whether it is common to use the wrong escape sequence in this particular
> case.
> 
> Only Firefox supports the non-Japanese character sets added in ISO-2022-JP-2.
> 
> No browser supports JIS X 0213 (in ISO-2022 encoding).
> 
> Only Safari does not include IBM extensions, in both NEC and to the extent
> possible IBM (non-Shift-JIS) positions.
> 
> IE furthermore interprets 8-bit characters as Shift-JIS and allows
> shift-in/shift-out control characters to indicate Katakana, as defined in the
> earlier JIS standard. Other browsers might want to add this.  (Some other IE
> extensions are completely insane and almost certainly not needed for
> compatibility.)
> 
> The escape sequence reserved for 7-bit Swedish (which is not included in any
> ISO-2022-JP variant) must instead select JIS Roman.
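
The practical behaviour described above can be sketched as a table from
escape sequences to the character set actually selected. The concrete byte
values are not given in the message; they are assumed here from the ISO-IR
registrations (ESC ( B, ESC ( J, ESC ( H, ESC ( I, ESC $ @, ESC $ B). Only
three-byte escape sequences are recognised in this sketch; anything else is
reported as "unknown". The names are hypothetical.

```python
ESC = b"\x1b"

# Escape sequence -> character set selected in practice (per the text above).
PRACTICAL_DESIGNATIONS = {
    ESC + b"(B": "ASCII",
    ESC + b"(J": "JIS Roman",
    ESC + b"(H": "JIS Roman",             # registered for 7-bit Swedish
    ESC + b"(I": "Katakana",
    ESC + b"$@": "JIS X 0208-1990/1997",  # reserved for JIS C 6226-1978
    ESC + b"$B": "JIS X 0208-1990/1997",  # reserved for JIS X 0208-1983
}


def designated_sets(data):
    """List the character sets selected by each escape sequence in data."""
    found = []
    i = 0
    while i < len(data):
        if data[i:i + 1] == ESC:
            found.append(PRACTICAL_DESIGNATIONS.get(data[i:i + 3], "unknown"))
            i += 3  # skip the (assumed) three-byte escape sequence
        else:
            i += 1
    return found


print(designated_sets(b"\x1b$B0!\x1b(B"))
```
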
> 
> 
> EUC encoding for Japanese
> =========================
> 
> EUC-JP supports:
> - ASCII
> - JIS X 0208-1990/1997
> - Katakana
> - JIS X 0212-1990
> 
> IE and Safari do not support JIS X 0212-1990.
> 
> IBM extensions in NEC and to the extent possible IBM (non-Shift-JIS) positions
> are universally supported (except for Safari, which does not support NEC
> positions).
> 
> 
> Shift-JIS encoding for Japanese
> ===============================
> 
> Shift-JIS supports:
> - ASCII
> - Katakana
> - JIS X 0208-1990/1997
> 
> All browsers furthermore support NEC symbols as well as IBM extensions in
> both NEC and IBM (Shift-JIS) positions.  This is actually Windows-932:
> 
> Shift-JIS < Windows-932
> 
> There are also other extensions, incompatible with Windows-932:
> 
> Shift-JIS < Shift-JIS X0213 < Shift-JIS-2004
> 
> Shift-JIS X0213 adds:
> - JIS X 0213-2000 Plane 1
> - JIS X 0213-2000 Plane 2
> 
> Shift-JIS-2004 adds instead:
> - JIS X 0213-2004 Plane 1
> - JIS X 0213-2000 Plane 2 (same as previous encoding)
> 
> Safari supports the latter, but I have not yet found a MIME charset string
> which triggers it. (Surprisingly and somewhat stupidly, Shift_JIS_X0213-2000
> triggers Windows-932 in Safari, whereas no other browser even supports this
> string.)
> 
> 
> **********
> * Korean *
> **********
> 
> Character sets for Korean characters
> ------------------------------------
> 
> KS X 1001:1992
> 
> Two characters were added in 1998, and another in 2002.  Only Safari still
> does not support the additions from 1998.
> 
> Hangul syllables which are not included in precomposed form can be encoded as
> 8-byte sequences, 2 bytes for each of the following: specific
> `composition' code, initial consonant, medial vowel, final consonant.  This is
> not supported unless noted otherwise below. (Not actually tested for Johab,
> for which it is irrelevant.)
> 
> IE uses an ASCII/KS-Roman hybrid with won instead of backslash (when compared
> to ASCII) and furthermore displays won for &#x5C;.
> 
> 
> ISO-2022 encoding for Korean
> ============================
> 
> ISO-2022-KR supports:
> - ASCII
> - KS X 1001:1992
> 
> Safari displays won instead of backslash (as IE does it for all encodings).
> 
> IE treats 8-bit characters as Windows-949.
> 
> 
> EUC encoding for Korean
> =======================
> 
> EUC-KR supports:
> - ASCII
> - KS X 1001:1992
> 
> Firefox supports 8-byte Hangul encoding.
> 
> Only Safari does not support the Microsoft UHC extension (which adds all
> missing precomposed hangul).  The combination is also known as Windows-949.
> 
> Only Safari supports the Mac-specific HangulTalk extensions.
> 
> EUC-KR < Windows-949
> EUC-KR < HangulTalk
> 
> 
> Johab encoding for Korean
> =========================
> 
> Johab supports:
> - ASCII
> - KS X 1001:1992 (non-hangul)
> - All possible hangul (including those in KS X 1001:1992)
> 
> This encoding contains the same characters as Windows-949, but arranged more
> systematically.  Unfortunately, the encoding is not compatible with EUC-KR.
> 
> Opera does not support Johab.  Safari does not render my test page at all.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 3 June 2009 15:19:05 UTC