[whatwg/encoding] Confusion between Big5 and Big5-HKSCS encodings (#75) from Bruno Haible on 2016-10-03 (public-webapps-github@w3.org from October 2016)

From: Bruno Haible <notifications@github.com>
Date: Mon, 03 Oct 2016 15:13:37 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Message-ID: <whatwg/encoding/issues/75@github.com>

The current draft maps the labels "big5" and "big5-hkscs" to a single encoding, and the mapping table that it uses (index-big5.txt) does not match either of the widely used mapping tables for BIG5 and BIG5-HKSCS.

In detail:

Glibc and other software consider BIG5 and BIG5-HKSCS to be different.
BIG5 is mainly used in Taiwan and has evolved over time without a formal standard. Microsoft is nonetheless a strong player in this area, since it provides the Windows CP950 encoding.
BIG5-HKSCS is mainly used in Hong Kong and evolves through regular updates of the standard. The current version, BIG5-HKSCS:2008, is published at http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/e_hkscs_2008.pdf .

The mapping in index-big5.txt differs from the common BIG5 mapping (Microsoft CP950): it adds 5085 characters, at positions 0x8862, 0x8864, 0x88A3, 0x88A5, 0x8740..0xA0FE, 0xA3C0..0xA3E0, 0xC6A1..0xC8FE, 0xFA40..0xFEFE.

The mapping in index-big5.txt differs from the one in BIG5-HKSCS:2008:
It has additional mappings for 0x8E69, 0x8E6F, 0x8E7E, 0x8EAB, 0x8EB4, 0x8ECD, 0x8ED0, 0x8F57, 0x8F69, 0x8F6E, 0x8FCB..0x8FCC, 0x8FFE, 0x906D, 0x907A, 0x90DC, 0x90F1, 0x91BF, 0x9244, 0x92AF..0x92B2, 0x92C8, 0x92D1, 0x9447, 0x94CA, 0x95D9, 0x9644, 0x96ED, 0x96FC, 0x9B76, 0x9B78, 0x9B7B, 0x9BC6, 0x9BDE, 0x9BEC, 0x9BF6, 0x9C42, 0x9C53, 0x9C62, 0x9C68, 0x9C6B, 0x9C77, 0x9CBC, 0x9CBD, 0x9CD0, 0x9D57, 0x9D5A, 0x9DC4, 0x9EA9, 0x9EEF, 0x9EFD, 0x9F60, 0x9F66, 0x9FCB, 0x9FD8, 0xA063, 0xA077, 0xA0D5, 0xA0DF, 0xA0E4, 0xA15A, 0xA1C3, 0xA1C5, 0xA1FE, 0xA240, 0xA2CC, 0xA2CE, 0xA3C0..0xA3E1, 0xC6CF, 0xC6D3, 0xC6D5, 0xC6D7, 0xC6DE..0xC6DF, 0xFA5F, 0xFA66, 0xFABD, 0xFAC5, 0xFAD5, 0xFB48, 0xFBB8, 0xFBF3, 0xFBF9, 0xFC4F, 0xFC6C, 0xFCB9, 0xFCE2, 0xFCF1, 0xFDB7, 0xFDB8, 0xFDBB, 0xFDF1, 0xFE52, 0xFE6F, 0xFEAA, 0xFEDD,
and differs in the mappings of 0xA145, 0xA14E, 0xA1C2, 0xA1E3, 0xA1F2..0xA1F3, 0xA241..0xA242, 0xA244, 0xA246..0xA247.
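
For readers who want to reproduce this kind of comparison, here is a minimal sketch of how two mapping tables could be diffed and the result collapsed into `X..Y` ranges like the ones above. The helper names `diff_tables` and `to_ranges` are hypothetical, and the two tables are tiny toy dictionaries with placeholder values, not the real index-big5.txt or BIG5-HKSCS:2008 data. It also treats consecutive integers as one range, which ignores the gaps between valid Big5 trail bytes.

```python
def diff_tables(a, b):
    """Compare two dicts mapping a two-byte code (int) to a Unicode string.

    Returns (only_in_a, conflicts): codes that a maps but b does not,
    and codes that both map, but to different values.
    """
    only_in_a = sorted(code for code in a if code not in b)
    conflicts = sorted(code for code in a if code in b and a[code] != b[code])
    return only_in_a, conflicts


def to_ranges(codes):
    """Collapse a sorted list of integer codes into '0xAAAA..0xBBBB' strings."""
    spans = []
    start = prev = None
    for code in codes:
        if start is None:
            start = prev = code
        elif code == prev + 1:
            prev = code  # extend the current run
        else:
            spans.append((start, prev))
            start = prev = code
    if start is not None:
        spans.append((start, prev))
    return [
        f"0x{s:04X}" if s == e else f"0x{s:04X}..0x{e:04X}"
        for s, e in spans
    ]


# Toy stand-ins for the two tables; the values are placeholders.
merged = {0x8FCB: "A", 0x8FCC: "B", 0x8FFE: "C", 0xA145: "x", 0xA14E: "y"}
hkscs = {0xA145: "z", 0xA14E: "y"}

extra, conflicts = diff_tables(merged, hkscs)
print(to_ranges(extra))      # ['0x8FCB..0x8FCC', '0x8FFE']
print(to_ranges(conflicts))  # ['0xA145']
```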

For details about the mapping tables, see
http://haible.de/bruno/charsets/conversion-tables/index.html
http://haible.de/bruno/charsets/conversion-tables/Big5.html
http://haible.de/bruno/charsets/conversion-tables/BIG5-HKSCS.html

In summary, it looks like this encoding is a clever merge of CP950 and BIG5-HKSCS, giving priority to the CP950 mapping value in the 11 positions where the two conflict. This is very good for converting BIG5/BIG5-HKSCS texts of either kind to Unicode (the "decoder" part). However, when the "encoder" is used, it will likely generate HKSCS code points, which CP950 cannot represent. What should Windows users who rely on CP950 do?
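
The asymmetry can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `merge_for_decoding` and `build_encoder` are hypothetical helpers, the encoder is a naive inverse of the merged table (not the algorithm in the draft), and the three table entries are illustrative stand-ins that should be checked against the real tables before being relied on.

```python
def merge_for_decoding(cp950, hkscs):
    """Merge two decode tables; the CP950 value wins where both map a code."""
    merged = dict(hkscs)
    merged.update(cp950)
    return merged


def build_encoder(decode_table):
    """Naive inverse of a decode table (assumption: lowest code wins)."""
    encoder = {}
    for code in sorted(decode_table):
        encoder.setdefault(decode_table[code], code)
    return encoder


# Illustrative entries only, not verified copies of the real tables.
cp950 = {0xA145: "\u2027", 0xA440: "\u4e00"}
hkscs = {0xA145: "\u2022", 0xA440: "\u4e00", 0x8862: "\u00ca\u0304"}

merged = merge_for_decoding(cp950, hkscs)
encoder = build_encoder(merged)

# Decoding: the conflicting position takes the CP950 value.
print(merged[0xA145] == cp950[0xA145])   # True

# Encoding: an HKSCS-only character yields a code that CP950 lacks.
code = encoder["\u00ca\u0304"]
print(hex(code), code in cp950)          # 0x8862 False

# The HKSCS value that lost the conflict can no longer be encoded at all.
print("\u2022" in encoder)               # False
```

The decoder accepts input of either kind, but output produced by the inverse table is only guaranteed to be readable by another implementation of the merged encoding.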


-- 
https://github.com/whatwg/encoding/issues/75

Received on Monday, 3 October 2016 22:14:49 UTC