[whatwg] Encoding: big5 and big5-hkscs from Øistein E. Andersen on 2012-04-07 (public-whatwg-archive@w3.org from April 2012)

From: Øistein E. Andersen <liszt@coq.no>
Date: Sat, 7 Apr 2012 15:04:55 +0100
Message-ID: <F4845274-5A40-460F-8A9C-8301A415D80B@coq.no>
On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:

> So, <http://people.opera.com/philipj/2012/04/06/big5-foolip.txt> is the  
> mapping I suggest, with 18594 defined mappings and 1188 U+FFFD.

(Second byte 0xA1 appears as 0x7F in the mapping file.)

Your table is very similar to my idea of an ideal Big5-HKSCS decoder, which follows Unihan, official Big5-HKSCS mappings and real implementations closely.  My version is documented in detail at <http://coq.no/character-tables/chinese-traditional/en> and summarised at <http://coq.no/character-tables/charset5/en>.

Only 2 mappings differ, viz., C6CD (U+2F33 v. U+5E7A) and F9FE (U+FFED v. U+2593), which is quite reassuring given that we have worked independently and used somewhat different approaches.  The two divergent mappings and related issues are discussed below.

***

> [in summary:]
> C6CF => U+5EF4 廴 [v.] U+2F35 ⼵
> C6D3 => U+65E0 无 [v.] U+2F46 ⽆
> C6D5 => U+7676 癶 [v.] U+2F68 ⽨
> C6D7 => U+96B6 隶 [v.] U+2FAA ⾪
> C6DE => U+3003 〃 [v.] U+F6EE [PUA]
> C6DF => U+4EDD 仝 [v.] U+F6EF [PUA]

To this list can be added:

C6CD => U+5E7A v. U+2F33

These seven characters are all part of the E-Ten 1 extension [1] to Big5, which is included in all implementations of Big5 (with or without HK extensions) that I have come across in browsers.  The official Big5-HKSCS table includes the E-Ten 1 extension as well, but the seven characters listed above appear elsewhere in the HKSCS extensions and are handled specially to avoid encoding the same character more than once.

	[1] <http://coq.no/character-tables/eten1.pdf>  <http://coq.no/character-tables/eten1.js>

Five are Kangxi radicals and encoded twice in Unicode (once as radicals, once as normal Han characters).  The official Big5-HKSCS table maps C6CD to the Unicode Kangxi radical U+2F33 and does not list the remaining four codepoints at all.  U+2F33 is the only Unicode Kangxi radical included in the official Big5-HKSCS table.  As you have noticed, some Big5-HKSCS implementations follow this idea for the remaining four as well.  It seems better to follow non-HK Big5 implementations here and map all five to normal Unicode Han characters.

For the last two characters, U+3003 and U+4EDD, there is only one possible Unicode mapping, so duplicates are impossible to avoid (without using PUA characters).  The official Big5-HKSCS table does not map C6DE and C6DF to anything.

Suggested change:  map C6CD to U+5E7A.
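
The relationship between the five Kangxi radicals and the normal Han characters can be checked mechanically, since Unicode gives each Kangxi radical a compatibility decomposition to the corresponding unified ideograph. The following is only a minimal sketch using Python's standard unicodedata module; the names PAIRS and unify are made up for illustration and are not part of any mapping file.

```python
import unicodedata

# Big5 code -> (Kangxi radical, normal Han character), as listed above.
PAIRS = {
    0xC6CD: ("\u2F33", "\u5E7A"),
    0xC6CF: ("\u2F35", "\u5EF4"),
    0xC6D3: ("\u2F46", "\u65E0"),
    0xC6D5: ("\u2F68", "\u7676"),
    0xC6D7: ("\u2FAA", "\u96B6"),
}

def unify(ch):
    """Return the NFKC form of ch (hypothetical helper name).

    For a Kangxi radical this is the corresponding unified ideograph.
    """
    return unicodedata.normalize("NFKC", ch)

for code, (radical, han) in sorted(PAIRS.items()):
    status = "ok" if unify(radical) == han else "MISMATCH"
    print(f"{code:04X}: U+{ord(radical):04X} -> U+{ord(unify(radical)):04X} {status}")
```

This says nothing about which mapping a decoder should use; it only confirms that each radical and its Han counterpart are the same character from Unicode's compatibility point of view.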

***

On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:

> Also, a single mapping fails the Big5-contra[di]ction test:
> 
> F9FE =>
> opera-hk: U+FFED ￭
> firefox: U+2593 ▓
> chrome: U+2593 ▓
> firefox-hk: U+2593 ▓
> opera: U+2593 ▓
> chrome-hk: U+FFED ￭
> internetexplorer: U+2593 ▓
> hkscs-2008: <U+FFED> ￭
> 
> I'd say that we should go with U+FFED here, since that's what the [HKSCS-2008] spec  
> says and it's visually close anyway.

Given that the goal is to define a unified Big5 (non-HK) and Big5-HKSCS encoding and that this seems to be a case of the HK standard going against everything and everyone else, perhaps more weight should be given to existing specifications and (non-HK-specific) implementations.

Suggested change:  map F9FE to U+2593
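
To make the weighting concrete, the results quoted above can simply be tallied, once over all sources and once over the non-HK-specific implementations only. This is a rough sketch and not a proposed decision procedure; the function names and the name-based filter (treating names ending in "-hk" or starting with "hkscs" as HK-specific) are assumptions made for illustration.

```python
from collections import Counter

# Decoding of F9FE per source, copied from the quoted list.
F9FE = {
    "opera-hk": 0xFFED,
    "firefox": 0x2593,
    "chrome": 0x2593,
    "firefox-hk": 0x2593,
    "opera": 0x2593,
    "chrome-hk": 0xFFED,
    "internetexplorer": 0x2593,
    "hkscs-2008": 0xFFED,
}

def is_non_hk(name):
    """Hypothetical filter: keep sources that are not HK-specific."""
    return not (name.endswith("-hk") or name.startswith("hkscs"))

def tally(results, only=None):
    """Count how many sources decode to each Unicode code point."""
    names = [n for n in results if only is None or only(n)]
    return Counter(results[n] for n in names)

all_votes = tally(F9FE)
non_hk_votes = tally(F9FE, only=is_non_hk)

for label, votes in (("all", all_votes), ("non-HK", non_hk_votes)):
    summary = ", ".join(f"U+{cp:04X}: {n}" for cp, n in votes.most_common())
    print(f"{label}: {summary}")
```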

***

Duplicates and reverse mappings:

big5-foolip.txt currently provides two different codepoints for 100 Unicode characters.

6 are mentioned above.  84 result from compatibility mappings defined in the official HKSCS-2008 specification, cf. [2].  This leaves 10:

	[2] <http://coq.no/character-tables/o-h-comp.pdf> <http://coq.no/character-tables/o-h-comp.js>

U+5341 (十, 'ten') and U+5345 (卅, 'thirty') are encoded twice in Big5, once as numerals and once as standard Han characters.  (U+5344 廿 'twenty' is only encoded once in Big5, but was added to HKSCS and is now one of the 84 compatibility mappings.)

According to Lunde, the four codepoints F9FA--F9FD (ETen-2 extension) are supposed to encode double-stroked circle segments which appear to be missing from Unicode (I am not sure whether they have ever been proposed for inclusion).  They are currently mapped to single-stroked circle segments instead, but those are already encoded at A27E, A2A1--A2A3 (original Big5).

The four codepoints F9F9, F9E9, F9EB, F9EA (ETen-2 extension) encode line-drawing characters with a double horizontal line.  These appear to be encoded at A2A4--A2A7 (original Big5) already, and it is not clear to me whether the characters at A2A4--A2A7 are supposed to be different or whether ETen chose to encode them again to have a full set of line-drawing characters in one location.

Suggested reverse mappings:

C6CF <= U+5EF4
C6D3 <= U+65E0
C6D5 <= U+7676
C6D7 <= U+96B6
C6DE <= U+3003
C6DF <= U+4EDD
C6CD <= U+5E7A (if the mapping of C6CD is changed)

Rationale:  Only these mappings will work for non-HK Big5 implementations, and these characters appear to be important not only in Hong Kong.

A451 <= U+5341
A4CA <= U+5345

A27E <= U+256D
A2A1 <= U+256E
A2A3 <= U+256F
A2A2 <= U+2570

F9F9 <= U+2550
F9E9 <= U+255E
F9EB <= U+2561
F9EA <= U+256A

(The 84 compatibility mappings should obviously only be used to decode and never as reverse mappings.)
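
The reverse mappings can be derived from the decoder table by finding the Unicode characters that have more than one codepoint and then pinning a preferred codepoint for each. The sketch below uses a tiny made-up excerpt rather than the real table. The rows for A2A4 and A2CC are assumptions taken from the original Big5 layout and should be checked against the mapping file, and the function names and the decode_only parameter (meant for the 84 compatibility mappings) are hypothetical.

```python
from collections import defaultdict

# Illustrative excerpt only: Big5 code -> Unicode code point.
# The A2A4 and A2CC rows are assumed, not taken from big5-foolip.txt.
DECODE = {
    0xA27E: 0x256D,
    0xA2A4: 0x2550,
    0xF9F9: 0x2550,
    0xA2CC: 0x5341,
    0xA451: 0x5341,
}

# Preferred codepoints, from the suggested reverse mappings above.
PREFERRED = {0x2550: 0xF9F9, 0x5341: 0xA451, 0x256D: 0xA27E}

def find_duplicates(decode):
    """Return {Unicode code point: [Big5 codes]} for characters encoded twice or more."""
    by_char = defaultdict(list)
    for code, char in decode.items():
        by_char[char].append(code)
    return {char: sorted(codes) for char, codes in by_char.items() if len(codes) > 1}

def build_encoder(decode, preferred, decode_only=frozenset()):
    """Build Unicode -> Big5, skipping decode-only codes and honouring preferences."""
    encoder = {}
    for code in sorted(decode):
        if code not in decode_only:
            encoder.setdefault(decode[code], code)  # lowest code wins by default
    for char, code in preferred.items():
        if decode.get(code) != char:
            raise ValueError(f"{code:04X} does not decode to U+{char:04X}")
        encoder[char] = code  # explicit preference overrides the default
    return encoder

duplicates = find_duplicates(DECODE)
encoder = build_encoder(DECODE, PREFERRED)
for char, codes in sorted(duplicates.items()):
    listed = ", ".join(f"{c:04X}" for c in codes)
    print(f"U+{char:04X}: {listed} -> encode as {encoder[char]:04X}")
```

Checking that each preferred codepoint really decodes to the character in question keeps the encoder and decoder round-trip safe for those entries.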

***

On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:

> There are 29 mappings to U+003F (?) in IE that no other browser has.

Are you referring to the ones at A3E2--A3FE?  IE decodes (or used to decode) the control pictures at A3C0--A3E0 as C0 control characters in plain text, but replace(s) them with question marks in HTML.  It looks like this treatment has been extended to the remaining A3xx codepoints (after the euro), perhaps without a good reason.

> The remaining mappings are to PUA or U+FFFD in all browsers [...]. Mapping  
> these to U+FFFD unless anyone finds pages using these byte sequences seems  
> the only sane option.

Agreed.  Do any of these ever render in a meaningful way (e.g., in IE on a Windows machine with HK locale and appropriate HKSCS PUA fonts)?

The following 22 codepoints are 'reserved for backwards compatibility' in the HKSCS-2008 standard, but no Unicode mappings are provided:

9EAC
9EC4
9EF4
9F4E
9FAD
9FB1
9FC0
9FC8
9FDA
9FE6
9FEA
9FEF
A054
A057
A05A
A062
A072
A0A5
A0AD
A0AF
A0D3
A0E1

I assume some systems will render at least these as potentially meaningful Han characters.
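
For a decoder that follows the U+FFFD approach agreed on above, the 22 reserved codepoints would simply fall through to the replacement character. A minimal sketch, assuming the decoder table is a plain dictionary from two-byte code to Unicode code point; the names RESERVED and decode_pair are made up for illustration.

```python
REPLACEMENT = "\uFFFD"

# The 22 codepoints 'reserved for backwards compatibility' listed above.
RESERVED = frozenset({
    0x9EAC, 0x9EC4, 0x9EF4, 0x9F4E, 0x9FAD, 0x9FB1, 0x9FC0, 0x9FC8,
    0x9FDA, 0x9FE6, 0x9FEA, 0x9FEF, 0xA054, 0xA057, 0xA05A, 0xA062,
    0xA072, 0xA0A5, 0xA0AD, 0xA0AF, 0xA0D3, 0xA0E1,
})

def decode_pair(lead, trail, table):
    """Decode one lead/trail byte pair; unmapped or reserved codes give U+FFFD."""
    code = (lead << 8) | trail
    if code in RESERVED:
        return REPLACEMENT
    cp = table.get(code)
    return chr(cp) if cp is not None else REPLACEMENT

# Example with a one-entry table using the suggested C6CD mapping.
sample_table = {0xC6CD: 0x5E7A}
print(decode_pair(0xC6, 0xCD, sample_table))        # mapped code
print(decode_pair(0x9E, 0xAC, sample_table) == REPLACEMENT)  # reserved code
```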

-- 
Øistein E. Andersen
Received on Saturday, 7 April 2012 07:04:55 UTC