[whatwg] Encoding: big5 and big5-hkscs from Øistein E. Andersen on 2012-04-09 (public-whatwg-archive@w3.org from April 2012)

From: Øistein E. Andersen <liszt@coq.no>
Date: Mon, 9 Apr 2012 02:08:20 +0100
Message-ID: <12412C9A-94B1-4F36-8FAA-27DE8BD75867@coq.no>
On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:

> On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no> wrote:
> 
>> [...]
>> 	[1] <http://coq.no/character-tables/eten1.pdf>  <http://coq.no/character-tables/eten1.js>
> 
> What is the source for the mappings in eten1.pdf?

There is no authoritative source for those Unicode mappings, I am afraid.  Lunde's CJKV opus was my main source, but it does not include Unicode mappings.  Browser implementations and the HKSCS standard were also taken into account, but the final mapping does not reflect any one source.

> I assume that E-Ten was originally just some Big5 fonts with no defined mappings to Unicode?

Yes, I think so.

>> Suggested change:  map C6CD to U+5E7A.
> 
> These are the existing mappings:
> 
> C6CD =>
> opera-hk: U+2F33 ⼳
> firefox: U+5E7A 幺
> chrome: U+F6DD ?
> firefox-hk: U+5E7A 幺
> opera: U+2F33 ⼳
> chrome-hk: U+2F33 ⼳
> internetexplorer: U+F6DD ?
> hkscs-2008: <U+2F33> ⼳
> 
> At least on the Web, this isn't a question of HK vs non-HK mappings. Other than Firefox, which (de-facto) specs or implementations use U+5E7A?

I have now had a closer look at my notes (<http://coq.no/character-tables/chinese-traditional/en>). My argument for U+5E7A goes as follows:

Of the 214 Kangxi radicals, 186 appear (as normal Han characters) in CNS 11643 Planes 1 or 2, whereas 25 appear in Plane 3 and 3 are missing altogether.  Big5 only covers Planes 1 and 2, which means that 28 Kangxi radicals (which may be rare in running text, but are nevertheless important) are missing.  The E-Ten 1 extension encodes 25 of the missing radicals in the range C6BF--C6D7.  Unlike CNS 11643 and Unicode, Big5 does not encode radicals twice (as radicals and normal characters).  This means that Big5 with the E-Ten 1 extension contains 211 of the 214 Kangxi radicals, all mapped to normal Han characters, and no codepoints mapped to Unicode Kangxi Radicals in the range U+2F00--U+2FD5.

In summary:  although E-Ten 1 was not defined in terms of Unicode, it is clear that the 25 radicals were all meant to map to normal Han characters, not to the special radical characters found in CNS 11643 and Unicode.
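
The counts above can be checked mechanically.  What follows is a minimal sketch, not part of the original message, using Python's standard unicodedata module.  It assumes that the Unicode database bundled with the interpreter gives every character in the Kangxi Radicals block a compatibility mapping to a normal Han character; the constant and function names are made up for illustration.

```python
import unicodedata

# Counts quoted in the message (CNS 11643 planes); not derived here.
IN_PLANES_1_2 = 186
IN_PLANE_3 = 25
MISSING_FROM_CNS = 3

# E-Ten 1 radical range quoted in the message: C6BF--C6D7.
ETEN1_RADICAL_SLOTS = 0xC6D7 - 0xC6BF + 1

# The Unicode Kangxi Radicals block, U+2F00--U+2FD5.
radicals = [chr(cp) for cp in range(0x2F00, 0x2FD5 + 1)]


def unified(radical):
    """Return the NFKC form of a Kangxi Radical character."""
    return unicodedata.normalize("NFKC", radical)


total = len(radicals)
missing_from_big5 = IN_PLANE_3 + MISSING_FROM_CNS
with_eten1 = IN_PLANES_1_2 + ETEN1_RADICAL_SLOTS

print(total, missing_from_big5, with_eten1)   # 214 28 211
print("U+%04X" % ord(unified("\u2f33")))      # U+5E7A
```

The arithmetic agrees with the message: 186 + 25 + 3 = 214, and the 25 slots in C6BF--C6D7 bring Big5 with E-Ten 1 to 211 radicals.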

Enter HKSCS.  20 of the E-Ten 1 Kangxi radical mappings (along with the rest of E-Ten 1 and E-Ten 2, or almost) are adopted, but the remaining 5 are instead given new codepoints elsewhere.  Whatever the reason be, 4 of the 5 unused E-Ten positions are simply left undefined in the HKSCS standard, which is not much of a problem for a unified HK/non-HK Big5 encoding.  Unfortunately, the position C6CD was not left undefined, but instead mapped to U+2F33 (⼳), the Unicode Kangxi Radical version of U+5E7A (幺), thus introducing not only the only Unicode Kangxi Radical into the HKSCS standard, but also a Unicode mapping that is incompatible with previous Big5 versions.  I wish I knew why.

> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but it's not the only hanzi in HKSCS-2008 that normalizes into something else:
> 
> 8BC3 => <U+2F878> ? => <U+5C6E> ?
> 8BF8 => <U+F907> ? => <U+9F9C> ?
> 8EFD => <U+2F994> ? => <U+82B3> ?
> 8FA8 => <U+2F9B2> ? => <U+456B> ?
> 8FF0 => <U+2F9D4> ? => <U+8CAB> ?
> C6CD => <U+2F33> ? => <U+5E7A> ?
> 957A => <U+2F9BC> ? => <U+8728> ?
> 9874 => <U+2F825> ? => <U+52C7> ?
> 9AC8 => <U+2F83B> ? => <U+5406> ?
> 9C52 => <U+2F8CD> ? => <U+6649> ?
> A047 => <U+2F840> ? => <U+54A2> ?
> FC48 => <U+2F894> ? => <U+5F22> ?
> FC77 => <U+2F8A6> ? => <U+6148> ?

The other pairs all contain characters that look slightly different, whereas U+5E7A and U+2F33 look the same (and, I believe, are supposed to look the same), the only difference being that the former is a normal Han character whereas the latter carries the additional semantics of being a Kangxi radical.
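
There is also a mechanical difference between C6CD and the other entries, offered here as a side observation rather than as part of the original argument.  In the Unicode database, CJK compatibility ideographs such as U+F907 carry a canonical decomposition (so NFC already replaces them), whereas the Kangxi radical U+2F33 carries only a <compat> decomposition (so only NFKC replaces it).  A minimal sketch using Python's unicodedata module, with three pairs taken from the quoted list; the PAIRS and describe names are illustrative only:

```python
import unicodedata

# Three of the pairs from the quoted list: source -> normalized form.
PAIRS = {
    0xF907: 0x9F9C,
    0x2F878: 0x5C6E,
    0x2F33: 0x5E7A,
}


def describe(code_point):
    """Report how the Unicode database decomposes one code point."""
    ch = chr(code_point)
    return {
        "decomposition": unicodedata.decomposition(ch),
        "nfc": ord(unicodedata.normalize("NFC", ch)),
        "nfkc": ord(unicodedata.normalize("NFKC", ch)),
    }


for source, target in PAIRS.items():
    info = describe(source)
    print(
        "U+%04X: decomposition=%r NFC=U+%04X NFKC=U+%04X (list says U+%04X)"
        % (source, info["decomposition"], info["nfc"], info["nfkc"], target)
    )
```

U+2F33 should be the only one of the three that survives NFC unchanged.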

> I'm not sure what the conclusion is...

I am not entirely sure either.  It seems clear that the mapping from C6CD to U+2F33 makes no sense for non-HKSCS Big5 (which does not encode U+5E7A anywhere else), but it does not seem to make much sense for Big5-HKSCS either, which suggests that I might be missing something. 

>> On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:
>> 
>>> Also, a single mapping fails the Big5-contra[di]ction test:
>>> 
>>> F9FE =>
>>> opera-hk: U+FFED ■
>>> firefox: U+2593 ▓
>>> chrome: U+2593 ▓
>>> firefox-hk: U+2593 ▓
>>> opera: U+2593 ▓
>>> chrome-hk: U+FFED ■
>>> internetexplorer: U+2593 ▓
>>> hkscs-2008: <U+FFED> ■
>>> 
>>> I'd say that we should go with U+FFED here, since that's what the [HKSCS-2008] spec
>>> says and it's visually close anyway.
>> 
>> Given that the goal is to define a unified Big5 (non-HK) and Big5-HKSCS encoding and that this seems to be a case of the HK standard going against everything and everyone else, perhaps more weight should be given to existing specifications and (non-HK-specific) implementations.
>> 
>> Suggested change:  map F9FE to U+2593
> 
> This is the only mapping where IE maps something other than PUA or "?" that my mapping doesn't agree on, so I don't object to changing it. Still, it would be very interesting to know why HKSCS-2008 changed it, do you know?

No, I am afraid not.  I have been wondering as well, but I have not been able to find an explanation.

Lunde (if I remember correctly, 1st Edn) and Kano's 'Developing International Software' (1st Edn, 1995) both show something like U+2593, but it could of course be that popular non-Unicode (HK) Big5 fonts had glyphs more like U+FFED, which would make the HKSCS-2008 mapping less surprising.  Do let me know if you discover any information on this.

>> Duplicates and reverse mappings:
>> 
>> [...]
> 
> [...] it clearly needs to be defined what to do for these 100 code points that have multiple mappings to Big5. I extended my Python script to find these 100 duplicates and to check what Python did for 'big5', falling back to 'big5-hkscs'. This is what it produced:
> 
> [...]
> 
> These are the ones where you (Øistein) disagree:
> 
>> C6CF <= U+5EF4
>> C6D3 <= U+65E0
>> C6D5 <= U+7676
>> C6D7 <= U+96B6
> 
> AFAICT this has nothing to do with compatibility mappings, so what's the reason for this?

As I wrote, '[o]nly these mappings will work for non-HK Big5 implementations.'  My reasoning was that a random Big5 implementation would be more likely to include the E-Ten 1 extension than the HKSCS extension.  On the other hand, these codepoints could be less than ideal if major Big5-HKSCS implementations follow the standard strictly and map to nothing.

>> F9E9 <= U+255E
>> F9EA <= U+256A
>> F9EB <= U+2561
>> F9F9 <= U+2550
> 
> Python's big5-hkscs agrees, but Python's big5 does this instead:
> 
> A2A5 <= U+255E
> A2A6 <= U+256A
> A2A7 <= U+2561
> A2A4 <= U+2550
> 
> It seems safer to go with the big5 mappings, but checking what browsers do would be helpful.

Does this imply that Python's big5 (non-HK) implementation does not include the corresponding E-Ten 2 (forward) mappings for decoding either?

F9E9 => U+255E
F9EA => U+256A
F9EB => U+2561
F9F9 => U+2550

My reasoning was that the Unicode characters are line-drawing characters, which is indisputably the case for the Big5 E-Ten 2 characters (F9xx) as well, whereas the intended use for the Big5 characters at A2xx is less clear (looking at pre-Unicode Big5 fonts might be useful).  An additional reason for my suggestion is that the CP950 table in Kano's book leaves the four A2xx codepoints blank, which suggests that the F9xx positions are (or have been considered) preferable.
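
The question about Python's forward mappings can be probed directly.  The following is a minimal sketch, assuming a CPython installation that ships the standard 'big5' and 'big5hkscs' codecs; the helper names are made up for illustration.  No results are quoted here, since they depend on the codec tables in the interpreter being used.

```python
def try_decode(codec, hex_bytes):
    """Decode a byte sequence given as hex; None if the codec rejects it."""
    try:
        return bytes.fromhex(hex_bytes).decode(codec)
    except UnicodeDecodeError:
        return None


def try_encode(codec, code_point):
    """Encode one code point; return upper-case hex, or None if unmapped."""
    try:
        return chr(code_point).encode(codec).hex().upper()
    except UnicodeEncodeError:
        return None


def show(text):
    """Format a decoded string as U+XXXX values, or 'undefined' for None."""
    if text is None:
        return "undefined"
    return " ".join("U+%04X" % ord(ch) for ch in text)


# Byte sequences and code points discussed in this part of the thread.
FORWARD = ["F9E9", "F9EA", "F9EB", "F9F9", "A2A5", "A2A6", "A2A7", "A2A4"]
REVERSE = [0x255E, 0x256A, 0x2561, 0x2550]

for codec in ("big5", "big5hkscs"):
    for hex_bytes in FORWARD:
        print("%-9s %s => %s" % (codec, hex_bytes, show(try_decode(codec, hex_bytes))))
    for code_point in REVERSE:
        encoded = try_encode(codec, code_point) or "undefined"
        print("%-9s %s <= U+%04X" % (codec, encoded, code_point))
```

If the F9xx rows print as undefined under 'big5', that codec lacks the E-Ten 2 forward mappings as well; if they decode, only the reverse direction prefers A2xx.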

> How about the rest of my generated list, is that fine?

The remaining 84 characters are handled correctly, yes.  Here is the original HKSCS list: 

Compatibility
point of HKSCS       Unified With
==============       ============
8E69                 BAE6
8E6F                 EDCA
8E7E                 A261
8EAB                 BAFC
8EB4                 BFA6
8ECD                 AACC
8ED0                 BFAE
8F57                 B5D7
8F69                 E3C8
8F6E                 DB79
8FCB                 BFCC
8FCC                 A0D4
8FFE                 B05F
906D                 B3A3
907A                 F9D7
90DC                 C052
90F1                 C554
91BF                 F1E3
9244                 9242
92AF                 A259
92B0                 A25A
92B1                 A25C
92B2                 A25B
92C8                 A05F
92D1                 E6AB
9447                 D256
94CA                 E6D0
95D9                 CA52
9644                 9CE4
96ED                 96EE
96FC                 E959
9B76                 EFF9
9B78                 C5F7
9B7B                 F5E8
9BC6                 E8CD
9BDE                 D0C0
9BEC                 FD64
9BF6                 BF47
9C42                 EBC9
9C53                 CDE7
9C62                 C0E7
9C68                 DC52
9C6B                 F86D
9C77                 DB5D
9CBC                 C95C
9CBD                 AFB0
9CD0                 D4D1
9D57                 E07C
9D5A                 B5AE
9DC4                 A9E4
9EA9                 ABEC
9EEF                 DECD
9EFD                 C9FC
9F60                 F9C4
9F66                 91BE
9FCB                 B9B0
9FD8                 9361
A063                 8FB6
A077                 A9F0
A0D5                 947A
A0DF                 DE72
A0E4                 9455
FA5F                 ADC5
FA66                 B0B0
FABD                 A55D
FAC5                 A2CD
FAD5                 ADEB
FB48                 9DEF
FBB8                 B440
FBF3                 C9DB
FBF9                 9DFB
FC4F                 D8F4
FC6C                 A0DC
FCB9                 BCB5
FCE2                 B4B8
FCF1                 A7FB
FDB7                 CB58
FDB8                 B4FC
FDBB                 B4E4
FDF1                 B54E
FE52                 9975
FE6F                 B7EC
FEAA                 A260
FEDD                 CFF1
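
One way to use such a list, sketched here under the assumption that an encoder should never emit a compatibility point and should emit the code it is unified with instead, is to load the two columns into a lookup table.  The sample below contains only the header and the first three rows; the function names are hypothetical.

```python
SAMPLE = """\
Compatibility
point of HKSCS       Unified With
==============       ============
8E69                 BAE6
8E6F                 EDCA
8E7E                 A261
"""


def parse_unified(text):
    """Map each compatibility point to the code it is unified with."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # header text or blank line
        try:
            compat, unified_with = (int(field, 16) for field in fields)
        except ValueError:
            continue  # the ==== separator row is not hexadecimal
        table[compat] = unified_with
    return table


def canonical_code(code, table):
    """Return the code an encoder would emit under the stated assumption."""
    return table.get(code, code)


table = parse_unified(SAMPLE)
for code in (0x8E69, 0xA440):
    print("%04X -> %04X" % (code, canonical_code(code, table)))
```

Codes that do not appear in the left-hand column, such as A440 in the usage loop, pass through unchanged.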

>> On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com> wrote:
>> 
>>> There are 29 mappings to U+003F (?) in IE that no other browser has.
>> 
>> Are you referring to the ones at A3E2--A3FE?  [...]
> 
> Yes, that's the range. I think we should leave these undefined.

Agreed.

> [...]
> 
> I generated <http://people.opera.com/philipj/2012/04/08/big5-undefined-ie.txt> and had a look using various Chinese fonts in Windows 7. It looks like most fonts have a copy of the printable ASCII characters in U+F020 through U+F07E, and what looks like parts of windows-1252 or latin-1 up to U+F0FF.
> 
> Exactly the 22 codepoints you list *are* Han characters in the MingLiu_HKSCS font, see <http://people.opera.com/philipj/2012/04/08/big5-mingliu-hkscs.png>. Presumably they were not in Unicode when HKSCS-2008 was defined, but if they have been added since I think we should simply map them. Unfortunately, I haven't been able to find them by searching by radicals in the Unihan database...

Perhaps they are 'Phantom Ideographs' (Lunde), i.e., 'ideographs whose origins cannot be determined' and which will not be added to Unicode today, in which case there is not much we can do (but such characters are unlikely to see much use anyway).

-- 
Øistein E. Andersen
Received on Sunday, 8 April 2012 18:08:20 UTC