[whatwg] Encoding: big5 and big5-hkscs

On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no>  
wrote:

> On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com>  
> wrote:
>
>> So, <http://people.opera.com/philipj/2012/04/06/big5-foolip.txt> is the
>> mapping I suggest, with 18594 defined mappings and 1188 U+FFFD.
>
> (Second byte 0xA1 appears as 0x7F in the mapping file.)

Oops, I blame Anne's get_bytes() which had an off-by-one error. This is  
the corrected version:

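# RANGE is defined elsewhere in the script; it is assumed here to be
# 0x7E - 0x40 + 1, i.e. the 63 trail bytes in 0x40..0x7E.
RANGE = 0x7E - 0x40 + 1

def get_bytes(index):
     row = 0xFE - 0xA1 + RANGE + 1   # trail bytes per lead byte (157)
     lead = (index // row) + 0x81
     cell = index % row
     # cells 0..RANGE-1 map to 0x40..0x7E, the rest to 0xA1..0xFE
     trail = (cell + 0xA1 - RANGE) if cell >= RANGE else cell + 0x40
     return (lead, trail)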

I've also updated big5-foolip.txt with this fix.
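
For anyone who wants to check the boundary themselves, here is a minimal
sketch of a round-trip test. RANGE = 63 is my assumption (inferred from the
0x40..0x7E trail range), and get_index() is a hypothetical inverse that is
not part of the original script.

```python
# Sketch only: RANGE is an assumed value, inferred from the trail-byte layout.
RANGE = 0x7E - 0x40 + 1          # 63 trail bytes in 0x40..0x7E
ROW = 0xFE - 0xA1 + RANGE + 1    # plus 94 trail bytes in 0xA1..0xFE = 157


def get_bytes(index):
    """Map a table index to a (lead, trail) byte pair."""
    lead = (index // ROW) + 0x81
    cell = index % ROW
    trail = (cell + 0xA1 - RANGE) if cell >= RANGE else cell + 0x40
    return (lead, trail)


def get_index(lead, trail):
    """Hypothetical inverse of get_bytes, used only for round-trip checking."""
    cell = trail - 0x40 if trail <= 0x7E else trail - 0xA1 + RANGE
    return (lead - 0x81) * ROW + cell


# Every index under lead bytes 0x81..0xFE should survive a round trip, and
# no trail byte may land in the unused 0x7F..0xA0 gap.
for index in range((0xFE - 0x81 + 1) * ROW):
    lead, trail = get_bytes(index)
    assert 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    assert get_index(lead, trail) == index
```

With a strict ">" in place of ">=", cell 63 would come out as trail byte
0x7F instead of 0xA1, which is the symptom reported above.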

> Your table is very similar to my idea of an ideal Big5-HKSCS decoder,  
> which follows Unihan, official Big5-HKSCS mappings and real  
> implementations closely.  My version is documented in detail at  
> <http://coq.no/character-tables/chinese-traditional/en> and summarised  
> at <http://coq.no/character-tables/charset5/en>.
>
> Only 2 mappings differ, viz. C6CD (U+2F33 v. U+5E7A) and F9FE (U+FFED v.  
> U+2593), which is quite reassuring given that we have worked  
> independently and used somewhat different approaches.  The two divergent  
> mappings and related issues are discussed below.

Yes, it's very encouraging indeed that we got this close independently and  
with different methods!

>> [in summary:]
>> C6CF => U+5EF4 廴 [v.] U+2F35 ⼵
>> C6D3 => U+65E0 无 [v.] U+2F46 ⽆
>> C6D5 => U+7676 癶 [v.] U+2F68 ⽨
>> C6D7 => U+96B6 隶 [v.] U+2FAA ⾪
>> C6DE => U+3003 〃 [v.] U+F6EE [PUA]
>> C6DF => U+4EDD 仝 [v.] U+F6EF [PUA]
>
> To this list can be added:
>
> C6CD => U+5E7A v. U+2F33
>
> These seven characters are all part of the E-Ten 1 extension [1] to  
> Big5, which is included in all implementations of Big5 (with or without  
> HK extensions) that I have come across in browsers.  The official  
> Big5-HKSCS table includes the E-Ten 1 extension as well, but the seven  
> characters listed above appear elsewhere in the HKSCS extensions and are  
> handled specially to avoid encoding the same character more than once.
>
> 	[1] <http://coq.no/character-tables/eten1.pdf>   
> <http://coq.no/character-tables/eten1.js>

What is the source for the mappings in eten1.pdf? I assume that E-Ten was  
originally just some Big5 fonts with no defined mappings to Unicode?

> Five are Kangxi radicals and encoded twice in Unicode (once as radicals,  
> once as normal Han characters).  The official Big5-HKSCS table maps C6CD  
> to the Unicode Kangxi radical U+2F33 and does not list the remaining  
> four codepoints at all.  U+2F33 is the only Unicode Kangxi radical  
> included in the official Big5-HKSCS table.  As you have noticed, some  
> Big5-HKSCS implementations follow this idea for the remaining four as  
> well.  It seems better to follow non-HK Big5 implementations here and  
> map all five to normal Unicode Han characters.
>
> For the last two characters, U+3003 and U+4EDD, there is only one  
> possible Unicode mapping, so duplicates are impossible to avoid (without  
> using PUA characters).  The official Big5-HKSCS table does not map C6DE  
> and C6DF to anything.
>
> Suggested change:  map C6CD to U+5E7A.

These are the existing mappings:

C6CD =>
opera-hk: U+2F33 ⼳
firefox: U+5E7A 幺
chrome: U+F6DD ?
firefox-hk: U+5E7A 幺
opera: U+2F33 ⼳
chrome-hk: U+2F33 ⼳
internetexplorer: U+F6DD ?
hkscs-2008: <U+2F33> ⼳

At least on the Web, this isn't a question of HK vs non-HK mappings. Other  
than Firefox, which (de-facto) specs or implementations use U+5E7A?

Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but  
it's not the only hanzi in HKSCS-2008 that normalizes into something else:

8BC3 => <U+2F878> ? => <U+5C6E> ?
8BF8 => <U+F907> ? => <U+9F9C> ?
8EFD => <U+2F994> ? => <U+82B3> ?
8FA8 => <U+2F9B2> ? => <U+456B> ?
8FF0 => <U+2F9D4> ? => <U+8CAB> ?
C6CD => <U+2F33> ? => <U+5E7A> ?
957A => <U+2F9BC> ? => <U+8728> ?
9874 => <U+2F825> ? => <U+52C7> ?
9AC8 => <U+2F83B> ? => <U+5406> ?
9C52 => <U+2F8CD> ? => <U+6649> ?
A047 => <U+2F840> ? => <U+54A2> ?
FC48 => <U+2F894> ? => <U+5F22> ?
FC77 => <U+2F8A6> ? => <U+6148> ?

I'm not sure what the conclusion is...
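
The normalization claims in the list above can be re-checked with Python's
unicodedata module. This is only a sketch: PAIRS is copied by hand from the
list, and nfkc_mismatches() is a helper name made up for illustration.

```python
import unicodedata

# HKSCS-2008 code point -> expected NFKC result, copied from the list above.
PAIRS = {
    0x2F878: 0x5C6E, 0xF907: 0x9F9C, 0x2F994: 0x82B3, 0x2F9B2: 0x456B,
    0x2F9D4: 0x8CAB, 0x2F33: 0x5E7A, 0x2F9BC: 0x8728, 0x2F825: 0x52C7,
    0x2F83B: 0x5406, 0x2F8CD: 0x6649, 0x2F840: 0x54A2, 0x2F894: 0x5F22,
    0x2F8A6: 0x6148,
}


def nfkc_mismatches(pairs):
    """Return entries whose NFKC form is not the expected single code point."""
    bad = {}
    for source, expected in pairs.items():
        normalized = unicodedata.normalize("NFKC", chr(source))
        if normalized != chr(expected):
            bad[source] = normalized
    return bad


# Report anything that does not normalize as listed.
for source, got in nfkc_mismatches(PAIRS).items():
    print("U+%04X normalizes to %r instead" % (source, got))
```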

> On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at opera.com>  
> wrote:
>
>> Also, a single mapping fails the Big5-contra[di]ction test:
>>
>> F9FE =>
>> opera-hk: U+FFED ￭
>> firefox: U+2593 ▓
>> chrome: U+2593 ▓
>> firefox-hk: U+2593 ▓
>> opera: U+2593 ▓
>> chrome-hk: U+FFED ￭
>> internetexplorer: U+2593 ▓
>> hkscs-2008: <U+FFED> ￭
>>
>> I'd say that we should go with U+FFED here, since that's what the
>> [HKSCS-2008] spec says and it's visually close anyway.
>
> Given that the goal is to define a unified Big5 (non-HK) and Big5-HKSCS  
> encoding and that this seems to be a case of the HK standard going  
> against everything and everyone else, perhaps more weight should be  
> given to existing specifications and (non-HK-specific) implementations.
>
> Suggested change:  map F9FE to U+2593

This is the only mapping where IE maps to something other than PUA or "?"
and my mapping disagrees with it, so I don't object to changing it. Still,
it would be very interesting to know why HKSCS-2008 changed it. Do you
know?

> Duplicates and reverse mappings:
>
> big5-foolip.txt currently provides two different codepoints for 100  
> Unicode characters.
>
> 6 are mentioned above.  84 result from compatibility mappings defined in  
> the official HKSCS-2008 specification, cf. [2].  This leaves 10:
>
> 	[2] <http://coq.no/character-tables/o-h-comp.pdf>  
> <http://coq.no/character-tables/o-h-comp.js>
>
> U+5341 (十, 'ten') and U+5345 (卅, 'thirty') are encoded twice in Big5,  
> once as numerals and once as standard Han characters.  (U+5344 廿  
> 'twenty' is only encoded once in Big5, but was added to HKSCS and is now  
> one of the 84 compatibility mappings.)
>
> According to Lunde, the four codepoints F9FA--F9FD (ETen-2 extension)  
> are supposed to encode double-stroked circle segments which appear to be  
> missing from Unicode (I am not sure whether they have ever been proposed  
> for inclusion).  They are currently mapped to single-stroked circle  
> segments instead, but those are already encoded at A2E7, A2A1--A2A3  
> (original Big5).
>
> The four codepoints F9F9, F9E9, F9EB, F9EA (ETen-2 extension) encode  
> line-drawing characters with a double horizontal line.  These appear to  
> be encoded at A2A4--A2A7 (original Big5) already, and it is not clear to  
> me whether the characters at A2A4--A2A7 are supposed to be different or  
> whether ETen chose to encode them again to have a full set of  
> line-drawing characters in one location.
>
> Suggested reverse mappings:
>
> C6CF <= U+5EF4
> C6D3 <= U+65E0
> C6D5 <= U+7676
> C6D7 <= U+96B6
> C6DE <= U+3003
> C6DF <= U+4EDD
> C6CD <= U+5E7A (if the mapping of C6CD is changed)
> Rationale:  Only these mappings will work for non-HK Big5  
> implementations, and these characters appear to be important not only in  
> Hong Kong.
>
> A451 <= U+5341
> A4CA <= U+5345
>
> A27E <= U+256D
> A2A1 <= U+256E
> A2A3 <= U+256F
> A2A2 <= U+2570
>
> F9F9 <= U+2550
> F9E9 <= U+255E
> F9EB <= U+2561
> F9EA <= U+256A
>
> (The 84 compatibility mappings should obviously only be used to decode  
> and never as reverse mappings.)

Anne, how do you plan to define encoders for tables with duplicate  
mappings? Have you collected data for what browsers currently do?

In any event, it clearly needs to be defined what to do for these 100 code  
points that have multiple mappings to Big5. I extended my Python script to  
find these 100 duplicates and to check what Python did for 'big5', falling  
back to 'big5-hkscs'. This is what it produced:

8FB6 <= U+880F
90C4 <= U+96B6
91BE <= U+9F17
9242 <= U+8503
9361 <= U+5F0C
9455 <= U+7250
947A <= U+7468
96EE <= U+701E
9975 <= U+732A
9CE4 <= U+975D
9DEF <= U+5605
9DFB <= U+5ED0
A05F <= U+936E
A0D4 <= U+89A9
A0DC <= U+60A4
A1B2 <= U+3003
A259 <= U+5159
A25A <= U+515B
A25B <= U+515E
A25C <= U+515D
A260 <= U+74E9
A261 <= U+7CCE
A27E <= U+256D
A2A1 <= U+256E
A2A2 <= U+2570
A2A3 <= U+256F
A2A4 <= U+2550
A2A5 <= U+255E
A2A6 <= U+256A
A2A7 <= U+2561
A2CD <= U+5344
A451 <= U+5341
A4CA <= U+5345
A55D <= U+5305
A7FB <= U+675E
A9E4 <= U+62D0
A9F0 <= U+62CE
AACC <= U+8005
ABEC <= U+6062
ADC5 <= U+5029
ADEB <= U+537F
AFB0 <= U+79E3
B05F <= U+8D77
B0B0 <= U+507D
B3A3 <= U+90FD
B440 <= U+5A77
B4B8 <= U+6674
B4E4 <= U+6E2F
B4FC <= U+6E1D
B54E <= U+716E
B5AE <= U+7B51
B5D7 <= U+83C1
B7EC <= U+745C
B9B0 <= U+50ED
BAE6 <= U+7BB8
BAFC <= U+7DD2
BCB5 <= U+6490
BF47 <= U+6FB6
BFA6 <= U+7E1D
BFAE <= U+8028
BFCC <= U+89A6
C052 <= U+975C
C0E7 <= U+71DF
C554 <= U+97FF
C5F7 <= U+77D7
C95C <= U+5C10
C969 <= U+4EDD
C9DB <= U+5E75
C9FC <= U+6C4A
CA52 <= U+9097
CB58 <= U+6C9C
CDE7 <= U+4FBB
CFF1 <= U+7809
D0C0 <= U+91D4
D256 <= U+6D67
D4D1 <= U+5A67
D8F4 <= U+5F58
DB5D <= U+83CF
DB79 <= U+840F
DC52 <= U+9104
DE72 <= U+7162
DECD <= U+75F9
E07C <= U+8F0B
E3C8 <= U+84A8
E6AB <= U+7479
E6D0 <= U+799B
E8CD <= U+99D6
E959 <= U+5B28
EBC9 <= U+8F36
EDCA <= U+7C06
EFF9 <= U+7201
F1E3 <= U+9F16
F5E8 <= U+7E87
F86D <= U+9DF0
F9C4 <= U+9B2E
F9D7 <= U+92B9
FBFD <= U+5EF4
FCD3 <= U+65E0
FD64 <= U+60DE
FEC1 <= U+7676
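
For reference, the approach described above can be sketched roughly as
follows. This is not the script I actually ran: find_duplicates(),
python_encoding() and the three-entry SAMPLE table are made up for
illustration, and the real input would be the full big5-foolip.txt table.

```python
from collections import defaultdict


def find_duplicates(decode_table):
    """Group byte sequences by decoded code point and keep only the clashes."""
    by_code_point = defaultdict(list)
    for pointer, code_point in decode_table.items():
        by_code_point[code_point].append(pointer)
    return {cp: sorted(ptrs)
            for cp, ptrs in by_code_point.items() if len(ptrs) > 1}


def python_encoding(char, codecs=("big5", "big5-hkscs")):
    """Return (codec, bytes) from the first codec that encodes char, or None."""
    for codec in codecs:
        try:
            return codec, char.encode(codec)
        except UnicodeEncodeError:
            continue
    return None


# SAMPLE is a hand-picked fragment for illustration, not the real table.
SAMPLE = {0xA2A4: 0x2550, 0xF9F9: 0x2550, 0xA451: 0x5341}

for code_point, pointers in find_duplicates(SAMPLE).items():
    result = python_encoding(chr(code_point))
    chosen = result[1].hex().upper() if result else "(unencodable)"
    print("U+%04X: candidates %s, Python picks %s"
          % (code_point, ["%04X" % p for p in pointers], chosen))
```

Passing codecs=("big5-hkscs",) instead shows what the HKSCS codec alone
would pick for the same character.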

These are the ones where you (Øistein) disagree:

> C6CF <= U+5EF4
> C6D3 <= U+65E0
> C6D5 <= U+7676
> C6D7 <= U+96B6

AFAICT this has nothing to do with compatibility mappings, so what's the  
reason for this?

> F9E9 <= U+255E
> F9EA <= U+256A
> F9EB <= U+2561
> F9F9 <= U+2550

Python's big5-hkscs agrees, but Python's big5 does this instead:

A2A5 <= U+255E
A2A6 <= U+256A
A2A7 <= U+2561
A2A4 <= U+2550

It seems safer to go with the big5 mappings, but checking what browsers do  
would be helpful.

How about the rest of my generated list, is that fine?

> On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com>  
> wrote:
>
>> There are 29 mappings to U+003F (?) in IE that no other browser has.
>
> Are you referring to the ones at A3E2--A3FE?  IE decodes (or used to  
> decode) the control pictures at A3C0--A3E0 as C0 control characters in  
> plain text, but replace(s) them with question marks in HTML.  It looks  
> like this treatment has been extended to the remaining A3xx  
> codepoints (after the euro), perhaps without a good reason.

Yes, that's the range. I think we should leave these undefined.
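
In decoder terms, leaving a range undefined just means that a lookup miss
yields U+FFFD. A minimal sketch, assuming a table keyed by
(lead << 8) | trail; decode_big5() and that table shape are hypothetical,
and how many bytes to consume after a miss is my assumption, not something
settled in this thread.

```python
REPLACEMENT = "\uFFFD"


def decode_big5(data, table):
    """Decode bytes using a {(lead << 8) | trail: code point} table."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))      # ASCII passes through unchanged
            i += 1
            continue
        trail = data[i + 1] if i + 1 < len(data) else None
        code_point = None if trail is None else table.get((lead << 8) | trail)
        if code_point is not None:
            out.append(chr(code_point))
            i += 2
        else:
            out.append(REPLACEMENT)    # undefined or truncated sequence
            # Assumption: an ASCII trail byte is left in place and decoded
            # on its own; a non-ASCII trail byte is consumed with the lead.
            i += 1 if trail is None or trail < 0x80 else 2
    return "".join(out)


# One-entry sample table, for illustration only.
SAMPLE_TABLE = {0xF9FE: 0x2593}
print(decode_big5(b"a\xf9\xfe\xa3\xe2", SAMPLE_TABLE))
```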

>> The remaining mappings are to PUA or U+FFFD in all browsers [...].
>> Mapping these to U+FFFD unless anyone finds pages using these byte
>> sequences seems the only sane option.
>
> Agreed.  Do any of these ever render in a meaningful way (e.g., in IE on  
> a Windows machine with HK locale and appropriate HKSCS PUA fonts)?
>
> The following 22 codepoints are 'reserved for backwards compatibility'  
> in the HKSCS-2008 standard, but no Unicode mappings are provided:
>
> 9EAC
> 9EC4
> 9EF4
> 9F4E
> 9FAD
> 9FB1
> 9FC0
> 9FC8
> 9FDA
> 9FE6
> 9FEA
> 9FEF
> A054
> A057
> A05A
> A062
> A072
> A0A5
> A0AD
> A0AF
> A0D3
> A0E1
>
> I assume some systems will render at least these as potentially  
> meaningful Han characters.

I generated  
<http://people.opera.com/philipj/2012/04/08/big5-undefined-ie.txt> and had  
a look using various Chinese fonts in Windows 7. It looks like most fonts  
have a copy of the printable ASCII characters in U+F020 through U+F07E,  
and what looks like parts of windows-1252 or latin-1 up to U+F0FF.

Exactly the 22 codepoints you list *are* Han characters in the
MingLiU_HKSCS font, see
<http://people.opera.com/philipj/2012/04/08/big5-mingliu-hkscs.png>.
Presumably they were not in Unicode when HKSCS-2008 was defined, but if
they have been added since, I think we should simply map them.
Unfortunately, I haven't been able to find them by searching by radicals
in the Unihan database...

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Sunday, 8 April 2012 10:03:58 UTC