- From: Philip Jägenstedt <philipj@opera.com>
- Date: Sun, 08 Apr 2012 19:03:58 +0200
On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no> wrote:

> On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com>
> wrote:
>
>> So, <http://people.opera.com/philipj/2012/04/06/big5-foolip.txt> is the
>> mapping I suggest, with 18594 defined mappings and 1188 U+FFFD.
>
> (Second byte 0xA1 appears as 0x7F in the mapping file.)

Oops, I blame Anne's get_bytes() which had an off-by-one error. This is the corrected version:

def get_bytes(index):
    # Each lead byte covers RANGE low trail bytes (0x40 up) plus 0xA1-0xFE.
    # RANGE is defined elsewhere in the script; 0x7E-0x40 + 1 fits the math.
    row = 0xFE-0xA1 + RANGE + 1
    lead = (index // row) + 0x81
    cell = index % row
    # Cells below RANGE map to the low trail range, the rest to 0xA1 and up.
    trail = (cell + 0xA1 - RANGE) if cell >= RANGE else cell + 0x40
    return (lead, trail)

I've also updated big5-foolip.txt with this fix.

> Your table is very similar to my idea of an ideal Big5-HKSCS decoder,
> which follows Unihan, official Big5-HKSCS mappings and real
> implementations closely. My version is documented in detail at
> <http://coq.no/character-tables/chinese-traditional/en> and summarised
> at <http://coq.no/character-tables/charset5/en>.
>
> Only 2 mappings differ, viz, C6CD (U+2F33 v. U+5E7A) and F9FE (U+FFED v.
> U+2593), which is quite reassuring given that we have worked
> independently and used somewhat different approaches. The two divergent
> mappings and related issues are discussed below.

Yes, it's very encouraging indeed that we got this close independently and with different methods!

>> [in summary:]
>> C6CF => U+5EF4 [v.] U+2F35
>> C6D3 => U+65E0 [v.] U+2F46
>> C6D5 => U+7676 [v.] U+2F68
>> C6D7 => U+96B6 [v.] U+2FAA
>> C6DE => U+3003 [v.] U+F6EE [PUA]
>> C6DF => U+4EDD [v.] U+F6EF [PUA]
>
> To this list can be added:
>
> C6CD => U+5E7A v. U+2F33
>
> These seven characters are all part of the E-Ten 1 extension [1] to
> Big5, which is included in all implementations of Big5 (with or without
> HK extensions) that I have come across in browsers.
> The official Big5-HKSCS table includes the E-Ten 1 extension as well,
> but the seven characters listed above appear elsewhere in the HKSCS
> extensions and are handled specially to avoid encoding the same
> character more than once.
>
> [1] <http://coq.no/character-tables/eten1.pdf>
> <http://coq.no/character-tables/eten1.js>

What is the source for the mappings in eten1.pdf? I assume that E-Ten was originally just some Big5 fonts with no defined mappings to Unicode?

> Five are Kangxi radicals and encoded twice in Unicode (once as radicals,
> once as normal Han characters). The official Big5-HKSCS table maps C6CD
> to the Unicode Kangxi radical U+2F33 and does not list the remaining
> four codepoints at all. U+2F33 is the only Unicode Kangxi radical
> included in the official Big5-HKSCS table. As you have noticed, some
> Big5-HKSCS implementations follow this idea for the remaining four as
> well. It seems better to follow non-HK Big5 implementations here and
> map all five to normal Unicode Han characters.
>
> For the last two characters, U+3003 and U+4EDD, there is only one
> possible Unicode mapping, so duplicates are impossible to avoid (without
> using PUA characters). The official Big5-HKSCS table does not map C6DE
> and C6DF to anything.
>
> Suggested change: map C6CD to U+5E7A.

These are the existing mappings:

C6CD =>
opera-hk: U+2F33
firefox: U+5E7A
chrome: U+F6DD
firefox-hk: U+5E7A
opera: U+2F33
chrome-hk: U+2F33
internetexplorer: U+F6DD
hkscs-2008: <U+2F33>

At least on the Web, this isn't a question of HK vs non-HK mappings. Other than Firefox, which (de-facto) specs or implementations use U+5E7A?

Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but it's not the only hanzi in HKSCS-2008 that normalizes into something else:

8BC3 => <U+2F878> => <U+5C6E>
8BF8 => <U+F907> => <U+9F9C>
8EFD => <U+2F994> => <U+82B3>
8FA8 => <U+2F9B2> => <U+456B>
8FF0 => <U+2F9D4> => <U+8CAB>
C6CD => <U+2F33> => <U+5E7A>
957A => <U+2F9BC> => <U+8728>
9874 => <U+2F825> => <U+52C7>
9AC8 => <U+2F83B> => <U+5406>
9C52 => <U+2F8CD> => <U+6649>
A047 => <U+2F840> => <U+54A2>
FC48 => <U+2F894> => <U+5F22>
FC77 => <U+2F8A6> => <U+6148>

I'm not sure what the conclusion is...

> On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at opera.com>
> wrote:
>
>> Also, a single mapping fails the Big5-contra[di]ction test:
>>
>> F9FE =>
>> opera-hk: U+FFED
>> firefox: U+2593
>> chrome: U+2593
>> firefox-hk: U+2593
>> opera: U+2593
>> chrome-hk: U+FFED
>> internetexplorer: U+2593
>> hkscs-2008: <U+FFED>
>>
>> I'd say that we should go with U+FFED here, since that's what the
>> [HKSCS-2008] spec says and it's visually close anyway.
>
> Given that the goal is to define a unified Big5 (non-HK) and Big5-HKSCS
> encoding and that this seems to be a case of the HK standard going
> against everything and everyone else, perhaps more weight should be
> given to existing specifications and (non-HK-specific) implementations.
>
> Suggested change: map F9FE to U+2593

This is the only mapping where IE maps something other than PUA or "?" that my mapping doesn't agree on, so I don't object to changing it. Still, it would be very interesting to know why HKSCS-2008 changed it, do you know?

> Duplicates and reverse mappings:
>
> big5-foolip.txt currently provides two different codepoints for 100
> Unicode characters.
>
> 6 are mentioned above. 84 result from compatibility mappings defined in
> the official HKSCS-2008 specification, cf. [2]. This leaves 10:
>
> [2] <http://coq.no/character-tables/o-h-comp.pdf>
> <http://coq.no/character-tables/o-h-comp.js>
>
> U+5341 ('ten') and U+5345 ('thirty') are encoded twice in Big5,
> once as numerals and once as standard Han characters. (U+5344
> 'twenty' is only encoded once in Big5, but was added to HKSCS and is now
> one of the 84 compatibility mappings.)
>
> According to Lunde, the four codepoints F9FA--F9FD (ETen-2 extension)
> are supposed to encode double-stroked circle segments which appear to be
> missing from Unicode (I am not sure whether they have ever been proposed
> for inclusion). They are currently mapped to single-stroked circle
> segments instead, but those are already encoded at A27E, A2A1--A2A3
> (original Big5).
>
> The four codepoints F9F9, F9E9, F9EB, F9EA (ETen-2 extension) encode
> line-drawing characters with a double horizontal line. These appear to
> be encoded at A2A4--A2A7 (original Big5) already, and it is not clear to
> me whether the characters at A2A4--A2A7 are supposed to be different or
> whether ETen chose to encode them again to have a full set of
> line-drawing characters in one location.
>
> Suggested reverse mappings:
>
> C6CF <= U+5EF4
> C6D3 <= U+65E0
> C6D5 <= U+7676
> C6D7 <= U+96B6
> C6DE <= U+3003
> C6DF <= U+4EDD
> C6CD <= U+5E7A (if the mapping of C6CD is changed)
> Rationale: Only these mappings will work for non-HK Big5
> implementations, and these characters appear to be important not only in
> Hong Kong.
>
> A451 <= U+5341
> A4CA <= U+5345
>
> A27E <= U+256D
> A2A1 <= U+256E
> A2A3 <= U+256F
> A2A2 <= U+2570
>
> F9F9 <= U+2550
> F9E9 <= U+255E
> F9EB <= U+2561
> F9EA <= U+256A
>
> (The 84 compatibility mappings should obviously only be used to decode
> and never as reverse mappings.)

Anne, how do you plan to define encoders for tables with duplicate mappings? Have you collected data for what browsers currently do? In any event, it clearly needs to be defined what to do for these 100 code points that have multiple mappings to Big5.

I extended my Python script to find these 100 duplicates and to check what Python did for 'big5', falling back to 'big5-hkscs'.
This is what it produced:

8FB6 <= U+880F
90C4 <= U+96B6
91BE <= U+9F17
9242 <= U+8503
9361 <= U+5F0C
9455 <= U+7250
947A <= U+7468
96EE <= U+701E
9975 <= U+732A
9CE4 <= U+975D
9DEF <= U+5605
9DFB <= U+5ED0
A05F <= U+936E
A0D4 <= U+89A9
A0DC <= U+60A4
A1B2 <= U+3003
A259 <= U+5159
A25A <= U+515B
A25B <= U+515E
A25C <= U+515D
A260 <= U+74E9
A261 <= U+7CCE
A27E <= U+256D
A2A1 <= U+256E
A2A2 <= U+2570
A2A3 <= U+256F
A2A4 <= U+2550
A2A5 <= U+255E
A2A6 <= U+256A
A2A7 <= U+2561
A2CD <= U+5344
A451 <= U+5341
A4CA <= U+5345
A55D <= U+5305
A7FB <= U+675E
A9E4 <= U+62D0
A9F0 <= U+62CE
AACC <= U+8005
ABEC <= U+6062
ADC5 <= U+5029
ADEB <= U+537F
AFB0 <= U+79E3
B05F <= U+8D77
B0B0 <= U+507D
B3A3 <= U+90FD
B440 <= U+5A77
B4B8 <= U+6674
B4E4 <= U+6E2F
B4FC <= U+6E1D
B54E <= U+716E
B5AE <= U+7B51
B5D7 <= U+83C1
B7EC <= U+745C
B9B0 <= U+50ED
BAE6 <= U+7BB8
BAFC <= U+7DD2
BCB5 <= U+6490
BF47 <= U+6FB6
BFA6 <= U+7E1D
BFAE <= U+8028
BFCC <= U+89A6
C052 <= U+975C
C0E7 <= U+71DF
C554 <= U+97FF
C5F7 <= U+77D7
C95C <= U+5C10
C969 <= U+4EDD
C9DB <= U+5E75
C9FC <= U+6C4A
CA52 <= U+9097
CB58 <= U+6C9C
CDE7 <= U+4FBB
CFF1 <= U+7809
D0C0 <= U+91D4
D256 <= U+6D67
D4D1 <= U+5A67
D8F4 <= U+5F58
DB5D <= U+83CF
DB79 <= U+840F
DC52 <= U+9104
DE72 <= U+7162
DECD <= U+75F9
E07C <= U+8F0B
E3C8 <= U+84A8
E6AB <= U+7479
E6D0 <= U+799B
E8CD <= U+99D6
E959 <= U+5B28
EBC9 <= U+8F36
EDCA <= U+7C06
EFF9 <= U+7201
F1E3 <= U+9F16
F5E8 <= U+7E87
F86D <= U+9DF0
F9C4 <= U+9B2E
F9D7 <= U+92B9
FBFD <= U+5EF4
FCD3 <= U+65E0
FD64 <= U+60DE
FEC1 <= U+7676

These are the ones where you (Øistein) disagree:

> C6CF <= U+5EF4
> C6D3 <= U+65E0
> C6D5 <= U+7676
> C6D7 <= U+96B6

AFAICT this has nothing to do with compatibility mappings, so what's the reason for this?
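
For anyone wanting to reproduce this kind of check, here is a minimal sketch of the approach (not the actual script): group a decode table by code point to find duplicates, then ask Python's 'big5' codec, falling back to 'big5-hkscs', which bytes it encodes each one to. DECODE is a hypothetical seven-row excerpt built only from pairs quoted in this thread, and find_duplicates() and python_choice() are names made up for the illustration.

```python
from collections import defaultdict

# Hypothetical excerpt of a decode table (Big5 code -> Unicode code point).
# All pairs are quoted somewhere in this thread; the real table is far larger.
DECODE = {
    0xC6CF: 0x5EF4, 0xFBFD: 0x5EF4,
    0xC6D7: 0x96B6, 0x90C4: 0x96B6,
    0xA2A4: 0x2550, 0xF9F9: 0x2550,
    0xF9FE: 0x2593,  # only one entry in this excerpt, so not a duplicate here
}


def find_duplicates(decode):
    """Return {code point: sorted Big5 codes} for code points decoded from
    more than one Big5 code."""
    by_cp = defaultdict(list)
    for code, cp in decode.items():
        by_cp[cp].append(code)
    return {cp: sorted(codes) for cp, codes in by_cp.items() if len(codes) > 1}


def python_choice(cp, codec_names=("big5", "big5-hkscs")):
    """Return the code Python encodes `cp` to, trying each codec in order,
    or None if none of them can encode it."""
    for name in codec_names:
        try:
            raw = chr(cp).encode(name)
        except UnicodeEncodeError:
            continue  # not encodable in this codec, try the next one
        return int.from_bytes(raw, "big")
    return None


if __name__ == "__main__":
    for cp, codes in sorted(find_duplicates(DECODE).items()):
        chosen = python_choice(cp)
        shown = "none" if chosen is None else f"{chosen:04X}"
        candidates = ", ".join(f"{c:04X}" for c in codes)
        print(f"{shown} <= U+{cp:04X} (candidates: {candidates})")
```

The result depends on the codec tables shipped with the Python version in use, so it only shows what one implementation does and says nothing about browsers.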
> F9E9 <= U+255E
> F9EA <= U+256A
> F9EB <= U+2561
> F9F9 <= U+2550

Python's big5-hkscs agrees, but Python's big5 does this instead:

A2A5 <= U+255E
A2A6 <= U+256A
A2A7 <= U+2561
A2A4 <= U+2550

It seems safer to go with the big5 mappings, but checking what browsers do would be helpful.

How about the rest of my generated list, is that fine?

> On Fri Apr 6 14:03:22 PDT 2012, Philip Jägenstedt <philipj at opera.com>
> wrote:
>
>> There are 29 mappings to U+003F (?) in IE that no other browser has.
>
> Are you referring to the ones at A3E2--A3FE? IE decodes (or used to
> decode) the control pictures at A3C0--A3E0 as C0 control characters in
> plain text, but replace(s) them with question marks in HTML. It looks
> like this treatment has been extended to the remaining A3xx
> codepoints (after the euro), perhaps without a good reason.

Yes, that's the range. I think we should leave these undefined.

>> The remaining mappings are to PUA or U+FFFD in all browsers [...].
>> Mapping these to U+FFFD unless anyone finds pages using these byte
>> sequences seems the only sane option.
>
> Agreed. Do any of these ever render in a meaningful way (e.g., in IE on
> a Windows machine with HK locale and appropriate HKSCS PUA fonts)?
>
> The following 22 codepoints are 'reserved for backwards compatibility'
> in the HKSCS-2008 standard, but no Unicode mappings are provided:
>
> 9EAC
> 9EC4
> 9EF4
> 9F4E
> 9FAD
> 9FB1
> 9FC0
> 9FC8
> 9FDA
> 9FE6
> 9FEA
> 9FEF
> A054
> A057
> A05A
> A062
> A072
> A0A5
> A0AD
> A0AF
> A0D3
> A0E1
>
> I assume some systems will render at least these as potentially
> meaningful Han characters.

I generated <http://people.opera.com/philipj/2012/04/08/big5-undefined-ie.txt> and had a look using various Chinese fonts in Windows 7. It looks like most fonts have a copy of the printable ASCII characters in U+F020 through U+F07E, and what looks like parts of windows-1252 or latin-1 up to U+F0FF.
Exactly the 22 codepoints you list *are* Han characters in the MingLiu_HKSCS font, see <http://people.opera.com/philipj/2012/04/08/big5-mingliu-hkscs.png>. Presumably they were not in Unicode when HKSCS-2008 was defined, but if they have been added since I think we should simply map them. Unfortunately, I haven't been able to find them by searching by radicals in the Unihan database...

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Sunday, 8 April 2012 10:03:58 UTC