[whatwg] Encoding: big5 and big5-hkscs

From: Philip Jägenstedt <philipj@opera.com>
Date: Fri, 06 Apr 2012 15:42:26 +0200
Message-ID: <op.wcci00uesr6mfa@localhost.localdomain>
On Fri, 06 Apr 2012 12:54:53 +0200, Philip Jägenstedt <philipj at opera.com>  
wrote:

> As a starting point for the spec, I suggest taking the intersection of  
> opera-hk, firefox-hk and chrome-hk.

I've written a script in <https://gitorious.org/whatwg/big5> to generate  
the mapping that I think makes sense. This is the logic used:

1. If all 3 *-hk mappings agree, use that.
2. If 2 of the *-hk mappings agree on something that is not in the PUA and  
not U+FFFD, use that.
3. If HKSCS-2008 [1] defines a mapping, verify that at least 1 *-hk  
mapping agrees and use that.

Finally, check that the resulting mapping does not use the PUA or U+FFFD,  
and does not contradict a Big5 mapping that everybody agrees on.
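
For illustration, the selection rules above could be sketched roughly as follows. This is not the script from the gitorious repository; the function names (`is_pua`, `usable`, `choose`) and the data layout (one list of three code points per byte pair, `None` for unmapped) are assumptions made for the example. The final check is applied here as a filter, and the Big5-contradiction check is left out because it needs the non-hk mappings as extra input.

```python
from collections import Counter

FFFD = 0xFFFD

def is_pua(cp):
    """True for code points in any of the three Unicode Private Use Areas."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def usable(cp):
    """A code point is usable if it is mapped, not U+FFFD and not in the PUA."""
    return cp is not None and cp != FFFD and not is_pua(cp)

def choose(hk, hkscs=None):
    """Pick a code point for one byte pair.

    hk:    the three *-hk code points (None = unmapped), hypothetical layout.
    hkscs: the single HKSCS-2008 code point, or None if it defines none.
    Returns the chosen code point, or None if no rule applies.
    """
    value, count = Counter(hk).most_common(1)[0]
    if count == 3:
        # Rule 1: all three *-hk mappings agree.
        result = value
    elif count == 2 and usable(value):
        # Rule 2: two agree on something outside the PUA that is not U+FFFD.
        result = value
    elif hkscs is not None and hkscs in hk:
        # Rule 3: HKSCS-2008 defines a mapping and at least one *-hk agrees.
        result = hkscs
    else:
        return None
    # Final check (PUA / U+FFFD part only), applied as a filter in this sketch.
    return result if usable(result) else None
```

Byte pairs for which `choose` returns `None` are the ones that would be left for further investigation.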

This yields a mapping for 18583 of 19782 combinations, which I propose as  
a starting point. To this I would add these 4 mappings from HKSCS-2008,  
which uses multiple code points to represent what was previously a single  
code point in the PUA in some browsers:

8862 => <U+00CA,U+0304> Ê̄
8864 => <U+00CA,U+030C> Ê̌
88A3 => <U+00EA,U+0304> ê̄
88A5 => <U+00EA,U+030C> ê̌
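
One consequence of these four entries is that a decoder table can no longer assume one code point per byte pair. A minimal sketch of how they might be stored, assuming a plain dict from byte pair to string (the names `MULTI` and `describe` are made up for the example):

```python
import unicodedata

# The four HKSCS-2008 entries that decode to a base letter plus a
# combining mark, stored as two-code-point strings.
MULTI = {
    0x8862: "\u00CA\u0304",
    0x8864: "\u00CA\u030C",
    0x88A3: "\u00EA\u0304",
    0x88A5: "\u00EA\u030C",
}

def describe(s):
    """Return the Unicode character names of each code point in s."""
    return [unicodedata.name(c) for c in s]

for pair, text in MULTI.items():
    print(f"{pair:04X} => {describe(text)}")
```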

Also, a single mapping fails the Big5-contradiction test:

F9FE =>
opera-hk: U+FFED ￭
firefox: U+2593 ▓
chrome: U+2593 ▓
firefox-hk: U+2593 ▓
opera: U+2593 ▓
chrome-hk: U+FFED ￭
internetexplorer: U+2593 ▓
hkscs-2008: <U+FFED> ￭

I'd say that we should go with U+FFED here, since that's what the spec  
says and it's visually close anyway.

These are the ranges that need more investigation.

8140-817F, 81A2-81FE, 8240-827F, 82A2-82FE, 8340-837F, 83A2-83FE,  
8440-847F, 84A2-84FE, 8540-857F, 85A2-85FE, 8640-867F, 86A2-86FE, 8766,  
87E0-87FE, 8862, 8864, 88A3, 88A5, 88AB-88FE, 8942, 8944-8945, 894A-894B,  
89A7-89AA, 89AF, 89B3-89B4, 89C0, 89C4, 8A42, 8A63, 8A75, 8AAB, 8AB1,  
8ABA, 8AC8, 8ACD, 8ADD-8ADE, 8AF5, 8B54, 8BDD, 8BFE, 8CA6, 8CC6-8CC8,  
8CCD, 8CE5, 8D41, 9B61, 9EAC, 9EC4, 9EF4, 9F4E, 9FAD, 9FB1, 9FC0, 9FC8,  
9FDA, 9FE6, 9FEA, 9FEF, A054, A057, A05A, A062, A072, A0A5, A0AD, A0AF,  
A0D3, A0E1, A3E2-A3FE, C6CF, C6D3, C6D5, C6D7, C6DE-C6DF, C8A5-C8CC,  
C8F2-C8F4

They all map to U+FFFD in opera-hk and mostly to PUA points in other  
mappings. A lot of them should probably be U+FFFD, but not all of them. Is  
someone (Simon?) able to do a search for existing content labeled as Big5  
or Big5-HKSCS that uses any of these bytes?
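
A search like that could be sketched as below. The range list is copied from above; everything else (the names `expand` and `find_suspect`, and the simplified walk that treats any byte 0x81-0xFE as a lead byte and consumes the next byte as its trail) is an assumption for illustration, not how any particular crawler or decoder works.

```python
RANGES = """
8140-817F, 81A2-81FE, 8240-827F, 82A2-82FE, 8340-837F, 83A2-83FE,
8440-847F, 84A2-84FE, 8540-857F, 85A2-85FE, 8640-867F, 86A2-86FE, 8766,
87E0-87FE, 8862, 8864, 88A3, 88A5, 88AB-88FE, 8942, 8944-8945, 894A-894B,
89A7-89AA, 89AF, 89B3-89B4, 89C0, 89C4, 8A42, 8A63, 8A75, 8AAB, 8AB1,
8ABA, 8AC8, 8ACD, 8ADD-8ADE, 8AF5, 8B54, 8BDD, 8BFE, 8CA6, 8CC6-8CC8,
8CCD, 8CE5, 8D41, 9B61, 9EAC, 9EC4, 9EF4, 9F4E, 9FAD, 9FB1, 9FC0, 9FC8,
9FDA, 9FE6, 9FEA, 9FEF, A054, A057, A05A, A062, A072, A0A5, A0AD, A0AF,
A0D3, A0E1, A3E2-A3FE, C6CF, C6D3, C6D5, C6D7, C6DE-C6DF, C8A5-C8CC,
C8F2-C8F4
"""

def expand(spec):
    """Expand 'XXXX' and 'XXXX-YYYY' hex items into a set of 16-bit values."""
    pairs = set()
    for item in spec.replace("\n", " ").split(","):
        item = item.strip()
        if not item:
            continue
        lo, _, hi = item.partition("-")
        start = int(lo, 16)
        end = int(hi, 16) if hi else start
        pairs.update(range(start, end + 1))
    return pairs

SUSPECT = expand(RANGES)

def find_suspect(data):
    """Return (offset, byte pair) for each suspect pair found in data.

    Simplified walk: a byte in 0x81-0xFE is taken as a lead byte and the
    following byte as its trail; all other bytes are skipped one at a time.
    """
    hits = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE and i + 1 < len(data):
            pair = (b << 8) | data[i + 1]
            if pair in SUSPECT:
                hits.append((i, pair))
            i += 2
        else:
            i += 1
    return hits
```

Running `find_suspect` over the raw bytes of pages labeled Big5 or Big5-HKSCS would give a list of offsets to inspect by hand.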

[1]  
http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/download_area/mapping_table_2008.htm

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Friday, 6 April 2012 06:42:26 GMT
