gbk and gb18030 double byte data update

A year and a half ago I compiled
http://lists.w3.org/Archives/Public/www-archive/2012Apr/0030.html using
http://dump.testsuite.org/encoding/gbk/ and some basic Python scripts to
analyze the output. (Internet Explorer is absent because it does not support
XMLHttpRequest's overrideMimeType.)

However, gb18030 is supposed to be a UTF and in Rebel Opera it is not. Back
then I did not take this as a hard requirement, but it leads to problems
such as https://www.w3.org/Bugs/Public/show_bug.cgi?id=21145 and might in
fact violate some Chinese government regulations depending on who you ask.

gb18030 data is the same between Gecko and Chrome. Where gbk differs from
gb18030 in Gecko, the byte sequence is mapped to U+FFFD. In Chrome, a PUA
mapping is used instead, as illustrated below:

Index           Chrome
       gb18030  gbk
 6432  20AC     E76C
 7536  01F9     E7C8
 7672  303E     E7E7
 7673  2FF0     E7E8
 7674  2FF1     E7E9
 7675  2FF2     E7EA
 7676  2FF3     E7EB
 7677  2FF4     E7EC
 7678  2FF5     E7ED
 7679  2FF6     E7EE
 7680  2FF7     E7EF
 7681  2FF8     E7F0
 7682  2FF9     E7F1
 7683  2FFA     E7F2
 7684  2FFB     E7F3
23766  2E81     E815
23770  2E84     E819
23771  3473     E81A
23772  3447     E81B
23773  2E88     E81C
23774  2E8B     E81D
23776  359E     E81F
23777  361A     E820
23778  360E     E821
23779  2E8C     E822
23780  2E97     E823
23781  396E     E824
23782  3918     E825
23784  39CF     E827
23785  39DF     E828
23786  3A73     E829
23787  39D0     E82A
23790  3B4E     E82D
23791  3C6E     E82E
23792  3CE0     E82F
23793  2EA7     E830
23796  2EAA     E833
23797  4056     E834
23798  415F     E835
23799  2EAE     E836
23800  4337     E837
23801  2EB3     E838
23802  2EB6     E839
23803  2EB7     E83A
23805  43B1     E83C
23806  43AC     E83D
23807  2EBB     E83E
23808  43DD     E83F
23809  44D6     E840
23810  4661     E841
23811  464C     E842
23813  4723     E844
23814  4729     E845
23815  477C     E846
23816  478D     E847
23817  2ECA     E848
23818  4947     E849
23819  497A     E84A
23820  497D     E84B
23821  4982     E84C
23822  4983     E84D
23823  4985     E84E
23824  4986     E84F
23825  499F     E850
23826  499B     E851
23827  49B7     E852
23828  49B6     E853
23831  4CA3     E856
23832  4C9F     E857
23833  4CA0     E858
23834  4CA1     E859
23835  4C77     E85A
23836  4CA2     E85B
23837  4D13     E85C
23838  4D14     E85D
23839  4D15     E85E
23840  4D16     E85F
23841  4D17     E860
23842  4D18     E861
23843  4D19     E862
23844  4DAE     E863
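
As a quick sanity check on the table above, the following sketch confirms
that the Chrome gbk values fall in the BMP Private Use Area (U+E000 to
U+F8FF) while the gb18030 values do not. Only a handful of rows are copied
here for brevity; the names ROWS and is_bmp_pua are my own, not taken from
any browser source.

```python
import unicodedata

# (index, gb18030 code point, Chrome gbk code point), copied from the table.
ROWS = [
    (6432, 0x20AC, 0xE76C),
    (7536, 0x01F9, 0xE7C8),
    (7672, 0x303E, 0xE7E7),
    (23766, 0x2E81, 0xE815),
    (23844, 0x4DAE, 0xE863),
]


def is_bmp_pua(code_point: int) -> bool:
    """True if the code point lies in the BMP Private Use Area."""
    return 0xE000 <= code_point <= 0xF8FF


for index, gb18030_cp, gbk_cp in ROWS:
    # Every sampled gbk value should be PUA, every gb18030 value should not.
    assert is_bmp_pua(gbk_cp) and not is_bmp_pua(gb18030_cp)
    name = unicodedata.name(chr(gb18030_cp), "<unnamed>")
    print(f"{index:5d}  U+{gb18030_cp:04X} ({name})  vs  U+{gbk_cp:04X} (PUA)")
```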

Given the differences among browsers for these 81 mappings it seems safe to
use the gb18030 mapping universally and even turn gbk into a label for
gb18030.

Note that the indexes are in line with what
http://encoding.spec.whatwg.org/ is using.
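
For readers who want to go from an index back to the two-byte sequence,
here is a minimal sketch. It assumes the pointer layout used by the
Encoding Standard's gb18030 encoder (190 trail positions per lead byte,
lead bytes starting at 0x81, trail bytes skipping 0x7F). The function name
index_to_bytes is hypothetical.

```python
def index_to_bytes(pointer: int) -> bytes:
    """Return the two-byte sequence for a double-byte index pointer."""
    lead = pointer // 190 + 0x81
    trail = pointer % 190
    # Trail bytes run 0x40-0x7E and 0x80-0xFE; 0x7F is skipped, so the
    # upper part of the range is shifted by one.
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead, trail + offset])


# A few indexes from the table above.
for pointer in (6432, 7536, 23766):
    print(pointer, index_to_bytes(pointer).hex().upper())
```

With this layout, index 6432 (U+20AC in the gb18030 column) corresponds to
the bytes A2 E3, and index 23766 to FE 50.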


-- 
http://annevankesteren.nl/

Received on Monday, 16 December 2013 15:49:05 UTC