[whatwg/encoding] Differences between tests and specification (#169)

I'm implementing the encoding tests in C to test my encoding functions. I parsed the results from [Summarized test results: Encoding, double-byte](https://www.w3.org/International/tests/repo/results/encoding-dbl-byte.en) into C arrays so that I have static data to test against. By the way, it would be great to have such lists as plain text files, easy to parse, in the git repository.

Several of the tests returned errors, and after verifying the results by hand, it seems that some expected test results differ from what the spec algorithms produce. I may be wrong or have missed some things, but here is the list of the problems I encountered. When I say "should" I mean what is expected if you follow the spec.

gbk: I believe those problems are related to the gb18030:2005 version of the encoding.

encoder: encoding 0xe7c7 should return error on [step 7 of the gb18030 encoder](https://encoding.spec.whatwg.org/#gb18030-encoder) since 0xe7c7 is not in the index and the gbk flag is set.

decoder: decoding 0xa8 0xbc should return the 0x1e3f codepoint on [step 5.6 of the gb18030 decoder](https://encoding.spec.whatwg.org/#gb18030-decoder): lead = 0xa8, byte = 0xbc, offset = 0x41 => pointer is (0xa8 - 0x81) * 190 + (0xbc - 0x41) = 7533, which is 0x1e3f in the gb18030 index. The test expects 0xe7c7.

gb18030 decoder: same problem as gbk with 0xa8 0xbc.
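To make the arithmetic above easy to re-check, here is a minimal C sketch of the two-byte pointer computation as I read it in the gb18030 decoder. The function name `gb18030_two_byte_pointer` and the `-1` return convention are my own choices for illustration, not something the spec defines, and the index lookup itself is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer arithmetic for a two-byte gb18030/gbk sequence.
 * Returns the pointer into the gb18030 index, or -1 when the trail
 * byte is outside 0x40-0x7E / 0x80-0xFE (hypothetical convention). */
int gb18030_two_byte_pointer(uint8_t lead, uint8_t byte)
{
    /* The lead byte of a two-byte sequence is in 0x81-0xFE. */
    assert(lead >= 0x81 && lead <= 0xFE);

    int offset = byte < 0x7F ? 0x40 : 0x41;

    if ((byte >= 0x40 && byte <= 0x7E) || (byte >= 0x80 && byte <= 0xFE))
        return (lead - 0x81) * 190 + (byte - offset);

    return -1;
}
```

Calling `gb18030_two_byte_pointer(0xA8, 0xBC)` gives 7533, the pointer quoted above.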

euc-jp decoder:
0x5c should return 0x5c on [step 6 of the euc-jp decoder](https://encoding.spec.whatwg.org/#legacy-multi-byte-japanese-encodings) since it's an ascii byte. The test expects 0xa5.

0x7e should return 0x7e on [step 6 of the euc-jp decoder](https://encoding.spec.whatwg.org/#legacy-multi-byte-japanese-encodings) since it's an ascii byte. The test expects 0x203e.

0xa1 0xdd should return 0xff0d on [step 5.4 of the euc-jp decoder](https://encoding.spec.whatwg.org/#legacy-multi-byte-japanese-encodings), lead = 0xa1, byte = 0xdd => pointer is 60 which is 0xff0d in the jis0208 index. The test expects 0x2212.

iso-2022-jp decoder:
0x1b 0x24 0x42 0x21 0x5d 0x1b 0x28 0x42 should return 0xff0d in the [trail byte state of the iso-2022-jp decoder](https://encoding.spec.whatwg.org/#iso-2022-jp-decoder), lead = 0x21, byte = 0x5d => pointer is 60 which is 0xff0d in the jis0208 index. The test expects 0x2212.
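The two pointer computations above (euc-jp and iso-2022-jp) can be sketched in C as follows. This is only the arithmetic from the spec steps as I understand them; the function names and the `-1` return value for out-of-range bytes are assumptions made for this example, and the jis0208 index lookup is not included.

```c
#include <assert.h>
#include <stdint.h>

/* euc-jp: both bytes must be in 0xA1-0xFE to form a jis0208 pointer.
 * Returns -1 otherwise (hypothetical convention). */
int eucjp_pointer(uint8_t lead, uint8_t byte)
{
    if (lead >= 0xA1 && lead <= 0xFE && byte >= 0xA1 && byte <= 0xFE)
        return (lead - 0xA1) * 94 + (byte - 0xA1);
    return -1;
}

/* iso-2022-jp, trail byte state: the lead byte was already accepted
 * in 0x21-0x7E, the trail byte must be in the same range. */
int iso2022jp_pointer(uint8_t lead, uint8_t byte)
{
    assert(lead >= 0x21 && lead <= 0x7E);

    if (byte >= 0x21 && byte <= 0x7E)
        return (lead - 0x21) * 94 + (byte - 0x21);
    return -1;
}
```

Both `eucjp_pointer(0xA1, 0xDD)` and `iso2022jp_pointer(0x21, 0x5D)` evaluate to 60, the pointer mentioned in the two items above.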

Shift_jis:
encoder:
```
0x2116 should return 0xfa 0x59 but the test expects 0x87 0x82
0x2121 should return 0xfa 0x5a but the test expects 0x87 0x84
0x2160 should return 0xfa 0x4a but the test expects 0x87 0x54
0x2161 should return 0xfa 0x4b but the test expects 0x87 0x55
0x2162 should return 0xfa 0x4c but the test expects 0x87 0x56
0x2163 should return 0xfa 0x4d but the test expects 0x87 0x57
0x2164 should return 0xfa 0x4e but the test expects 0x87 0x58
0x2165 should return 0xfa 0x4f but the test expects 0x87 0x59
0x2166 should return 0xfa 0x50 but the test expects 0x87 0x5a
0x2167 should return 0xfa 0x51 but the test expects 0x87 0x5b
0x2168 should return 0xfa 0x52 but the test expects 0x87 0x5c
0x2169 should return 0xfa 0x53 but the test expects 0x87 0x5d
0x221a should return 0x87 0x95 but the test expects 0x81 0xe3
0x2220 should return 0x87 0x97 but the test expects 0x81 0xda
0x2229 should return 0x87 0x9b but the test expects 0x81 0xbf
0x222a should return 0x87 0x9c but the test expects 0x81 0xbe
0x222b should return 0x87 0x92 but the test expects 0x81 0xe7
0x2235 should return 0xfa 0x5b but the test expects 0x81 0xe6
0x2252 should return 0x87 0x90 but the test expects 0x81 0xe0
0x2261 should return 0x87 0x91 but the test expects 0x81 0xdf
0x22a5 should return 0x87 0x96 but the test expects 0x81 0xdb
0x3231 should return 0xfa 0x58 but the test expects 0x87 0x8a
0xffe2 should return 0xfa 0x54 but the test expects 0x81 0xca
```
Those errors are caused by the fact that the jis0208 index contains 2 or 3 pointers for each of those codepoints. The [shift_jis encoder](https://encoding.spec.whatwg.org/#shift_jis-encoder) uses the [index shift_jis pointer](https://encoding.spec.whatwg.org/#index-shift_jis-pointer) algorithm to exclude pointers in the range 8272 to 8835, but most of those codepoints' pointers are not in that range. The tests expect the last pointer to be used when a codepoint has several pointers, but the algorithm in the spec doesn't specify that.
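For anyone who wants to check which pointer a given byte pair in the list corresponds to, here is a small C sketch of the pointer-to-bytes step of the shift_jis encoder, together with the 8272 to 8835 exclusion test. The names `sjis_pointer_excluded` and `sjis_bytes_from_pointer` are made up for this example. The sketch does not decide which pointer the index lookup selects; it only converts a pointer that has already been chosen.

```c
#include <assert.h>
#include <stdint.h>

/* True when the pointer is in the range that the
 * index shift_jis pointer algorithm excludes. */
int sjis_pointer_excluded(unsigned pointer)
{
    return pointer >= 8272 && pointer <= 8835;
}

/* Convert a jis0208 pointer into the two shift_jis bytes. */
void sjis_bytes_from_pointer(unsigned pointer, uint8_t out[2])
{
    unsigned lead = pointer / 188;
    unsigned lead_offset = lead < 0x1F ? 0x81 : 0xC1;
    unsigned trail = pointer % 188;
    unsigned offset = trail < 0x3F ? 0x40 : 0x41;

    /* Pointers past the end of the byte range are a caller error. */
    assert(lead + lead_offset <= 0xFF);

    out[0] = (uint8_t)(lead + lead_offset);
    out[1] = (uint8_t)(trail + offset);
}
```

With this, pointer 1207 converts to 0x87 0x90 and pointer 159 converts to 0x81 0xe0, the two byte pairs quoted for 0x2252 in the list; neither pointer is in the excluded range.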

decoder: same problems as the euc-jp decoder with
0x5c should be 0x5c but the test expects 0xa5
0x7e should be 0x7e but the test expects 0x203e
0x81 0x7c should be 0xff0d but the test expects 0x2212.
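The 0x81 0x7c case goes through the same jis0208 pointer as the euc-jp one. A minimal C sketch of the shift_jis decoder's pointer arithmetic, as I read the spec, is below; `sjis_pointer` and its `-1` return value are names and conventions I picked for the example, and the index lookup is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer arithmetic for a two-byte shift_jis sequence.
 * Returns -1 when the trail byte is outside 0x40-0x7E / 0x80-0xFC
 * (hypothetical convention). */
int sjis_pointer(uint8_t lead, uint8_t byte)
{
    /* Lead bytes are 0x81-0x9F or 0xE0-0xFC. */
    assert((lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xFC));

    int offset = byte < 0x7F ? 0x40 : 0x41;
    int lead_offset = lead < 0xA0 ? 0x81 : 0xC1;

    if ((byte >= 0x40 && byte <= 0x7E) || (byte >= 0x80 && byte <= 0xFC))
        return (lead - lead_offset) * 188 + (byte - offset);

    return -1;
}
```

`sjis_pointer(0x81, 0x7C)` evaluates to 60, the same pointer as 0xa1 0xdd in euc-jp.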

A final point is that there are no tests (as far as I can tell) for the cases where the big5 decoder returns 2 codepoints. Here are the values for such tests and the expected results:
0x88 0x62 ( bytes ) should return 0xca 0x304 (codepoints)
0x88 0x64 ( bytes ) should return 0xca 0x30c (codepoints)
0x88 0xa3 ( bytes ) should return 0xea 0x304 (codepoints)
0x88 0xa5 ( bytes ) should return 0xea 0x30c (codepoints)
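As a starting point for such tests, here is a C sketch of the big5 pointer arithmetic and of the four special pointers that produce two codepoints, following my reading of the big5 decoder. The function names and return conventions (`-1` for an invalid trail byte, `0` for "use the normal index lookup") are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pointer arithmetic for a two-byte big5 sequence.
 * Returns -1 when the trail byte is outside 0x40-0x7E / 0xA1-0xFE. */
int big5_pointer(uint8_t lead, uint8_t byte)
{
    assert(lead >= 0x81 && lead <= 0xFE);

    int offset = byte < 0x7F ? 0x40 : 0x62;

    if ((byte >= 0x40 && byte <= 0x7E) || (byte >= 0xA1 && byte <= 0xFE))
        return (lead - 0x81) * 157 + (byte - offset);

    return -1;
}

/* Writes the two codepoints for the four special pointers and
 * returns 2; returns 0 for any other pointer. */
int big5_special_pair(int pointer, uint32_t out[2])
{
    switch (pointer) {
    case 1133: out[0] = 0x00CA; out[1] = 0x0304; return 2;
    case 1135: out[0] = 0x00CA; out[1] = 0x030C; return 2;
    case 1164: out[0] = 0x00EA; out[1] = 0x0304; return 2;
    case 1166: out[0] = 0x00EA; out[1] = 0x030C; return 2;
    default:   return 0;
    }
}
```

The four byte pairs listed above map to pointers 1133, 1135, 1164 and 1166 respectively.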


Received on Thursday, 20 December 2018 18:42:57 UTC