[whatwg/encoding] Corner cases arising from Big5 encoder not excluding HKSCS codes with lead bytes 0xFA–FE (#252)

https://encoding.spec.whatwg.org/commit-snapshots/4d54adce6a871cb03af3a919cbf644a43c22301a/#visualization


> Let index be index Big5 excluding all entries whose pointer is less than (0xA1 - 0x81) × 157.
> 
> Avoid returning Hong Kong Supplementary Character Set extensions literally.

As became apparent in my attempts to [chart different Big5 and CNS 11643 variants](https://harjit.moe/cns-conc.html), if the intention is to make the encoder purely [Big5-ETEN](https://moztw.org/docs/big5/table/eten.txt), excluding all further extensions that Big5-HKSCS adds on top of it, then lead bytes 0xFA–FE need to be excluded, not just 0x81–A0.
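
To make the threshold in the quoted step concrete, here is a minimal sketch of the pointer arithmetic, following the lead/trail formulas in the standard's Big5 decoder and encoder. The function and constant names are my own and are not taken from the spec or any library.

```python
# Sketch of the Big5 pointer arithmetic; names are illustrative only.

# Pointers below this value are excluded by the quoted encoder step.
ENCODER_THRESHOLD = (0xA1 - 0x81) * 157


def bytes_to_pointer(lead: int, trail: int) -> int:
    """Map a Big5 lead/trail byte pair to its index pointer."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        raise ValueError("trail byte out of range")
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)


def pointer_to_bytes(pointer: int) -> tuple[int, int]:
    """Inverse mapping, as used when encoding."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    offset = 0x40 if trail < 0x3F else 0x62
    return lead, trail + offset


# The threshold is exactly the pointer of the first code with lead byte 0xA1,
# so only lead bytes 0x81-0xA0 fall below it.
print(ENCODER_THRESHOLD, bytes_to_pointer(0xA1, 0x40))
# A code with lead byte 0xFB sits far above the threshold and is kept.
print(bytes_to_pointer(0xFB, 0x48))
```

Since the pointer grows with the lead byte, the existing rule cuts off only the low end of the HKSCS range; nothing in it bounds the high end.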

The only-partial exclusion of HKSCS in the encoder defined by the current standard creates some truly bizarre corner cases in how it interacts with index-big5's inclusion of the duplicate mappings inherited from GCCS (which many Big5 codecs, even HKSCS-equipped ones such as Python's `big5-hkscs`, do not accept). Some of these duplicate other GCCS/HKSCS codes rather than standard Big5 codes. In four cases, the GCCS duplicate has a lead byte in 0xFA–FE, while its standard HKSCS code has a lead byte in 0x81–A0. Hence, the WHATWG-described behaviour ends up decoding them from both codes, but encoding them to their GCCS duplicates, as follows.

```
0x9DEF → 嘅 U+5605 ↔ 0xFB48
0x9DFB → 廐 U+5ED0 ↔ 0xFBF9
0xA0DC → 悤 U+60A4 ↔ 0xFC6C
0x9975 → 猪 U+732A ↔ 0xFE52
```

Accepting these GCCS duplicates is probably fine, but generating them (when not even all HKSCS-equipped implementations will accept them) is probably inappropriate, even assuming (for the sake of argument) that the encoder's current halfway house between Big5-ETEN and Big5-HKSCS was deliberately chosen.

https://github.com/whatwg/encoding/issues/252

Received on Monday, 15 February 2021 20:36:03 UTC