[whatwg/encoding] Corner cases arising from Big5 encoder not excluding HKSCS codes with lead bytes 0xFA–FE (#252) from HarJIT on 2021-02-15 (public-webapps-github@w3.org from February 2021)

From: HarJIT <notifications@github.com>
Date: Mon, 15 Feb 2021 12:35:50 -0800
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/252@github.com>

https://encoding.spec.whatwg.org/commit-snapshots/4d54adce6a871cb03af3a919cbf644a43c22301a/#visualization


> Let index be index Big5 excluding all entries whose pointer is less than (0xA1 - 0x81) × 157.
> 
> Avoid returning Hong Kong Supplementary Character Set extensions literally.

As became apparent in my attempts to [chart different Big5 and CNS 11643 variants](https://harjit.moe/cns-conc.html): if the intention is to make the encoder purely [Big5-ETEN](https://moztw.org/docs/big5/table/eten.txt), excluding all the further extensions that Big5-HKSCS adds on top of it, then lead bytes 0xFA–FE need to be excluded, not just 0x81–A0.
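To make the pointer arithmetic concrete, here is a minimal Python sketch of the two exclusion rules. It assumes the pointer formula used by the standard's Big5 decoder (157 trail bytes per lead byte, with trail offset 0x40 or 0x62); the function and constant names are hypothetical and chosen only for illustration, and the upper bound is simply the change suggested above, not anything the standard currently defines.

```python
TRAILS_PER_LEAD = 157  # 0x40-0x7E (63 bytes) plus 0xA1-0xFE (94 bytes)

# Lower bound quoted from the standard: pointers below this are skipped by the encoder.
SPEC_LOWER_BOUND = (0xA1 - 0x81) * TRAILS_PER_LEAD
# Hypothetical upper bound matching the suggestion to also drop lead bytes 0xFA-0xFE.
PROPOSED_UPPER_BOUND = (0xFA - 0x81) * TRAILS_PER_LEAD


def big5_pointer(lead: int, trail: int) -> int:
    """Compute the index pointer for a two-byte Big5 code."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        raise ValueError("trail byte out of range")
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * TRAILS_PER_LEAD + (trail - offset)


def excluded_by_current_rule(pointer: int) -> bool:
    """Rule as quoted: only lead bytes 0x81-0xA0 are excluded."""
    return pointer < SPEC_LOWER_BOUND


def excluded_if_eten_only(pointer: int) -> bool:
    """Sketch of the suggested rule: also exclude lead bytes 0xFA-0xFE."""
    return pointer < SPEC_LOWER_BOUND or pointer >= PROPOSED_UPPER_BOUND
```

Under these assumptions, a code such as 0xFB48 passes the current rule but would be rejected by the stricter one, while everything from 0xA140 up to 0xF9FE is unaffected by either.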

The only-partial exclusion of HKSCS in the encoder defined by the current standard creates some truly bizarre corner cases, because of how it interacts with index-big5's inclusion of the duplicate mappings inherited from GCCS (which many Big5 codecs, even HKSCS-equipped ones such as Python's `big5-hkscs`, do not accept). Some of these duplicate other GCCS/HKSCS codes, rather than standard Big5 codes. In four cases, the GCCS duplicate has a lead byte in 0xFA–FE, while the standard HKSCS code has a lead byte in 0x81–A0. Hence, the WHATWG-described behaviour ends up decoding each of these characters from both codes, but encoding it only to its GCCS duplicate, as follows.

```
0x9DEF → 嘅 U+5605 ↔ 0xFB48
0x9DFB → 廐 U+5ED0 ↔ 0xFBF9
0xA0DC → 悤 U+60A4 ↔ 0xFC6C
0x9975 → 猪 U+732A ↔ 0xFE52
```
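The asymmetry in the table can be reproduced with a toy model. The following Python sketch uses only the eight entries listed above rather than the real index-big5, and filters on lead bytes, which is equivalent to the pointer thresholds because pointers are ordered by lead byte. The names `TOY_INDEX`, `decode`, `encode`, and the `exclude_high` flag are hypothetical, and the flag models the suggested extra exclusion rather than any existing behaviour.

```python
# (Big5 code, Unicode code point) pairs taken from the table above.
TOY_INDEX = [
    (0x9975, 0x732A), (0x9DEF, 0x5605), (0x9DFB, 0x5ED0), (0xA0DC, 0x60A4),
    (0xFB48, 0x5605), (0xFBF9, 0x5ED0), (0xFC6C, 0x60A4), (0xFE52, 0x732A),
]


def decode(code: int):
    """Decode a two-byte code via the toy index; every entry is accepted."""
    for big5_code, code_point in TOY_INDEX:
        if big5_code == code:
            return chr(code_point)
    return None


def encode(char: str, exclude_high: bool = False):
    """Return the first non-excluded code for char, or None if there is none."""
    for big5_code, code_point in sorted(TOY_INDEX):
        lead = big5_code >> 8
        if lead < 0xA1:
            continue  # excluded by the rule quoted from the standard
        if exclude_high and lead >= 0xFA:
            continue  # excluded only under the suggested stricter rule
        if code_point == ord(char):
            return big5_code
    return None


if __name__ == "__main__":
    for code, _ in TOY_INDEX[:4]:
        char = decode(code)
        print(f"{code:#06X} -> U+{ord(char):04X} -> {encode(char):#06X}")
```

With `exclude_high=True`, `encode` returns `None` for all four characters in this toy model, since neither of their codes survives the filter.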

Accepting these GCCS duplicates is probably fine, but generating them (when not even all HKSCS-equipped implementations will accept them) is probably inappropriate, even assuming (for the sake of argument) that the encoder's current halfway house between Big5-ETEN and Big5-HKSCS was deliberately chosen.

https://github.com/whatwg/encoding/issues/252

Received on Monday, 15 February 2021 20:36:03 UTC