- From: HarJIT <notifications@github.com>
- Date: Mon, 15 Feb 2021 12:35:50 -0800
- To: whatwg/encoding <encoding@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/encoding/issues/252@github.com>
https://encoding.spec.whatwg.org/commit-snapshots/4d54adce6a871cb03af3a919cbf644a43c22301a/#visualization

> Let index be index Big5 excluding all entries whose pointer is less than (0xA1 - 0x81) × 157.
>
> Avoid returning Hong Kong Supplementary Character Set extensions literally.

As became apparent in my attempts to [chart different Big5 and CNS 11643 variants](https://harjit.moe/cns-conc.html): if the intention is to make the encoder purely [Big5-ETEN](https://moztw.org/docs/big5/table/eten.txt), excluding all further extensions that Big5-HKSCS adds on top of it, then lead bytes 0xFA–FE need to be excluded, not just 0x81–A0.

The only-partial exclusion of HKSCS in the encoder defined by the current standard actually creates some truly bizarre corner cases, because of how it interacts with index-big5's inclusion of the duplicate mappings inherited from GCCS (which a lot of even HKSCS-equipped Big5 codecs, e.g. Python's `big5-hkscs`, do not accept). Some of these duplicate other GCCS/HKSCS codes, rather than standard Big5 codes. In four cases, one of these GCCS duplicates has a lead byte in 0xFA–FE, while its standard HKSCS code has a lead byte in 0x81–A0. Hence, the WHATWG-described behaviour ends up decoding them from both codes, but encoding them to their GCCS duplicates, as follows.

```
0x9DEF → 嘅 U+5605 ↔ 0xFB48
0x9DFB → 廐 U+5ED0 ↔ 0xFBF9
0xA0DC → 悤 U+60A4 ↔ 0xFC6C
0x9975 → 猪 U+732A ↔ 0xFE52
```

Accepting these GCCS duplicates is probably fine, but generating them (when not even all HKSCS-equipped implementations will accept them) is probably inappropriate, even assuming (for the sake of argument) that the encoder's current partway house between Big5-ETEN and Big5-HKSCS was deliberately chosen.
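To make the pointer arithmetic behind those four cases concrete, here is a minimal Python sketch. It assumes the pointer formula as I read it from the Encoding Standard's Big5 decoder (157 trail positions per lead byte, trail offset 0x40 below 0x7F and 0x62 otherwise). The names `big5_pointer`, `encodable_now`, `encodable_if_eten_only` and the `ETEN_HIGH_CUTOFF` constant are hypothetical, introduced only for illustration; the upper cutoff is simply what excluding lead bytes 0xFA–FE would amount to, not anything the standard currently defines.

```python
# Sketch only: the formula is assumed from the Encoding Standard's Big5 decoder.

def big5_pointer(lead: int, trail: int) -> int:
    """Map a Big5 lead/trail byte pair to its index-big5 pointer."""
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)

# Cutoff quoted from the standard: entries with a pointer below it are
# excluded from the encoder (lead bytes 0x81-0xA0).
HKSCS_LOW_CUTOFF = (0xA1 - 0x81) * 157

# Hypothetical upper cutoff: what also excluding lead bytes 0xFA-0xFE
# would look like in pointer terms.
ETEN_HIGH_CUTOFF = (0xFA - 0x81) * 157

# (standard HKSCS code, GCCS duplicate code, character), from the table above.
PAIRS = [
    (0x9DEF, 0xFB48, "\u5605"),
    (0x9DFB, 0xFBF9, "\u5ed0"),
    (0xA0DC, 0xFC6C, "\u60a4"),
    (0x9975, 0xFE52, "\u732a"),
]

def pointer_of(code: int) -> int:
    """Pointer for a two-byte code given as a single integer."""
    return big5_pointer(code >> 8, code & 0xFF)

def encodable_now(code: int) -> bool:
    """True if the current exclusion rule leaves this code available."""
    return pointer_of(code) >= HKSCS_LOW_CUTOFF

def encodable_if_eten_only(code: int) -> bool:
    """True if the code would survive also excluding lead bytes 0xFA-0xFE."""
    return HKSCS_LOW_CUTOFF <= pointer_of(code) < ETEN_HIGH_CUTOFF

if __name__ == "__main__":
    for hkscs, gccs, char in PAIRS:
        print(
            f"U+{ord(char):04X}: "
            f"0x{hkscs:04X} ptr={pointer_of(hkscs)} now={encodable_now(hkscs)}; "
            f"0x{gccs:04X} ptr={pointer_of(gccs)} now={encodable_now(gccs)} "
            f"eten_only={encodable_if_eten_only(gccs)}"
        )
```

Under these assumptions, each standard HKSCS code falls below the existing cutoff while its GCCS duplicate does not, which is why the encoder is left with only the duplicate; with the hypothetical upper cutoff, neither code would be generated.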
Received on Monday, 15 February 2021 20:36:03 UTC