[whatwg/encoding] 0xA3 0xA0 in GB 18030 (Issue #338)

### What is the issue with the Encoding Standard?

https://encoding.spec.whatwg.org/#gb18030-encoder

> Index gb18030 maps 0xA3 0xA0 to U+3000 rather than U+E5E5 for compatibility with deployed content. Therefore it cannot roundtrip.

We didn't update this in https://github.com/whatwg/encoding/pull/336, so I filed this issue to track it.

A Mozilla bug filed in 2002, https://bugzilla.mozilla.org/show_bug.cgi?id=131837, mentions this. The mapping exists because some websites use 0xA3 0xA0 as a space character, which caused display abnormalities, so Mozilla changed the mapping to `U+3000 IDEOGRAPHIC SPACE`.
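The roundtrip problem quoted above can be sketched with a toy model. This is not the real index gb18030: `DECODE_INDEX`, `ENCODE_INDEX`, `decode`, `encode`, and `roundtrips` are hypothetical names holding only the two byte sequences discussed here, and the U+E5E5 check reflects my reading of the linked gb18030 encoder section.

```python
# Toy model only: two entries standing in for the full index gb18030.
DECODE_INDEX = {
    b"\xa1\xa1": "\u3000",  # the actual full-width (ideographic) space
    b"\xa3\xa0": "\u3000",  # compatibility mapping, instead of U+E5E5
}

# An encoder emits one byte sequence per code point, so U+3000 can only
# go back to one of the two byte sequences above.
ENCODE_INDEX = {"\u3000": b"\xa1\xa1"}


def decode(data: bytes) -> str:
    """Decode a single two-byte sequence with the toy index."""
    return DECODE_INDEX[data]


def encode(char: str) -> bytes:
    """Encode a single character with the toy index."""
    if char == "\ue5e5":
        # Assumption: the encoder treats U+E5E5 as an error, since no
        # byte sequence decodes to it any more.
        raise UnicodeEncodeError("toy-gb18030", char, 0, 1, "U+E5E5 is not encodable")
    return ENCODE_INDEX[char]


def roundtrips(data: bytes) -> bool:
    """Return True if decoding and re-encoding gives the same bytes back."""
    return encode(decode(data)) == data


print(roundtrips(b"\xa1\xa1"))
print(roundtrips(b"\xa3\xa0"))
```

In this model 0xA1 0xA1 survives a decode/encode cycle, while 0xA3 0xA0 comes back as 0xA1 0xA1.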

We need to analyze how many websites encoded in GB 18030 still use 0xA3 0xA0 to represent U+3000.

Currently, iconv and ICU seem to map 0xA3 0xA0 to U+E5E5.


---------

The following is some information about this misuse (mostly translated from a [Chinese website](https://www.zhihu.com/question/3935081132)).

The range 0xA3A1 to 0xA3FE of GB 18030-2022 is inherited from row 3 of GB 2312 and contains the G0 set of GB/T 1988-80 (ISO 646-CN). GB 2312 does not specify the width of these characters, but later standards (such as GB 5007.1-85) made it clear that the characters in row 3 are full-width; they are mapped to the Halfwidth and Fullwidth Forms Unicode block.

However, the G0 set of GB/T 1988-80 does not include the space. Influenced by ASCII, people often group the space with the other 94 characters. Someone who assumes that 0xA3A1 to 0xA3FE are full-width ASCII characters (although "$" has been replaced by "¥") is therefore likely to assume that 0xA3 0xA0 is a full-width space, even though the actual full-width space is at 0xA1A1. The mistake goes unnoticed because some fonts display .notdef as a 1 em wide space (undefined PUA code points in GB encodings are displayed as .notdef), so the rendering is the same even though the corresponding Unicode code points differ.
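The "full-width ASCII" assumption described above can be checked against Python's built-in `gb2312` codec. This is a rough sketch: `naive_row3_code_point` is a hypothetical helper that encodes the mistaken mental model, not anything defined by GB 2312 or GB 18030.

```python
def naive_row3_code_point(trail: int) -> int:
    """Mistaken model: 0xA3 0xA1..0xA3 0xFE are ASCII 0x21..0x7E made full-width."""
    if not 0xA1 <= trail <= 0xFE:
        raise ValueError("trail byte outside row 3 of GB 2312")
    return 0xFF01 + (trail - 0xA1)


# Compare the naive model with the codec's actual row 3 mapping and
# collect the trail bytes where they disagree.
mismatches = {}
for trail in range(0xA1, 0xFF):
    actual = bytes([0xA3, trail]).decode("gb2312")
    if ord(actual) != naive_row3_code_point(trail):
        mismatches[trail] = actual

for trail, char in mismatches.items():
    print(f"0xA3 0x{trail:02X} -> U+{ord(char):04X}")

# The full-width space is not in row 3 at all; it is at 0xA1 0xA1.
print(f"0xA1 0xA1 -> U+{ord(bytes([0xA1, 0xA1]).decode('gb2312')):04X}")
```

The naive model holds for most of row 3, which is what makes extending it one position down to 0xA3 0xA0 so tempting. The "$" position is among the exceptions, as noted above.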


Message ID: <whatwg/encoding/issues/338@github.com>

Received on Wednesday, 13 November 2024 01:52:04 UTC