Re: [whatwg/encoding] 0xA3 0xA0 in GB 18030 (Issue #338) from Addison Phillips on 2024-11-21 (public-webapps-github@w3.org from November 2024)

From: Addison Phillips <notifications@github.com>
Date: Thu, 21 Nov 2024 14:28:33 -0800
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/338/2492477006@github.com>

I was [actioned](https://github.com/w3c/i18n-actions/issues/143) by I18N with responding to this issue, which we discussed in our 2024-11-21 call.

As near as I can tell, the code unit sequence 0xA3 0xA0 is not actually assigned in GB18030. A look at CJKV Information Processing suggests that the code space for two-byte sequences does not use the bytes 0xA0 and 0xFF. I do not have a copy of GB18030 handy to look at myself and don't have any direct experience implementing this encoding.

My local converters (Oracle JVM 23.0.1 and ICU4J v76.1) decode this byte sequence to U+E5E5. The reverse (encoding U+E5E5 to GB18030-2022) produces 0xA3 0xA0. I reproduce my code below, in case this is useful. I did not test ICU4C.
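One possible reading of the U+E5E5 result, offered as an assumption rather than something checked against the standard text: GBK-derived tables are commonly described as having a "user-defined area" covering lead bytes 0xA1 to 0xA7 with trail bytes 0x40 to 0xA0 (skipping 0x7F), mapped linearly onto the Private Use Area starting at U+E4C6. Under that assumed layout, 0xA3 0xA0 lands exactly on U+E5E5. The class and method names below are hypothetical and exist only to show the arithmetic.

```java
final class Gb18030Uda3 {
    // Assumed layout (not verified against the GB18030 text):
    // lead bytes 0xA1..0xA7, trail bytes 0x40..0x7E and 0x80..0xA0,
    // mapped row by row onto the PUA starting at U+E4C6.
    static final int FIRST_LEAD = 0xA1;
    static final int LAST_LEAD = 0xA7;
    static final int ROW_SIZE = 96; // 63 trail bytes (0x40..0x7E) + 33 trail bytes (0x80..0xA0)
    static final int FIRST_CODE_POINT = 0xE4C6;

    /** Returns the PUA code point for a two-byte sequence in the assumed area, or -1 if outside it. */
    static int toPrivateUse(int lead, int trail) {
        if (lead < FIRST_LEAD || lead > LAST_LEAD) {
            return -1;
        }
        if (trail < 0x40 || trail > 0xA0 || trail == 0x7F) {
            return -1;
        }
        // 0x7F is skipped, so trail bytes above it shift down by one column.
        int column = trail < 0x7F ? trail - 0x40 : trail - 0x41;
        return FIRST_CODE_POINT + (lead - FIRST_LEAD) * ROW_SIZE + column;
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", toPrivateUse(0xA3, 0xA0)); // prints U+E5E5
    }
}
```

If that layout is right, the sequence is structurally valid but carries no assigned graphic character, which would be consistent with both converters round-tripping it through a private-use code point.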

I've written to Ken Lunde to ask his advice. I do think U+3000 is a tiny bit weird, although the logical character before `!` in ASCII _is_ SPACE, so the logical character before U+FF01 (the full-width `！`) might be IDEOGRAPHIC SPACE?

I do not think that this represents a critical problem, since no data should exist in GB18030 that uses this byte sequence for anything meaningful. Replacing the sequence with one character or another should produce no meaningful difference, unless I'm not understanding something. But past experience with sequences in a legacy encoding producing different results in different converters has generally been that this becomes a problem at a later date. In this case, I don't think any graphical character will ever be assigned to this specific sequence, so it probably makes no difference.

My code:
```java
    public static void encoding338() {
        try {
            Charset gb18030 = Charset.forName("GB18030-2022");
            CharsetDecoder decoder = gb18030.newDecoder();
            ByteBuffer bb = ByteBuffer.wrap(new byte[] { (byte) 0xA3, (byte) 0xA0 });
            CharBuffer cb = decoder.decode(bb);
            System.out.println(Util.native2ascii(cb.toString())); // not standard code but does what you think it does
            
            Charset icu = CharsetICU.forNameICU("GB18030-2022");
            decoder = icu.newDecoder(); // decode with the ICU4J charset this time, not the JDK one
            bb = ByteBuffer.wrap(new byte[] { (byte) 0xA3, (byte) 0xA0 });
            cb = decoder.decode(bb);
            System.out.println(Util.native2ascii(cb.toString()));

            ByteBuffer out = gb18030.encode(cb);
            byte[] bytes = out.array(); // whole backing array; may include unused bytes past out.limit()
            for (byte b : bytes) {
                System.out.print(Integer.toHexString((int) (b &0xFF)));
                System.out.print(' ');
            }
            System.out.println();
            cb.rewind();
            out = icu.encode(cb);
            bytes = out.array();
            for (byte b : bytes) {
                System.out.print(Integer.toHexString((int) (b &0xFF)));
                System.out.print(' ');
            }
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }
```

Produces:

```
\ue5e5
\ue5e5
a3 a0 0 0 
a3 a0 0 0 
```
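The trailing `0 0` in each line of output appears to come from printing the buffer's whole backing array rather than only the bytes the encoder wrote. A minimal sketch of a dump helper that stops at the buffer's limit follows; `HexDump` and `hex` are hypothetical names, not part of the code above.

```java
import java.nio.ByteBuffer;

final class HexDump {
    /** Formats only the bytes between the buffer's position and limit as lowercase hex. */
    static String hex(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate(); // independent position, so the caller's buffer is untouched
        StringBuilder sb = new StringBuilder();
        while (view.hasRemaining()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", view.get() & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A 4-byte buffer holding 2 bytes of content, mimicking an over-allocated encoder result.
        ByteBuffer buf = ByteBuffer.allocate(4);
        buf.put((byte) 0xA3).put((byte) 0xA0).flip();
        System.out.println(hex(buf)); // prints "a3 a0"
    }
}
```

Passing the result of `Charset.encode` to a helper like this would print just the encoded bytes, without the padding.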

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/338#issuecomment-2492477006

Received on Thursday, 21 November 2024 22:28:37 UTC