[Bug 28740] New: GB18030 : specify that it's GB18030-2000 as opposed to GB18030-2005

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740

            Bug ID: 28740
           Summary: GB18030 : specify that it's GB18030-2000 as opposed to
                    GB18030-2005
           Product: WHATWG
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Encoding
          Assignee: annevk@annevk.nl
          Reporter: jshin@chromium.org
        QA Contact: sideshowbarker+encodingspec@gmail.com
                CC: mike@w3.org, www-international@w3.org

GB18030-2005 appears to map some 2-byte sequences to regular Unicode code
points rather than to PUA code points in the BMP.

For instance, GB18030-2000 (and the current encoding spec and ICU's gb18030)
maps \xFE\x51 to U+E816. However, GB18030-2005 appears to map \xFE\x51 to
U+20087. [1]
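
To make the difference between the two mappings concrete, here is a minimal
Python sketch. The helper name classify is hypothetical (it is not part of
the Encoding spec, ICU, or glibc); it only reports whether a code point lies
in the BMP private use area (U+E000 to U+F8FF) or in a supplementary plane.
The last lines read the four UTF-32BE bytes 00 02 00 87 back as a code point.

```python
def classify(cp: int) -> str:
    """Hypothetical helper: say where a code point lives in Unicode."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % cp)
    if cp > 0xFFFF:
        # Planes 1..16 lie outside the BMP.
        return "supplementary plane %d" % (cp >> 16)
    if 0xE000 <= cp <= 0xF8FF:
        return "BMP private use area"
    return "BMP, not private use"


# The two candidate targets for the byte sequence \xFE\x51.
gb2000_target = 0xE816   # GB18030-2000, the Encoding spec, ICU
gb2005_target = 0x20087  # what glibc's iconv reports

# Interpret the UTF-32BE bytes 00 02 00 87 as a single code point.
utf32_bytes = bytes.fromhex("00020087")
from_utf32 = ord(utf32_bytes.decode("utf-32-be"))

if __name__ == "__main__":
    print("U+%04X: %s" % (gb2000_target, classify(gb2000_target)))
    print("U+%04X: %s" % (gb2005_target, classify(gb2005_target)))
    print("UTF-32BE 00 02 00 87 is U+%04X" % from_utf32)
```

So the same two bytes end up either in the private use area or in plane 2,
depending on which mapping a decoder follows.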

The glyph for U+E816 in Simsun in Windows 8 visually matches the code chart
glyph for U+20087 (
http://www.fileformat.info/info/unicode/char/20087/index.htm ).


I don't know how to represent U+E816 in GB18030-2005, because the 4-byte
sequences leave no gap for it. The glibc implementation rejects it as illegal
input, but that may not be the intended behavior. [2]


I propose adding a note to the spec stating that it follows GB18030-2000
rather than GB18030-2005.

[1] 
I couldn't get hold of the GB18030-2005 spec, so I'm using glibc's iconv as a
proxy:

$ printf '\xfe\x51' | LC_ALL=C iconv -t UTF-32BE -f GB18030 | hexdump -C
00000000  00 02 00 87                                     

[2] 
$ printf '\xe8\x16' | LC_ALL=C iconv -f UTF-16BE -t GB18030 | hexdump -C
iconv: illegal input sequence at position 0
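
The same two probes can be repeated against any other implementation. Below is
a minimal sketch using Python's built-in "gb18030" codec. The function names
are hypothetical, and no expected output is given for the GB18030 cases
because the result depends on which revision of the standard the local codec
follows.

```python
from typing import List, Optional


def decode_to_code_points(raw: bytes, codec: str) -> List[int]:
    """Decode raw bytes and return the resulting code points.

    errors="replace" keeps the probe from raising if the codec
    considers the input invalid; U+FFFD then shows up in the result.
    """
    return [ord(ch) for ch in raw.decode(codec, errors="replace")]


def encode_code_point(cp: int, codec: str) -> Optional[bytes]:
    """Encode one code point, or return None if the codec rejects it."""
    try:
        return chr(cp).encode(codec)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Counterpart of [1]: which code point does \xFE\x51 decode to?
    for cp in decode_to_code_points(b"\xfe\x51", "gb18030"):
        print("decoded: U+%04X" % cp)

    # Counterpart of [2]: can U+E816 be encoded at all?
    encoded = encode_code_point(0xE816, "gb18030")
    print("encoded:", "rejected" if encoded is None else encoded.hex())
```

If the decoder prints U+E816, that implementation follows the GB18030-2000
mapping for this sequence. U+20087 would match what glibc does above.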

Received on Tuesday, 2 June 2015 22:03:05 UTC