- From: <bugzilla@jessica.w3.org>
- Date: Tue, 02 Jun 2015 22:03:02 +0000
- To: www-international@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740

    Bug ID: 28740
    Summary: GB18030: specify that it's GB18030-2000 as opposed to GB18030-2005
    Product: WHATWG
    Version: unspecified
    Hardware: PC
    OS: Linux
    Status: NEW
    Severity: normal
    Priority: P2
    Component: Encoding
    Assignee: annevk@annevk.nl
    Reporter: jshin@chromium.org
    QA Contact: sideshowbarker+encodingspec@gmail.com
    CC: mike@w3.org, www-international@w3.org

GB18030-2005 appears to map some 2-byte sequences to regular Unicode code points rather than to PUA code points in the BMP. For instance, GB18030-2000 (along with the current Encoding spec and ICU's gb18030) maps \xFE\x51 to U+E816. However, GB18030-2005 appears to map \xFE\x51 to U+20087. [1]

The glyph for U+E816 in SimSun on Windows 8 visually matches the code chart glyph for U+20087 ( http://www.fileformat.info/info/unicode/char/20087/index.htm ).

I don't know how to represent U+E816 in GB18030-2005, because the 4-byte sequences leave no gap for it. The glibc implementation treats it as illegal, though that may not be the intended behavior. [2]

I propose adding a note to the spec stating that it follows GB18030-2000 rather than GB18030-2005.

[1] I couldn't get hold of the GB18030-2005 spec, so I'm using glibc's iconv as a proxy:

    $ printf '\xfe\x51' | LC_ALL=C iconv -t UTF-32BE -f GB18030 | hexdump -C
    00000000 00 02 00 87

[2]
    $ printf '\xe8\x16' | LC_ALL=C iconv -f UTF-16BE -t GB18030 | hexdump -C
    iconv: illegal input sequence at position 0
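For anyone wanting to run the same check against another implementation, here is a minimal sketch in Python. It is not part of the original report: the helper names decode_code_points and classify_fe51 are hypothetical, and which mapping Python's built-in "gb18030" codec returns depends on the Python version and build, so the script only reports what it finds.

```python
from typing import List


def decode_code_points(raw: bytes, encoding: str = "gb18030") -> List[int]:
    """Decode raw bytes with the given codec and return the Unicode code points."""
    return [ord(ch) for ch in raw.decode(encoding)]


def classify_fe51(encoding: str = "gb18030") -> str:
    """Report which mapping a codec uses for the 2-byte sequence FE 51.

    The two candidate code points are the ones named in the report:
    U+E816 (PUA, GB18030-2000 style) and U+20087 (GB18030-2005 style).
    """
    code_points = decode_code_points(b"\xfe\x51", encoding)
    if code_points == [0xE816]:
        return "2000-style: U+E816 (PUA)"
    if code_points == [0x20087]:
        return "2005-style: U+20087"
    # Anything else is reported verbatim rather than guessed at.
    return "other: " + " ".join("U+%04X" % cp for cp in code_points)


if __name__ == "__main__":
    print(classify_fe51())
```

Like the iconv commands above, this only shows how one implementation behaves; it says nothing authoritative about the text of either edition of the standard.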
Received on Tuesday, 2 June 2015 22:03:05 UTC