- From: <bugzilla@jessica.w3.org>
- Date: Tue, 02 Jun 2015 22:03:02 +0000
- To: www-international@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740
Bug ID: 28740
Summary: GB18030 : specify that it's GB18030-2000 as opposed to
GB18030-2005
Product: WHATWG
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Encoding
Assignee: annevk@annevk.nl
Reporter: jshin@chromium.org
QA Contact: sideshowbarker+encodingspec@gmail.com
CC: mike@w3.org, www-international@w3.org
GB18030-2005 appears to map some 2-byte sequences to regular (assigned) Unicode
code points rather than to Private Use Area (PUA) code points in the BMP.
For instance, GB18030-2000 (and the current encoding spec and ICU's gb18030)
maps \xFE\x51 to U+E816. However, GB18030-2005 appears to map \xFE\x51 to
U+20087. [1]
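
To make the difference between the two targets concrete, here is a minimal sketch in plain shell arithmetic that reports where a code point sits. It assumes only the standard Unicode ranges (BMP Private Use Area is U+E000 to U+F8FF; anything above U+FFFF is in a supplementary plane). The function name `classify` is hypothetical and not part of any spec or tool.

```shell
#!/bin/sh
# classify: hypothetical helper that reports whether a code point lies in
# the BMP Private Use Area, elsewhere in the BMP, or in a supplementary plane.
classify() {
  cp=$(( $1 ))                      # accepts hex such as 0xE816
  if [ "$cp" -ge "$((0xE000))" ] && [ "$cp" -le "$((0xF8FF))" ]; then
    echo "BMP PUA"
  elif [ "$cp" -gt "$((0xFFFF))" ]; then
    echo "supplementary plane $(( cp >> 16 ))"
  else
    echo "BMP non-PUA"
  fi
}

classify 0xE816    # prints: BMP PUA
classify 0x20087   # prints: supplementary plane 2
```

So the GB18030-2000 target for \xFE\x51 is a private-use code point, while the target reported for GB18030-2005 is an assigned code point outside the BMP.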
The glyph for U+E816 in SimSun on Windows 8 visually matches the code chart
glyph for U+20087
( http://www.fileformat.info/info/unicode/char/20087/index.htm ).
I don't know how to represent U+E816 in GB18030-2005, because the 4-byte
sequences leave no gap for it. The glibc implementation rejects it as illegal,
though that may not be the intended behavior. [2]
I propose adding a note to the spec stating that it follows GB18030-2000 rather
than GB18030-2005.
[1]
I couldn't get hold of the GB18030-2005 spec, so I'm using glibc's iconv as a
proxy:
$ printf '\xfe\x51' | LC_ALL=C iconv -t UTF-32BE -f GB18030 | hexdump -C
00000000 00 02 00 87
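
For reading the hexdump line above: the four bytes are one big-endian UTF-32 code unit. A small sketch, using only shell arithmetic, that folds four hex bytes back into a code point follows. The function name `utf32be_to_codepoint` is hypothetical.

```shell
#!/bin/sh
# utf32be_to_codepoint: hypothetical helper that combines four hex bytes
# (most significant first) into a single code point and prints it as U+XXXX.
utf32be_to_codepoint() {
  printf 'U+%04X\n' "$(( (0x$1 << 24) | (0x$2 << 16) | (0x$3 << 8) | 0x$4 ))"
}

utf32be_to_codepoint 00 02 00 87   # prints: U+20087
```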
[2]
$ printf '\xe8\x16' | LC_ALL=C iconv -f UTF-16BE -t GB18030 | hexdump -C
iconv: illegal input sequence at position 0
Received on Tuesday, 2 June 2015 22:03:05 UTC