- From: <bugzilla@jessica.w3.org>
- Date: Wed, 03 Jun 2015 20:44:50 +0000
- To: www-international@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740

Jungshik Shin <jshin@chromium.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GB18030-2000 and            |GB18030-2000 and
                   |GB18030-2005 : Decide what  |GB18030-2005 : Decide what
                   |to do about their           |to do about their
                   |differences                 |differences, especially PUA
                   |                            |codepoints in GB18030-2000

--- Comment #3 from Jungshik Shin <jshin@chromium.org> ---
WebKit and Blink have these for GBK (but not gb18030 [1]).

    switch (character) {
    case 0x01F9:
        return 0xE7C8;
    case 0x1E3F:
        return 0xE7C7;
    case 0x22EF:
        return 0x2026;
    case 0x301C:
        return 0xFF5E;
    }

The snippet above adds four one-way (fromUnicode) mappings:

1. U+01F9 => xA8xBF
   ICU's GBK (windows-936) has U+E7C8 <=> xA8xBF.
   The encoding spec and ICU's gb18030 have U+01F9 <=> xA8xBF.

   This one is easy. I'll change Chrome's GBK to use U+01F9 instead of
   U+E7C8 (PUA) for xA8xBF.

2. U+1E3F => xA8xBC
   ICU's GBK has U+E7C7 <=> xA8xBC, while its gb18030 has
   U+1E3F <=> xA8xBC.
   index-gb18030 also has the PUA mapping (U+E7C7) for xA8xBC.

   U+1E3F has been in Unicode since 1.1.0. Anyway, this may be another
   case of GB18030-2000 vs GB18030-2005. I propose that the spec be
   changed to use U+1E3F for xA8xBC instead of U+E7C7 (PUA).

3. U+22EF => xA1xAD
   All three (the spec, GBK and GB18030 in ICU) have U+2026 <=> xA1xAD.

   U+2026 : Horizontal Ellipsis
   U+22EF : Midline horizontal ellipsis

4. U+301C => xA1xAB
   All three have U+FF5E <=> xA1xAB.

   U+FF5E : full-width tilde
   U+301C : wave dash

#3 and #4 should be dealt with separately, even if we want to consider
them. My gut sense is that they're not that important. I guess WebKit
did that because the old Mac converter uses U+301C and U+22EF instead of
U+FF5E and U+2026.

As I wrote above, #1 is a Chromium issue. Only #2 is relevant here.

We can generalize this bug to decide what to do about PUA code points in
GB18030 and GBK.
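
To make the four cases above easier to compare, here is a minimal sketch of the encode-side substitution together with a private-use check. The function names `gbkEncodeSubstitute` and `isPrivateUse` are hypothetical and are not taken from Blink, WebKit or ICU. Only the four code point pairs come from the quoted switch. The private-use ranges are the standard Unicode ones.

```cpp
// Hypothetical helper (not from Blink/ICU): true if cp lies in one of
// Unicode's three private-use ranges.
constexpr bool isPrivateUse(char32_t cp) {
    return (cp >= 0xE000 && cp <= 0xF8FF)        // BMP Private Use Area
        || (cp >= 0xF0000 && cp <= 0xFFFFD)      // Supplementary Private Use Area-A
        || (cp >= 0x100000 && cp <= 0x10FFFD);   // Supplementary Private Use Area-B
}

// Hypothetical helper mirroring the quoted switch: before encoding to GBK,
// replace a character the converter's table lacks with the code point that
// the table does round-trip. Every other character passes through unchanged.
inline char32_t gbkEncodeSubstitute(char32_t character) {
    switch (character) {
    case 0x01F9: return 0xE7C8;  // case 1: target is a PUA code point
    case 0x1E3F: return 0xE7C7;  // case 2: target is a PUA code point
    case 0x22EF: return 0x2026;  // case 3: midline ellipsis -> horizontal ellipsis
    case 0x301C: return 0xFF5E;  // case 4: wave dash -> full-width tilde
    default:     return character;
    }
}
```

Under these assumptions, `isPrivateUse(gbkEncodeSubstitute(c))` is true only for U+01F9 and U+1E3F, the two cases where the GBK table maps the bytes to a PUA code point. Cases 3 and 4 substitute one regular character for another.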

IMHO, we'd better avoid mapping to PUA code points as much as possible.
If there are regular encoded Unicode characters, we'd better use them
instead. That is more or less in line with using the GB18030-2005
mapping instead of the 2000 one.

[1] Blink code link:
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/wtf/text/TextCodecICU.cpp&l=380

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Received on Wednesday, 3 June 2015 20:44:52 UTC