[Bug 28740] GB18030-2000 and GB18030-2005 : Decide what to do about their differences, especially PUA codepoints in GB18030-2000

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740

Jungshik Shin <jshin@chromium.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GB18030-2000 and            |GB18030-2000 and
                   |GB18030-2005 : Decide what  |GB18030-2005 : Decide what
                   |to do about their           |to do about their
                   |differences                 |differences, especially PUA
                   |                            |codepoints in GB18030-2000

--- Comment #3 from Jungshik Shin <jshin@chromium.org> ---
WebKit and Blink have these for GBK (but not for gb18030) [1]:

    switch (character) {
    case 0x01F9:
        return 0xE7C8;
    case 0x1E3F:
        return 0xE7C7;
    case 0x22EF:
        return 0x2026;
    case 0x301C:
        return 0xFF5E;
    }

The snippet above adds four one-way (fromUnicode) mappings:

1. U+01F9 => xA8xBF

   ICU's GBK (windows-936) has U+E7C8 <=> xA8xBF.
   The encoding spec and ICU's gb18030 have U+01F9 <=> xA8xBF.

   This one is easy. I'll change Chrome's GBK to use U+01F9 instead of U+E7C8
   (PUA) for xA8xBF.

2. U+1E3F => xA8xBC

   ICU's GBK has U+E7C7 <=> xA8xBC, while its gb18030 has U+1E3F <=> xA8xBC.

   index-gb18030 also has a PUA mapping (U+E7C7) for xA8xBC.
   U+1E3F has been in Unicode since 1.1.0.

   Anyway, this may be another case of GB18030-2000 vs. GB18030-2005.

   I propose that the spec be changed to use U+1E3F for xA8xBC instead of
   U+E7C7 (PUA).

3. U+22EF => xA1xAD

   All three (the spec, GBK and GB18030 in ICU) have U+2026 <=> xA1xAD.

   U+2026 : Horizontal Ellipsis 
   U+22EF : Midline horizontal ellipsis

4. U+301C => xA1xAB

   All three have U+FF5E <=> xA1xAB

   U+FF5E : full-width tilde
   U+301C : wave dash
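
To make the "one-way" point concrete, here is a minimal sketch (not the Blink
code). The function names and the four-entry toy table are hypothetical; the
code point and byte values are the ones listed above, with the decode side
following ICU's GBK (windows-936).

```cpp
#include <cstdint>

// Hypothetical name; mirrors the switch quoted above (fromUnicode side only).
inline char32_t foldForGbkEncode(char32_t c) {
    switch (c) {
    case 0x01F9: return 0xE7C8;
    case 0x1E3F: return 0xE7C7;
    case 0x22EF: return 0x2026;
    case 0x301C: return 0xFF5E;
    default:     return c;
    }
}

// Toy two-way table covering only the four byte pairs discussed above.
// Returns 0 for anything else; a real converter has the full GBK table.
inline std::uint16_t toyGbkEncode(char32_t c) {
    switch (c) {
    case 0xE7C8: return 0xA8BF;
    case 0xE7C7: return 0xA8BC;
    case 0x2026: return 0xA1AD;
    case 0xFF5E: return 0xA1AB;
    default:     return 0;
    }
}

// Inverse of toyGbkEncode; returns U+FFFD for unknown byte pairs.
inline char32_t toyGbkDecode(std::uint16_t bytes) {
    switch (bytes) {
    case 0xA8BF: return 0xE7C8;
    case 0xA8BC: return 0xE7C7;
    case 0xA1AD: return 0x2026;
    case 0xA1AB: return 0xFF5E;
    default:     return 0xFFFD;
    }
}

// True if encoding (with the fold) and then decoding gives back c.
inline bool roundTrips(char32_t c) {
    return toyGbkDecode(toyGbkEncode(foldForGbkEncode(c))) == c;
}
```

With this sketch, roundTrips(0x1E3F) is false (xA8xBC decodes back to U+E7C7),
while roundTrips(0xE7C7) is true. That asymmetry is what makes the mapping
one-way.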

#3 and #4 should be dealt with separately, if we want to consider them at all.
My gut sense is that they are not that important. I guess WebKit did that
because the old Mac converter uses U+301C and U+22EF instead of U+FF5E and
U+2026.


As I wrote above, #1 is a Chromium issue. 

Only #2 is relevant here. 

We can generalize this bug to decide what to do about PUA code points in
GB18030 and GBK. 

IMHO, we should avoid mapping to PUA code points as much as possible. Where
regular encoded Unicode characters exist, we should use them instead. That is
more or less in line with using the GB18030-2005 mapping instead of the 2000
one.
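
As a rough sketch of that preference (the function names are hypothetical, and
this is not taken from any existing converter), a table-building step could
test for the Unicode private-use ranges and keep a PUA target only when no
regular code point is available:

```cpp
// Unicode private-use ranges: U+E000..U+F8FF in the BMP, plus the
// supplementary private-use areas in planes 15 and 16.
inline bool isPrivateUse(char32_t c) {
    return (c >= 0xE000 && c <= 0xF8FF) ||
           (c >= 0xF0000 && c <= 0xFFFFD) ||
           (c >= 0x100000 && c <= 0x10FFFD);
}

// Hypothetical policy helper: given the code point a table currently maps a
// byte sequence to and an alternative candidate, switch to the candidate only
// when that replaces a PUA code point with a regular one.
inline char32_t preferNonPua(char32_t current, char32_t candidate) {
    if (isPrivateUse(current) && !isPrivateUse(candidate))
        return candidate;
    return current;
}
```

For xA8xBC, preferNonPua(0xE7C7, 0x1E3F) yields U+1E3F, which matches the
change proposed in #2.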

[1] Blink code link:
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/wtf/text/TextCodecICU.cpp&l=380


Received on Wednesday, 3 June 2015 20:44:52 UTC