[Bug 28141] treatment of invalid 2-byte sequence is different in different CJK encodings


--- Comment #4 from Jungshik Shin <jshin@chromium.org> ---
ICU treats an 'illegal' byte sequence differently from a byte sequence that
is structurally valid but 'unassigned' to any Unicode character.

For instance, in EUC-KR (windows-949), <FE A1> is a valid byte sequence, but is
not assigned to any character. So, the sequence as a whole is turned into a
single U+FFFD.

Without tightening the valid trail byte range for EUC-KR [1], <FE 41> is a
valid byte sequence and is converted to U+FFFD (exactly the same treatment as
<FE A1>).

OTOH, <FE 22> has an illegal trail byte (because 0x22 is outside the trail byte
range for EUC-KR/Windows-949) and is turned to <U+FFFD, U+0022>: only the lead
byte is replaced, and 0x22 is then decoded on its own.

The same is true of Shift_JIS. Because [80-FC] is the valid trail byte range,
<EB 9F> is turned to a single U+FFFD (there's no mapped character at this
position), rather than U+FFFD being emitted and 0x9F being put back into the
stream.
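
To make the illegal vs. unassigned distinction concrete, here is a minimal
Python sketch of the behaviour described above. It is a toy model, not ICU's
(or Blink's) code: decode_two_byte, the range predicates and the one-entry
mapping table are all hypothetical names made up for illustration, the byte
ranges are assumptions, and the 'tightened' predicate is only a guess at the
shape of the change in [1] (the footnote mentions 0x41 only). Single-byte
Shift_JIS katakana is ignored.

```python
REPLACEMENT = "\ufffd"


def decode_two_byte(data, is_lead, is_trail, table):
    """Toy decoder: ASCII passes through, anything else is lead + trail."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
        elif is_lead(lead) and i + 1 < len(data) and is_trail(lead, data[i + 1]):
            # Structurally valid pair: consume both bytes. If the pair is
            # unassigned, the whole sequence becomes ONE U+FFFD.
            out.append(table.get((lead, data[i + 1]), REPLACEMENT))
            i += 2
        else:
            # Illegal sequence: replace only this byte. The following byte
            # stays in the stream and is decoded on the next iteration.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)


# Assumed ranges for EUC-KR / windows-949 (illustrative, not normative).
def euckr_lead(b):
    return 0x81 <= b <= 0xFE


def euckr_trail(lead, b):
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A or 0x81 <= b <= 0xFE


# Hypothetical model of the tightening in [1]: for lead >= 0xC8, only
# 0xA1-0xFE is accepted as a trail byte, so 0x41 is rejected.
def euckr_trail_tightened(lead, b):
    if lead >= 0xC8:
        return 0xA1 <= b <= 0xFE
    return euckr_trail(lead, b)


# Assumed ranges for Shift_JIS; [80-FC] is the part quoted in this comment.
def sjis_lead(b):
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def sjis_trail(lead, b):
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


# One assigned pair for contrast; every other pair counts as unassigned here.
EUCKR_TABLE = {(0xB0, 0xA1): "\uac00"}


def code_points(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)


if __name__ == "__main__":
    for raw in (b"\xfe\xa1", b"\xfe\x41", b"\xfe\x22"):
        text = decode_two_byte(raw, euckr_lead, euckr_trail, EUCKR_TABLE)
        print(raw.hex(" "), "->", code_points(text))
    # fe a1 -> U+FFFD
    # fe 41 -> U+FFFD
    # fe 22 -> U+FFFD U+0022
    print(code_points(decode_two_byte(b"\xeb\x9f", sjis_lead, sjis_trail, {})))
    # U+FFFD
    print(code_points(decode_two_byte(
        b"\xfe\x41", euckr_lead, euckr_trail_tightened, EUCKR_TABLE)))
    # U+FFFD U+0041
```

The only branch that matters is the last one: an illegal trail byte is left in
the stream, so it shows up in the output as its own character, while an
unassigned pair swallows both bytes.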

[1] Blink is just tightening up the valid trail byte range so that 0x41 will
no longer be a valid trail byte if the lead byte is 0xC8 or higher.


Received on Wednesday, 18 March 2015 21:13:32 UTC