[Bug 28141] treatment of invalid 2-byte sequence is different in different CJK encodings from bugzilla@jessica.w3.org on 2015-03-18 (www-international@w3.org from January to March 2015)

From: <bugzilla@jessica.w3.org>
Date: Wed, 18 Mar 2015 21:13:31 +0000
To: www-international@w3.org
Message-ID: <bug-28141-4285-49YND1dJal@http.www.w3.org/Bugs/Public/>

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28141

--- Comment #4 from Jungshik Shin <jshin@chromium.org> ---
ICU treats an 'illegal' byte sequence differently from a byte sequence that is
well-formed but 'unassigned', i.e. not mapped to any Unicode character.

For instance, in EUC-KR (windows-949), <FE A1> is a valid byte sequence, but is
not assigned any character. So, the sequence as a whole is turned to U+FFFD. 

Without tightening the valid trail byte range for EUC-KR [1], <FE 41> is a
valid byte sequence and is converted to U+FFFD (exactly the same treatment as
<FE A1>).

OTOH, <FE 22> has an illegal trail byte (because 0x22 is outside the trail byte
range for EUC-KR/windows-949) and is turned to <U+FFFD, U+0022>.
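
To make the distinction concrete, here is a rough sketch of the two behaviours
in Python. It is not ICU's code. The function name, the ranges and the
one-entry mapping table are assumptions for illustration only (the ranges are
the commonly cited windows-949 ones: lead 81-FE, trail 41-5A, 61-7A, 81-FE).

```python
REPLACEMENT = "\ufffd"

# Assumed ranges for illustration (commonly cited for windows-949).
LEAD_RANGE = (0x81, 0xFE)
TRAIL_RANGES = ((0x41, 0x5A), (0x61, 0x7A), (0x81, 0xFE))

# Toy mapping table with a single assigned pair: <B0 A1> -> U+AC00.
TABLE = {(0xB0, 0xA1): "\uac00"}


def decode_two_byte(data, table):
    """Hypothetical decoder separating 'unassigned' from 'illegal'."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # ASCII passes through unchanged.
            out.append(chr(lead))
            i += 1
            continue
        is_lead = LEAD_RANGE[0] <= lead <= LEAD_RANGE[1]
        if not is_lead or i + 1 >= len(data):
            # Not a lead byte, or a lead byte truncated at end of input.
            out.append(REPLACEMENT)
            i += 1
            continue
        trail = data[i + 1]
        if any(lo <= trail <= hi for lo, hi in TRAIL_RANGES):
            # Well-formed pair: consume both bytes. If the pair is
            # unassigned, the sequence as a whole becomes one U+FFFD.
            out.append(table.get((lead, trail), REPLACEMENT))
            i += 2
        else:
            # Illegal trail byte: U+FFFD for the lead only; the trail
            # byte stays in the stream and is decoded on its own.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)
```

With this sketch, b"\xfe\xa1" and b"\xfe\x41" each decode to a single U+FFFD,
while b"\xfe\x22" decodes to U+FFFD followed by U+0022.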


The same is true of Shift_JIS. Because 0x9F is within the valid trail byte
range [80-FC], <EB 9F> is turned to a single U+FFFD (there's no mapped
character at this position) instead of U+FFFD being emitted and 0x9F being
added back to the stream.



[1] Blink is just tightening up the valid trail byte range so that 0x41 will
not be valid any more if the lead byte is C8 or higher.
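
A lead-dependent trail check of the kind described in [1] might look roughly
like the following. This is only a sketch, not Blink's actual change: the
function name and the `tightened` flag are hypothetical, and the assumption
that only A1-FE remains valid for leads C8 and above is mine. The note above
only says that 0x41 stops being valid there.

```python
def is_valid_trail(lead, trail, tightened=False):
    """Hypothetical trail-byte check whose result depends on the lead byte."""
    if tightened and lead >= 0xC8:
        # Assumption: for lead bytes C8 and above only A1-FE stays valid.
        return 0xA1 <= trail <= 0xFE
    # Untightened ranges (commonly cited for windows-949).
    return (
        0x41 <= trail <= 0x5A
        or 0x61 <= trail <= 0x7A
        or 0x81 <= trail <= 0xFE
    )
```

Under that assumption <FE 41> moves from the 'unassigned' case (one U+FFFD)
to the 'illegal' case (<U+FFFD, U+0041>), while <FE A1> is unaffected.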


Received on Wednesday, 18 March 2015 21:13:32 UTC