- From: <bugzilla@jessica.w3.org>
- Date: Wed, 04 Mar 2015 23:33:24 +0000
- To: www-international@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=28141

            Bug ID: 28141
           Summary: treatment of invalid 2-byte sequence is different in
                    different CJK encodings
           Product: WHATWG
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Encoding
          Assignee: annevk@annevk.nl
          Reporter: jshin@chromium.org
        QA Contact: sideshowbarker+encodingspec@gmail.com
                CC: mike@w3.org, www-international@w3.org

Per bug 16691 comment 15, I'm tightening Blink's encoding tables for CJK
encodings to handle unmappable 2-byte sequences in a safe manner.

The current spec has the following provisions after looking up |pointer|.

* EUC-KR decoder
  If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
  stream.

* Big5 decoder
  If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
  stream.

* Shift_JIS decoder
  If pointer is null, prepend byte to stream.

* EUC-JP decoder
  If byte is not in the range 0xA1 to 0xFE, prepend byte to stream.

* GB18030 decoder
  If pointer is null, prepend byte to stream.

For now, let's put aside EUC-JP and GB18030. I don't see a reason to make the
SJIS decoder behave differently from the EUC-KR and Big5 decoders. One possible
reason may be that each byte in the range [0xA1, 0xDF] is a character by
itself in SJIS.

In SJIS, "0xFC 0xE0" [1] is turned into U+FFFD, but the second byte (0xE0)
becomes the lead byte of what follows. In EUC-KR, "0xFE 0xE0" is turned into
U+FFFD and the next lead byte is taken from the byte *after* 0xE0.

Shouldn't we change the part of the SJIS decoder quoted above to the following?

  If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to
  stream.
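
To make the difference between the two provisions concrete, here is a minimal
Python sketch. It is not the Encoding Standard's algorithm and does not use the
real Shift_JIS or EUC-KR indexes: the lead-byte range, the one-entry TOY_TABLE,
and the names decode_two_byte, always_prepend and prepend_if_ascii are all
assumptions made up for illustration. It only shows how the choice of
"prepend byte to stream" rule changes where the decoder resynchronizes after
an unmappable 2-byte sequence.

```python
REPLACEMENT = "\ufffd"

# Hypothetical one-entry index: (lead, trail) -> character.
# A real decoder would consult the encoding's index table instead.
TOY_TABLE = {(0xE0, 0xA1): "X"}


def always_prepend(byte):
    # Rule quoted for the Shift_JIS decoder: prepend whenever pointer is null.
    return True


def prepend_if_ascii(byte):
    # Rule quoted for EUC-KR and Big5: prepend only for 0x00 to 0x7F.
    return byte <= 0x7F


def decode_two_byte(data, prepend_rule):
    """Toy decoder: ASCII passes through, 0x81..0xFE start a 2-byte sequence."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        i += 1
        if lead <= 0x7F:
            out.append(chr(lead))
            continue
        if not 0x81 <= lead <= 0xFE:
            out.append(REPLACEMENT)
            continue
        if i >= len(data):
            # Input ended in the middle of a 2-byte sequence.
            out.append(REPLACEMENT)
            break
        trail = data[i]
        i += 1
        char = TOY_TABLE.get((lead, trail))
        if char is not None:
            out.append(char)
            continue
        # "pointer is null": emit U+FFFD, then decide whether the second
        # byte is put back on the stream to be decoded again.
        out.append(REPLACEMENT)
        if prepend_rule(trail):
            i -= 1
    return "".join(out)


SAMPLE = bytes([0xFC, 0xE0, 0xA1, 0x41])

# 0xE0 is put back and becomes the lead byte of the next sequence.
print(ascii(decode_two_byte(SAMPLE, always_prepend)))    # '\ufffdXA'
# 0xE0 is consumed; decoding resumes at 0xA1, and the ASCII 0x41 survives.
print(ascii(decode_two_byte(SAMPLE, prepend_if_ascii)))  # '\ufffd\ufffdA'
```

In both cases the trailing ASCII byte is preserved; the rules differ only in
whether a non-ASCII second byte gets a second chance to act as a lead byte.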
Received on Wednesday, 4 March 2015 23:33:26 UTC