[whatwg/encoding] Should not prepend gb18030 second and third when range decode fails upon fourth (#110)

Consider a [run of byte pairs](https://hsivonen.com/test/moz/gb18030-alignment.htm) where each pair starts with a byte that's valid as first or third and the second byte is valid as second or fourth in a four-byte gb18030 sequence.

The first two pairs form a four-byte sequence that fails the gb18030 ranges lookup. The spec, as written, realigns the decoder so that the second pair of that out-of-range four-byte sequence is then treated as the first half of another four-byte sequence.

The test case linked above decodes to four replacement characters in Firefox and Chromium, and to four question marks in IE and Edge. (I didn't test Safari.) However, per spec, the run decodes into "�9¶ĭ¶�9".

It can't be good for a single error to misalign the decoder such that subsequent non-error characters come into existence where browsers previously produced none.

Suggested fix:
When processing the fourth byte in a sequence, first check if it is in the range 0x30 to 0x39, inclusive. If it isn't, prepend the second, third and fourth bytes in the sequence per current spec and return error.

Then set code point to the index gb18030 ranges code point. If code point is null, prepend only the fourth byte and return error.
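
A minimal sketch of the fourth-byte step described above, for comparing the current and the suggested behavior. The names `fourth_byte_step`, `ERROR`, `ranges_lookup` and `proposed` are hypothetical and not from the spec; `ranges_lookup` stands in for "index gb18030 ranges code point" and is assumed to return `None` for null. The pointer arithmetic follows the spec's gb18030 decoder as I understand it.

```python
from typing import Callable, List, Optional, Tuple, Union

ERROR = "error"  # hypothetical stand-in for the spec's "return error"


def fourth_byte_step(
    first: int,
    second: int,
    third: int,
    byte: int,
    ranges_lookup: Callable[[int], Optional[int]],
    proposed: bool = True,
) -> Tuple[Union[int, str], List[int]]:
    """Handle the fourth byte of a gb18030 four-byte sequence.

    Returns (result, bytes_to_prepend), where result is either a code
    point or ERROR, and bytes_to_prepend are pushed back onto the stream.
    """
    # Fourth byte is not a digit: same in the current spec and in the
    # suggested fix. Prepend second, third and fourth, then return error.
    if not 0x30 <= byte <= 0x39:
        return ERROR, [second, third, byte]

    # Pointer into the gb18030 ranges index.
    pointer = (
        (first - 0x81) * (10 * 126 * 10)
        + (second - 0x30) * (10 * 126)
        + (third - 0x81) * 10
        + (byte - 0x30)
    )
    code_point = ranges_lookup(pointer)

    if code_point is None:
        if proposed:
            # Suggested fix: only the fourth (ASCII) byte is unmasked.
            return ERROR, [byte]
        # Current spec: second, third and fourth are all prepended, so
        # the third byte can start a new, misaligned sequence.
        return ERROR, [second, third, byte]

    return code_point, []
```

With a stub lookup that always fails and illustrative bytes (not taken from the linked test case), `fourth_byte_step(0x84, 0x39, 0xB6, 0x39, lambda p: None)` prepends only `[0x39]`, whereas passing `proposed=False` prepends `[0x39, 0xB6, 0x39]`.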

This would deviate from the current browser behavior by unmasking the last ASCII byte in the sequence the way we always unmask failed ASCII in two-byte sequences. However, unlike in the current spec, the ASCII byte that's second in the sequence would get swallowed into a REPLACEMENT CHARACTER. This is fine, because we know that ASCII byte had non-ASCII bytes before and after.

(Trying to emit two errors with an ASCII character in between as one step would be a novel behavior for the spec and would totally mess up the assumptions I've made about error emission in implementation.)


Received on Thursday, 11 May 2017 09:32:31 UTC