[whatwg/encoding] Big5 encoding mishandles some trailing bytes, with possible XSS (#171)

Some byte sequences are valid lead/trail pairs according to the description at https://encoding.spec.whatwg.org/#big5-decoder, but have no corresponding Unicode code point in the `index-big5.txt` mapping table.

In this case the first byte is converted to U+FFFD, but the second one is left "as is". In some cases that second byte can be a backslash (`5C`), which can be used to escape the end of a string in JavaScript, potentially resulting in XSS.

Example (attached): the `83 5C` sequence.
According to the algorithm at https://encoding.spec.whatwg.org/#big5-decoder:

* the first byte is a lead byte (case 5, byte is in the range 0x81 to 0xFE, inclusive)
* for the second byte we have case 3
    * 3.1. Let offset be 0x40 (byte is less than 0x7F)
    * 3.2. byte is in the range 0x40 to 0x7E, inclusive => set pointer to `(lead − 0x81) × 157 + (byte − offset)`. The result of that is `(0x83 - 0x81) * 157 + (0x5C - 0x40)` which is `0x156`
    * There is no mapping for 0x156 in `index-big5.txt`, so the code point is null. Because the `byte is an ASCII byte`, we `prepend byte to stream` (case 3.6) and `Return error` (case 3.7)
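
The steps above can be sketched in Python as follows. This is only an illustration of the walkthrough, not the spec's reference implementation: the names `big5_pointer` and `decode_pair` are hypothetical, `index` is a plain dict standing in for `index-big5.txt` (left empty here, which models the missing entry for pointer 0x156), and the spec's handling of the few pointers that map to two code points is omitted.

```python
from typing import Dict, List, Optional


def big5_pointer(lead: int, trail: int) -> Optional[int]:
    """Pointer computation for a lead/trail pair (steps 3.1 and 3.2 above)."""
    offset = 0x40 if trail < 0x7F else 0x62
    if 0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE:
        return (lead - 0x81) * 157 + (trail - offset)
    return None


def decode_pair(lead: int, trail: int, index: Dict[int, int]) -> List[str]:
    """Decode one lead/trail pair against a hypothetical `index` dict."""
    pointer = big5_pointer(lead, trail)
    code_point = index.get(pointer) if pointer is not None else None
    if code_point is not None:
        return [chr(code_point)]
    # No mapping: the decoder returns an error, which becomes U+FFFD.
    out = ["\ufffd"]
    if trail <= 0x7F:
        # An ASCII trail byte is prepended to the stream, so the next
        # decoder step emits it unchanged as an ASCII code point.
        out.append(chr(trail))
    return out


pointer = big5_pointer(0x83, 0x5C)
result = "".join(decode_pair(0x83, 0x5C, index={}))
print(hex(pointer), [hex(ord(c)) for c in result])
```

With an empty `index`, the pair `83 5C` yields two output characters: U+FFFD followed by the backslash.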

The end result is U+FFFD (from the error) followed by 0x5C (the trailing byte, "as is").

You can see this in the attached file.
When opened in either Chrome or Firefox, the text is rendered as the "Unicode REPLACEMENT CHARACTER" (correct) followed by a backslash (incorrect).

This is a valid lead-trail byte sequence that should be replaced either by a single U+FFFD character or by two U+FFFD characters. But the trailing byte should definitely not be left "as is".

A possible exploit can use the trailing byte (which is a backslash) to escape the closing quote of a string, for example.
Checking the console in Firefox, you will see the 'SyntaxError: "" string literal contains an unescaped line break' message. In Chrome the message is 'Uncaught SyntaxError: Invalid or unexpected token'.
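
To illustrate how the leftover backslash breaks a string literal, here is a minimal Python sketch. It is a simplified model, not a Big5 decoder: the hypothetical `decode_unmapped` passes ASCII through and treats every lead/trail pair as unmapped, an assumption that matches `83 5C` as described in this report but not Big5 in general. The script line is a made-up example, not the content of the attached file.

```python
def decode_unmapped(data: bytes) -> str:
    """Simplified model: ASCII passes through, every pair is unmapped."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte <= 0x7F:
            out.append(chr(byte))
            i += 1
        elif 0x81 <= byte <= 0xFE and i + 1 < len(data):
            trail = data[i + 1]
            out.append("\ufffd")
            # An ASCII trail byte goes back to the stream, so only the
            # lead byte is consumed; otherwise both bytes are consumed.
            i += 1 if trail <= 0x7F else 2
        else:
            out.append("\ufffd")
            i += 1
    return "".join(out)


# Hypothetical script line whose string content is the bytes 83 5C.
script = b'var s = "' + b"\x83\x5c" + b'";\n'
decoded = decode_unmapped(script)
print(decoded)
```

In the decoded text the backslash sits directly before the closing quote, so the JavaScript parser reads `\"` as an escaped quote and the string literal runs on to the line break, which is consistent with the console errors quoted above.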

I did not check, but this might also happen in other DBCS (Double-Byte Character Set) encodings that allow the second byte to be in the ASCII range (for instance, Shift JIS?).

[big5.zip](https://github.com/whatwg/encoding/files/2788604/big5.zip)


Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/171

Received on Wednesday, 23 January 2019 18:49:01 UTC