- From: Mihai Nita <notifications@github.com>
- Date: Thu, 24 Jan 2019 10:59:43 -0800
- To: whatwg/encoding <encoding@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/encoding/issues/171/457315582@github.com>
Thank you for the quick answer.

**==== TLDR ====**

The linked bugs are about `82 22`, which is an invalid Shift_JIS byte sequence. `83 5C` is a valid Big5 sequence with no Unicode mapping. Because `83 5C` is a **valid sequence** but has no mapping, the **whole sequence** should be replaced with `U+FFFD` (one or two of them, TBD), not just the lead byte.

**==== The long version ====**

I have tried to carefully read the links provided, but I think this is kind of the opposite case...

The bugs are about `82 22`, which is indeed an invalid byte sequence for Shift_JIS. That handling is correct, because it is a valid lead byte followed by an invalid trail byte:

> If byte is in the range 0x40 to 0x7E, inclusive, or 0x80 to 0xFC, inclusive, set pointer to (lead − lead offset) × 188 + byte − offset.

0x22 is not in the [0x40, 0x7E] range (nor in [0x80, 0xFC]).

This case is different: `83 5C` is a valid Big5 sequence (valid lead + valid trail), **but it has no Unicode mapping**, at least not in the table included with the spec (`index-big5.txt`).

This can lead to an exploit similar to the one quoted in the bug (from 2011):

> a shift_jis lead byte 0x82 was used to “mask” a 0x22 trail byte in a JSON resource
> of which an attacker could control some field. The producer did not see the problem even though
> this is an illegal byte combination.

In this case `83 5C` can be used to mask a `0x22` (double quote) coming **AFTER** the `5C`. One might not see the problem, because `83 5C` is a valid Big5 byte sequence. So `83 5C 22` maps (sometimes :-) to `U+F00E"` (a PUA character followed by a double quote), and with some other mappings it is converted to `U+FFFD\"`, which escapes the quote.

To make things more interesting, that sequence is actually mapped to a PUA character (`U+F00E`) in the Microsoft implementation (https://en.wikipedia.org/wiki/Code_page_950). Not really relevant to the bug...

----

Considering that there are many Big5 extensions (and tables), this is quite likely to happen (https://en.wikipedia.org/wiki/Big5).
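
To make the masking concrete, here is a minimal sketch of the two-byte Big5 step under both policies. It follows the structure of the spec's decoder as I understand it, but it is not the spec's algorithm verbatim: `INDEX`, `decode`, and `replace_whole_pair` are illustrative names, and `INDEX` is an empty stand-in for `index-big5.txt` (which is enough here, since `83 5C` has no entry in it).

```python
FFFD = "\uFFFD"

# Hypothetical stand-in for index-big5.txt (pointer -> character).
# Left empty on purpose: 83 5C has no entry in the real table either.
INDEX = {}


def is_shift_jis_trail(b):
    # Trail range from the Shift_JIS text quoted above.
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def is_big5_trail(b):
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE


def big5_pointer(lead, trail):
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)


def decode(data, replace_whole_pair):
    """Toy Big5 decoder; replace_whole_pair selects the proposed policy."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # ASCII passes through
            out.append(chr(b))
            i += 1
            continue
        if not (0x81 <= b <= 0xFE) or i + 1 == len(data):
            out.append(FFFD)  # not a lead byte, or lead byte at end of input
            i += 1
            continue
        trail = data[i + 1]
        if is_big5_trail(trail):
            ch = INDEX.get(big5_pointer(b, trail))
            if ch is not None:
                out.append(ch)
                i += 2
                continue
            if replace_whole_pair:
                # Proposed: a valid but unmapped pair is consumed as a unit.
                out.append(FFFD)
                i += 2
                continue
        # Error path: emit U+FFFD; an ASCII trail byte is put back
        # and decoded again on its own.
        out.append(FFFD)
        i += 1 if trail < 0x80 else 2
    return "".join(out)


payload = bytes([0x83, 0x5C, 0x22])
print(repr(decode(payload, replace_whole_pair=False)))  # '\ufffd\\"'
print(repr(decode(payload, replace_whole_pair=True)))   # '\ufffd"'
```

With the lead-only policy the `5C` survives as a backslash and escapes the quote that follows it; with the whole-pair policy the quote stays unescaped, as the producer intended.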

Most Big5 extensions don't even have an IANA registration. In fact, I discovered this exactly from client-side JSON parsing (using this algorithm), with data produced server-side (using the Big5-HKSCS table by default). None of these extensions are registered with IANA, so there is no standard way to communicate that information to another client.

--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/171#issuecomment-457315582
Received on Thursday, 24 January 2019 19:00:05 UTC