Re: [whatwg/encoding] Big5 encoding mishandles some trailing bytes, with possible XSS (#171)

Thank you for the quick answer.

**==== TLDR ====**

`82 22` is an invalid Shift_JIS byte sequence.
`83 5C` is a valid Big 5 sequence, with no Unicode mapping.

Because `83 5C` is a **valid sequence** but has no mapping, the **whole sequence** should be replaced with `U+FFFD` (one or two of them, TBD), not just the lead byte.

**==== The long version ====**

I have tried to carefully read the links provided, but I think this is kind of the opposite...
The bugs are about `82 22`, which is indeed an invalid byte sequence for Shift_JIS.

That handling is correct, because it is a 'valid lead' followed by an 'invalid trailing' byte:
> If byte is in the range 0x40 to 0x7E, inclusive, or 0x80 to 0xFC, inclusive, set pointer to (lead − lead offset) × 188 + byte − offset.
0x22 is not in the [0x40, 0x7E] range
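
To double-check that rule, here is a minimal Python sketch of the pointer computation as I read it from the Shift_JIS decoder steps. The name `shift_jis_pointer` is made up for illustration, and the offsets (0x40/0x41 for the trail byte, 0x81/0xC1 for the lead) are my reading of the spec, not copied from an implementation.

```python
def shift_jis_pointer(lead, byte):
    """Return the index pointer for a lead/trail pair, or None when the
    trail byte is outside the ranges quoted above."""
    # Offsets as I understand the spec (assumption, see lead-in).
    offset = 0x40 if byte < 0x7F else 0x41
    lead_offset = 0x81 if lead < 0xA0 else 0xC1
    if 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC:
        return (lead - lead_offset) * 188 + byte - offset
    return None  # invalid trail byte: no pointer at all


print(shift_jis_pointer(0x82, 0x22))  # None
```

So for `82 22` there is never a pointer to look up, which is why only the lead byte is replaced there.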

This is different, as `83 5C` is a valid Big 5 sequence, valid lead + valid trailing, **but no Unicode mapping**, at least not in the table included with the spec (`index-big5.txt`).
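
The same check for Big 5, again as a rough sketch of how I read the decoder steps (the function name `big5_pointer` is hypothetical; the trail ranges 0x40 to 0x7E and 0xA1 to 0xFE and the factor 157 are what I understand the spec to use):

```python
def big5_pointer(lead, byte):
    """Return the index pointer for a Big5 lead/trail pair, or None when
    the trail byte is not in a valid trail range."""
    offset = 0x40 if byte < 0x7F else 0x62
    if 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE:
        return (lead - 0x81) * 157 + (byte - offset)
    return None


print(big5_pointer(0x83, 0x5C))  # 342
print(big5_pointer(0x83, 0x22))  # None
```

Unlike `82 22`, the pair `83 5C` does produce a pointer; the lookup only fails afterwards because the table has no entry for it.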

This can lead to an exploit similar to the one quoted in the bug (from 2011):
> a shift_jis lead byte 0x82 was used to “mask” a 0x22 trail byte in a JSON resource
> of which an attacker could control some field. The producer did not see the problem even though
> this is an illegal byte combination.

In this case `83 5C` can be used to mask a `0x22` (double quote) coming **AFTER** `5C`

One might not see the problem, because `83 5C` is a valid Big 5 byte sequence.
So `83 5C 22` maps (sometimes :-) to `U+F00E"` (a PUA character followed by a double quote).
And with some other mappings it is converted to `U+FFFD\"`, which escapes the quote.
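
To make the mismatch concrete, here is a simplified Python sketch of the decoder loop as I understand it. It is not the spec algorithm verbatim: it leaves out the few special-cased pointers that decode to two code points, and `index` is a plain dict standing in for `index-big5.txt`. The `consume_unmapped_trail` switch is purely hypothetical and only illustrates the behaviour I am suggesting.

```python
def big5_decode(data, index, consume_unmapped_trail=False):
    """Decode bytes with a simplified Big5 decoder (sketch, see lead-in).

    index: dict of pointer -> code point, a stand-in for index-big5.txt.
    consume_unmapped_trail: hypothetical option; when True, an ASCII trail
    byte that is in a valid trail range but has no mapping is swallowed
    together with its lead instead of being decoded again.
    """
    out = []
    queue = list(data)
    lead = 0
    while queue:
        byte = queue.pop(0)
        if lead:
            current_lead, lead = lead, 0
            pointer = None
            offset = 0x40 if byte < 0x7F else 0x62
            if 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE:
                pointer = (current_lead - 0x81) * 157 + (byte - offset)
            code_point = index.get(pointer)
            if code_point is not None:
                out.append(chr(code_point))
                continue
            valid_trail = pointer is not None
            # Current behaviour: an ASCII trail byte is put back and decoded
            # on its own, so only the lead byte becomes U+FFFD.
            if byte < 0x80 and not (consume_unmapped_trail and valid_trail):
                queue.insert(0, byte)
            out.append("\uFFFD")
        elif byte < 0x80:
            out.append(chr(byte))
        elif 0x81 <= byte <= 0xFE:
            lead = byte
        else:
            out.append("\uFFFD")
    if lead:
        out.append("\uFFFD")  # dangling lead byte at end of input
    return "".join(out)


def code_points(text):
    """Format a string as a list of U+XXXX labels for easy comparison."""
    return [f"U+{ord(ch):04X}" for ch in text]


data = bytes([0x83, 0x5C, 0x22])
# No mapping for pointer 342: the backslash survives and escapes the quote.
print(code_points(big5_decode(data, {})))
# ['U+FFFD', 'U+005C', 'U+0022']
# A table that maps the pair to a PUA character: the quote stands alone.
print(code_points(big5_decode(data, {342: 0xF00E})))
# ['U+F00E', 'U+0022']
# Suggested behaviour: the whole valid-but-unmapped pair is replaced.
print(code_points(big5_decode(data, {}, consume_unmapped_trail=True)))
# ['U+FFFD', 'U+0022']
```

Note that with the switch on, an invalid trail byte such as `0x22` is still put back, so the earlier "masking" fix is not undone.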

To make things more interesting, that sequence is actually mapped to a PUA character (`U+F00E`) in the Microsoft implementation (https://en.wikipedia.org/wiki/Code_page_950).
Not really relevant to the bug...

----

Considering that there are many Big 5 extensions (and tables), this is quite likely to happen
(https://en.wikipedia.org/wiki/Big5). And most don't even have an IANA registration.

In fact, I discovered this exactly while parsing JSON on the client side (using this algorithm), with data produced on the server side (using the Big 5-HKSCS table by default).

None of these extensions are registered with IANA, so there is no standard way to communicate that information to another client.


-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/171#issuecomment-457315582

Received on Thursday, 24 January 2019 19:00:05 UTC