Re: [whatwg/encoding] Big5 encoding mishandles some trailing bytes, with possible XSS (#171)

> I'm against any more tweaking in general

The more I've examined the issue reported here, the more convinced I am of the following:

* The browser-side security mitigations for legacy Java (Unicode internally, conversion to legacy encoding at output time) and legacy PHP (bytes in, bytes out) server-side architectures are in conflict.
* The present security posture of the Encoding Standard is biased towards addressing the legacy PHP problem.
* We can't actually mitigate the fundamental security problems of the legacy PHP architecture: if we are worried about a rogue byte masking an ASCII byte that is in the trail range, the attacker can simply choose the rogue byte such that it is a valid lead for the ASCII byte acting as trail.
* The issue reported here is a security problem for the legacy Java architecture, since there are encoders out there that can output byte pairs whose lead is in the 0x81 to 0x86 range if the attacker can feed arbitrary Unicode into the system (BMP-only input is enough). (Either codepage 950 encoders emitting those bytes for PUA inputs or Big5-UAO encoders emitting those bytes for Unihan input. JSON in the original report here is a distraction and we should consider JS.)
* The issue reported here is addressable, unlike the fundamental problem of the PHP architecture.
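
To illustrate the third point, here is a minimal sketch using Python's built-in `big5` codec as a stand-in for a browser decoder. The `naive_escape` helper is hypothetical and only models a bytes-in, bytes-out server that escapes quotes without knowing the encoding. It assumes 0xA5 0x5C is a mapped Big5 pair, which it is in the usual Big5 tables.

```python
def naive_escape(data: bytes) -> bytes:
    """Hypothetical byte-level escaping: put a backslash before each quote."""
    return data.replace(b'"', b'\\"')

# The attacker sends a rogue lead byte directly before the quote.
attacker_input = b'\xa5"'

# The server emits 0xA5 0x5C 0x22: rogue byte, backslash, quote.
escaped = naive_escape(attacker_input)

# 0xA5 0x5C is a valid, mapped pair, so the decoder consumes the
# backslash as a trail byte and the quote is left unescaped.
decoded = escaped.decode("big5")
quote_is_escaped = "\\" in decoded
```

Because the pair is legitimately mapped, no decoder-side rule about unmapped pairs can prevent the backslash from being consumed here.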

So at the very least I think we _should tweak_ Big5 decode such that Big5 leads in the range 0x81 to 0x86, inclusive, consume the next byte, too, if it is in the Big5 trail range. I don't yet have an opinion on whether the byte pairs should result in error (U+FFFD) or in 950-consistent PUA code points. (Mixing UAO with HKSCS probably would hurt more than it would help.)
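
A rough sketch of what that tweak could look like for a single lead/trail pair follows. It is not the spec text: `decode_pair`, its `lookup` parameter (a stand-in for the Big5 index) and the `tweak` flag are hypothetical names, and the sketch arbitrarily picks the U+FFFD option rather than PUA output.

```python
REPLACEMENT = "\uFFFD"

def in_big5_trail_range(b: int) -> bool:
    return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE

def decode_pair(lead, trail, lookup, tweak=True):
    """Return (output, bytes_consumed) for `lead` followed by `trail`.

    `lookup` maps a pointer to a character or None (hypothetical index).
    """
    if in_big5_trail_range(trail):
        offset = 0x40 if trail < 0x7F else 0x62
        pointer = (lead - 0x81) * 157 + (trail - offset)
        ch = lookup(pointer)
        if ch is not None:
            return ch, 2
        if tweak and 0x81 <= lead <= 0x86:
            # Proposed behavior: the unmapped pair is consumed as a unit.
            return REPLACEMENT, 2
    if trail < 0x80:
        # Error, but the ASCII byte is left in the stream to be decoded next.
        return REPLACEMENT, 1
    return REPLACEMENT, 2

# Stand-in index with nothing mapped for the pointers used below.
empty_index = lambda pointer: None

without_tweak = decode_pair(0x81, 0x5C, empty_index, tweak=False)
with_tweak = decode_pair(0x81, 0x5C, empty_index, tweak=True)
```

Without the tweak, the 0x5C after lead 0x81 survives as a backslash; with it, the pair is swallowed together, which matches what an encoder that emitted the pair intended.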

Since JIS X 0208 has undergone less extension and the extensions happened ages ago, I think we _probably_ don't need to change Shift_JIS analogously for trails that are in the trail range but unmapped, because there _probably_ aren't any Shift_JIS-ish encoders around that could emit leads 0x82, 0x85, 0x86, 0x88, 0xEB, 0xEF, 0xEC or 0xFC with trail 0x5C for some Unicode input.

The 0x5C issue is moot for EUC-KR and gbk (but for opposite reasons).

> and extremely strongly against introducing any new decoding mapping to PUA (e.g for EUC-KR) in the Unicode.

If we don't change EUC-KR, Shift_JIS and gbk to match the corresponding Windows code pages, I think we should at the very least document how and why encodings that match the Windows code pages for everything other than PUA code points and U+0080 don't match them for PUA and U+0080.

https://github.com/whatwg/encoding/issues/171#issuecomment-462068509

Received on Saturday, 9 February 2019 18:38:55 UTC