- From: Mihai Nita <notifications@github.com>
- Date: Tue, 29 Jan 2019 17:10:24 +0000 (UTC)
- To: whatwg/encoding <encoding@noreply.github.com>
- Cc: Subscribed <subscribed@noreply.github.com>
- Message-ID: <whatwg/encoding/issues/171/458624236@github.com>
> The warning would be for content producers, who also must use UTF-8 per the standard already,
> so they are already in violation of sorts.

The trouble here is that XSS attacks are not coming from "friendly" producers. If we say "don't do this, it is a possible vulnerability", we are just inviting bad actors to abuse it.

> I'm reluctant to make a special case for 0x5C in the general ASCII unmasking policy

I am not advocating special treatment for `5C`. I think that a valid lead byte / trail byte sequence, recognized as such, should be treated as one "unit": either converted to a Unicode character (if there is a mapping) or converted to `FFFD` (if there is no mapping). It should not have one half converted to `FFFD` while the other half is kept.

> I'd be interested in learning what Big5-HKSCS generator can generate byte pairs that the index in
> the Encoding Standard does not have mappings for

Anything that uses [Windows Code page 950](https://en.wikipedia.org/wiki/Code_page_950). So probably most (all?) Windows APIs. I can run some tests, if you want.
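To make the difference between the two policies concrete, here is a minimal Java sketch. It is a simplified illustration, not the Encoding Standard's actual Big5 decoder: the class name `Big5UnitSketch`, the `lookup` callback (which returns -1 for "no mapping"), and the `unmaskAscii` flag are all hypothetical names introduced for this example, and the lead/trail byte ranges are assumed to be 0x81-0xFE and 0x40-0x7E / 0xA1-0xFE.

```java
import java.util.function.IntUnaryOperator;

// Simplified sketch only; names and structure are assumptions, not a real decoder.
class Big5UnitSketch {
    static boolean isLead(int b)  { return b >= 0x81 && b <= 0xFE; }
    static boolean isTrail(int b) { return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE); }

    // lookup: maps (lead << 8 | trail) to a code point, or -1 if there is no mapping.
    // unmaskAscii = true : an unmapped pair yields FFFD and an ASCII trail byte is re-read on its own.
    // unmaskAscii = false: an unmapped pair is consumed as one unit and yields a single FFFD.
    static String decode(byte[] in, IntUnaryOperator lookup, boolean unmaskAscii) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            if (b < 0x80) {                       // plain ASCII byte
                out.append((char) b);
                i++;
            } else if (isLead(b) && i + 1 < in.length && isTrail(in[i + 1] & 0xFF)) {
                int trail = in[i + 1] & 0xFF;
                int cp = lookup.applyAsInt((b << 8) | trail);
                if (cp >= 0) {                    // mapped pair: one character
                    out.appendCodePoint(cp);
                    i += 2;
                } else {                          // valid pair, but no mapping
                    out.append('\uFFFD');
                    i += (unmaskAscii && trail < 0x80) ? 1 : 2;
                }
            } else {                              // stray or truncated lead byte
                out.append('\uFFFD');
                i++;
            }
        }
        return out.toString();
    }

    // Simple char-by-char hex dump, e.g. "FFFD 005C".
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0x83, (byte) 0x5C };
        IntUnaryOperator noMapping = pair -> -1;  // hypothetical index with no entry for 83 5C
        System.out.println(hex(decode(bytes, noMapping, true)));   // half converted, half kept
        System.out.println(hex(decode(bytes, noMapping, false)));  // pair treated as one unit
    }
}
```

With an index that has no entry for `83 5C`, the first call leaves a bare `5C` (backslash) in the output, while the second replaces the whole pair with a single `FFFD`.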
And also ICU:

```
@Test
public void testBig5() {
    byte[] bytes = { (byte) 0x83, (byte) 0x5C };
    String charsetName = "Big5";

    // JDK charset registered as "Big5"
    java.nio.charset.Charset cs = java.nio.charset.Charset.forName(charsetName);
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));

    // JDK charset registered as "cp950"
    cs = java.nio.charset.Charset.forName("cp950");
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));

    // ICU4J charset for "Big5"
    cs = com.ibm.icu.charset.CharsetICU.forNameICU(charsetName);
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));
}
```

The code above produces:

```
Big5 : [csBig5]
FFFD 005C
x-IBM950 : [cp950, ibm950, 950, ibm-950]
F00E
Big5 : [windows-950, csBig5, x-big5, Big5, x-windows-950, windows-950-2000, ms950]
F00E
```

(The `hex` method does a simple char-by-char hex dump; I did not include it here.)

Note that Java considers Big5 and cp950 to be different charsets, but ICU4J considers them aliases.

-----

I totally understand the reluctance to change a spec, and to change implementations. We would all like to produce code and specs that are perfect, bug free, and never need updating. But this is the reality of things: we find problems, and we should fix them.

I would really hate to see some exploit based on something like this a few months down the line. Double-byte character sets (DBCS) have been a source of bugs and exploits for a long time.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/171#issuecomment-458624236
Received on Tuesday, 29 January 2019 17:10:49 UTC