Re: [whatwg/encoding] Big5 encoding mishandles some trailing bytes, with possible XSS (#171) from Mihai Nita on 2019-01-29 (public-webapps-github@w3.org from January 2019)

From: Mihai Nita <notifications@github.com>
Date: Tue, 29 Jan 2019 17:10:24 +0000 (UTC)
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/171/458624236@github.com>

> The warning would be for content producers, who also must use UTF-8 per the standard already,
> so they are already in violation of sorts.

The trouble here is that XSS attacks are not coming from "friendly" producers. If we say "don't do this, it is a possible vulnerability", we are just inviting bad actors to abuse it.


> I'm reluctant to make a special case for 0x5C in the general ASCII unmasking policy

I am not advocating special treatment for `5C`.
I think that a valid lead byte / trail byte sequence, recognized as such, should be treated as one "unit": either converted to a Unicode character (if there is a mapping), or converted to `FFFD` if there is no mapping.
It should not have one half converted (to `FFFD`) while the other half is kept.
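
To make the difference concrete, here is a minimal sketch of the two policies. It is not the Encoding Standard's algorithm or any browser's code: the class name, the `pairAsUnit` flag, and the one-entry `TOY_INDEX` table are made up for illustration, and the byte ranges (lead `0x81`-`0xFE`, trail `0x40`-`0x7E` and `0xA1`-`0xFE`) are my reading of the spec.

```java
import java.util.Map;

public class PairAsUnitSketch {
    // Hypothetical one-entry index, key = (lead << 8) | trail. A real index has thousands of entries.
    static final Map<Integer, Character> TOY_INDEX = Map.of(0xA440, '\u4E00');

    static boolean isLead(int b) { return b >= 0x81 && b <= 0xFE; }

    static boolean isTrail(int b) {
        return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
    }

    // pairAsUnit = false: an unmapped pair with an ASCII trail byte yields FFFD and keeps the ASCII byte.
    // pairAsUnit = true:  a valid lead/trail pair is always consumed whole, mapped or not.
    static String decode(byte[] in, boolean pairAsUnit) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            if (b < 0x80) {                       // plain ASCII
                out.append((char) b);
                i++;
                continue;
            }
            if (!isLead(b) || i + 1 >= in.length) { // stray byte, or lead byte at end of input
                out.append('\uFFFD');
                i++;
                continue;
            }
            int t = in[i + 1] & 0xFF;
            boolean validPair = isTrail(t);
            Character mapped = validPair ? TOY_INDEX.get((b << 8) | t) : null;
            if (mapped != null) {
                out.append(mapped);
                i += 2;
                continue;
            }
            out.append('\uFFFD');
            // The ASCII trail byte survives unless the pair is valid and we treat pairs as units.
            boolean keepTrail = t < 0x80 && !(pairAsUnit && validPair);
            i += keepTrail ? 1 : 2;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] input = { (byte) 0x83, (byte) 0x5C };
        for (boolean pairAsUnit : new boolean[] { false, true }) {
            StringBuilder hex = new StringBuilder();
            for (char c : decode(input, pairAsUnit).toCharArray()) {
                hex.append(String.format(" %04X", (int) c));
            }
            System.out.println("pairAsUnit=" + pairAsUnit + ":" + hex);
        }
        // prints:
        // pairAsUnit=false: FFFD 005C
        // pairAsUnit=true: FFFD
    }
}
```

With `pairAsUnit` set to `false` the backslash survives the decode, which is the case I am worried about; with `true` the unmapped pair becomes a single `FFFD`.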


> I'd be interested in learning what Big5-HKSCS generator can generate byte pairs that the index in
> the Encoding Standard does not have mappings for

Anything that uses [Windows Code page 950](https://en.wikipedia.org/wiki/Code_page_950). So probably most (all?) Windows APIs.
I can run some tests, if you want.

And also ICU:

```
  @Test
  public void testBig5() {
    // 0x83 is in the lead-byte range; 0x5C is a valid trail byte that is also ASCII backslash.
    byte[] bytes = { (byte) 0x83, (byte) 0x5C };
    String charsetName = "Big5";

    // JDK charset registered as "Big5"
    java.nio.charset.Charset cs = java.nio.charset.Charset.forName(charsetName);
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));

    // JDK charset registered as "cp950"
    cs = java.nio.charset.Charset.forName("cp950");
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));

    // ICU4J charset for the same name
    cs = com.ibm.icu.charset.CharsetICU.forNameICU(charsetName);
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));
  }
```
The code above produces:
```
Big5 : [csBig5]
 FFFD 005C
x-IBM950 : [cp950, ibm950, 950, ibm-950]
 F00E
Big5 : [windows-950, csBig5, x-big5, Big5, x-windows-950, windows-950-2000, ms950]
 F00E
```

(the `hex` method does a simple char by char hex dump, I did not include it here)
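
For anyone who wants to reproduce this, a helper along these lines should give output in the same shape. This is a sketch written to match the output format shown above, not the original method; the class name `HexDump` is arbitrary.

```java
public class HexDump {
    // Dumps each UTF-16 code unit as four uppercase hex digits, each preceded by a space.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(' ').append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\uFFFD\\")); // prints " FFFD 005C"
        System.out.println(hex("\uF00E"));   // prints " F00E"
    }
}
```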

We notice that Java considers Big5 and cp950 to be different charsets, but ICU4J considers them aliases.

-----

I totally understand the reluctance to change a spec, and to change implementations.
We would all like to produce code / specs that are perfect, bug free, and never need updating.
But this is the reality of things: we find problems, we should fix them.

I would really hate to see some exploit based on something like this a few months down the line...
Double-byte character sets (DBCS) have been a source of bugs and exploits for a long time.


-- 
View it on GitHub:
https://github.com/whatwg/encoding/issues/171#issuecomment-458624236

Received on Tuesday, 29 January 2019 17:10:49 UTC