[whatwg/encoding] End-of-queue during decoding of GB18030 should not mask ASCII characters. (#253)

https://encoding.spec.whatwg.org/commit-snapshots/4d54adce6a871cb03af3a919cbf644a43c22301a/#gb18030-decoder


> If byte is end-of-queue, and gb18030 first, gb18030 second, or gb18030 third is not 0x00, set gb18030 first, gb18030 second, and gb18030 third to 0x00, and return error.

I think this violates the requirements in the [Security Background section](https://encoding.spec.whatwg.org/#security-background):
> Decoders of encodings that use multiple bytes for scalar values now require that in case of an illegal byte combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”.

In particular, by my reading of the sentence quoted above, the input sequence 0x81 0x30 should produce U+FFFD U+0030, but according to the specification only a single U+FFFD is produced.

The Rust crate `encoding_rs` agrees with the specification:
```rust
use encoding_rs::*;

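fn main() {
    // Decode a lead byte followed by an ASCII digit, then end of input.
    let (output, replacements) = GB18030.decode_without_bom_handling(&[0x81, 0x30]);
    // At least one malformed sequence was replaced.
    assert!(replacements);
    // Fails: the decoder returns only "\u{FFFD}"; the '0' is dropped.
    assert_eq!(output, "\u{FFFD}0");
}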
```
The second assertion fails because the output contains only a single U+FFFD.

Chrome, Firefox, and Safari all agree with `encoding_rs`. E.g., appending the byte sequence 0x81 0x30 to
```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset=gb18030>
  </head>
  <body>
    <span id=bug>
```
results in the `span` element containing just U+FFFD.

Maybe this masking of an ASCII character at the end of the input is fine, and the Security Background section should instead be updated to note that fact.

A similar issue arises with the byte sequence 0x81 0x30 0x81, which I'd expect to decode to U+FFFD U+0030 U+FFFD, but which instead decodes to a single U+FFFD.
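
To make the difference concrete, here is a rough, dependency-free sketch of the decoder's buffering logic with two end-of-queue policies. It is not the spec algorithm or the `encoding_rs` implementation: `decode_sketch`, `AtEnd`, and `LOOKUP` are names made up for illustration, index lookups are replaced by a stand-in character (so two-byte and four-byte lookup failures are not modeled), and the `Reprocess` policy is only one hypothetical way to get the output I expected.

```rust
use std::collections::VecDeque;

/// What to do with buffered bytes when the input ends mid-sequence.
#[derive(Clone, Copy, PartialEq)]
enum AtEnd {
    /// The quoted spec step: drop all buffered bytes, emit one error.
    Discard,
    /// Hypothetical alternative: emit one error for the lead byte and
    /// decode the remaining buffered bytes again.
    Reprocess,
}

/// Stand-in for any sequence the real decoder would look up in an index.
const LOOKUP: char = '\u{25A1}';

fn decode_sketch(input: &[u8], at_end: AtEnd) -> String {
    let mut queue: VecDeque<u8> = input.iter().copied().collect();
    let mut pending: Vec<u8> = Vec::new(); // gb18030 first, second, third
    let mut out = String::new();

    loop {
        let byte = match queue.pop_front() {
            Some(b) => b,
            None if pending.is_empty() => break,
            None => {
                // End-of-queue with bytes still buffered.
                out.push('\u{FFFD}');
                let rest = pending.split_off(1);
                pending.clear();
                if at_end == AtEnd::Reprocess {
                    for b in rest.into_iter().rev() {
                        queue.push_front(b);
                    }
                }
                continue;
            }
        };
        let is_digit = (0x30..=0x39).contains(&byte);
        match pending.len() {
            0 => match byte {
                0x00..=0x7F => out.push(byte as char),
                0x80 => out.push('\u{20AC}'),
                0x81..=0xFE => pending.push(byte),
                0xFF => out.push('\u{FFFD}'),
            },
            1 if is_digit => pending.push(byte),
            1 => {
                // Two-byte sequence: index lookup omitted in this sketch.
                pending.clear();
                out.push(LOOKUP);
            }
            2 if (0x81..=0xFE).contains(&byte) => pending.push(byte),
            3 if is_digit => {
                // Complete four-byte sequence: ranges lookup omitted.
                pending.clear();
                out.push(LOOKUP);
            }
            _ => {
                // Mid-stream error: everything after the lead byte goes
                // back on the queue, so ASCII is not masked here.
                queue.push_front(byte);
                for b in pending.drain(1..).rev() {
                    queue.push_front(b);
                }
                pending.clear();
                out.push('\u{FFFD}');
            }
        }
    }
    out
}
```

With `AtEnd::Discard`, both `[0x81, 0x30]` and `[0x81, 0x30, 0x81]` come out as a single U+FFFD, matching the behavior described above. With `AtEnd::Reprocess`, they come out as U+FFFD U+0030 and U+FFFD U+0030 U+FFFD respectively. A mid-stream error such as `[0x81, 0x30, 0x41]` yields U+FFFD followed by `0A` under either policy, since the mid-stream steps already put the buffered bytes back.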

https://github.com/whatwg/encoding/issues/253

Received on Wednesday, 24 February 2021 17:37:16 UTC