[whatwg/encoding] Bug in TextDecoderStream around processing the end of stream. (#263)

While working on denoland/deno#10842, I noticed a bug with the "flush and enqueue" algorithm in `TextDecoderStream`.

`TextDecoderStream`'s "decode and enqueue a chunk" algorithm essentially performs `TextDecoder.prototype.decode(chunk, {stream: true})`, and then "flush and enqueue" should perform the final `TextDecoder.prototype.decode()` to emit a replacement character if the input stream was cut short. So you might expect "flush and enqueue" to be defined like this:

> 1. Let `output` be the I/O queue of scalar values « end-of-queue ».
> 1. While true:
>    1. Let `item` be the result of reading from `decoder`'s I/O queue.
>    1. Let `result` be the result of processing an item with `item`, `decoder`'s decoder, `decoder`'s I/O queue, `output`, and `decoder`'s error mode.
>    1. If `result` is finished, then:
>       1. Let `outputChunk` be the result of running serialize I/O queue with `decoder` and `output`.
>       1. If `outputChunk` is non-empty, then enqueue `outputChunk` in `decoder`'s transform.
>       1. Return.
>    1. Otherwise, if `result` is error, throw a `TypeError`.

But it's instead:

> 1. Let `output` be the I/O queue of scalar values « end-of-queue ».
> 1. Let `result` be the result of processing an item with end-of-queue, `decoder`'s decoder, `decoder`'s I/O queue, `output`, and `decoder`'s error mode.
> 1. If `result` is finished, then:
>    1. Let `outputChunk` be the result of running serialize I/O queue with `decoder` and `output`.
>    1. If `outputChunk` is non-empty, then enqueue `outputChunk` in `decoder`'s transform.
> 1. Otherwise, throw a `TypeError`.

These are not equivalent, because "process an item" can return finished, error, or continue. In the continue case, the first version moves on to the next iteration of the loop, whereas the second version throws. What's more, when "process an item" returns finished, `output` won't have changed during that call, so the single-call version can never enqueue any output, even if it didn't throw in the continue case.
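To make the difference concrete, here is a minimal sketch of the two definitions quoted above. Everything in it is an assumption made for illustration: `END`, `processItem`, `makeTruncatedHandler`, `flushWithLoop` and `flushSingleCall` are hypothetical names, the I/O queue is modelled as a plain array (an empty array stands for « end-of-queue »), "serialize I/O queue" is reduced to building a string, and the handler models only the end-of-stream branch of the UTF-8 decoder.

```javascript
const END = Symbol("end-of-queue");

// Simplified "process an item" for decoders. Returns "finished",
// "continue", or "error" (the latter only in "fatal" mode).
function processItem(item, handler, input, output, mode) {
  const result = handler(input, item);
  if (result === "finished") return "finished";
  if (Array.isArray(result)) {
    output.push(...result);
  } else if (result === "error") {
    if (mode === "fatal") return "error";
    output.push(0xfffd); // "replacement" mode emits U+FFFD and carries on
  }
  return "continue";
}

// Hypothetical handler: models a UTF-8 decoder that still expects
// `bytesNeeded` trailing bytes when the stream ends.
function makeTruncatedHandler(bytesNeeded) {
  return (input, item) => {
    if (item === END && bytesNeeded !== 0) {
      bytesNeeded = 0;
      return "error";
    }
    if (item === END) return "finished";
    throw new Error("not modelled: further input bytes");
  };
}

// First definition: loop until the handler reports finished.
function flushWithLoop(handler, input, mode) {
  const output = [];
  while (true) {
    const item = input.length > 0 ? input.shift() : END;
    const result = processItem(item, handler, input, output, mode);
    if (result === "finished") return String.fromCodePoint(...output);
    if (result === "error") throw new TypeError("decoding failed");
  }
}

// Second definition: a single call with end-of-queue, throwing on
// anything other than finished.
function flushSingleCall(handler, input, mode) {
  const output = [];
  const result = processItem(END, handler, input, output, mode);
  if (result === "finished") return String.fromCodePoint(...output);
  throw new TypeError("decoding failed");
}

// The looping version yields "\uFFFD" for a truncated sequence...
const looped = flushWithLoop(makeTruncatedHandler(1), [], "replacement");

// ...while the single-call version throws, even in "replacement" mode.
let singleCallError = null;
try {
  flushSingleCall(makeTruncatedHandler(1), [], "replacement");
} catch (e) {
  singleCallError = e;
}
console.log(JSON.stringify(looped), singleCallError instanceof TypeError);
```

In this model, the single-call version can only ever return the empty string or throw, which matches the observation that no output can be enqueued in the single-iteration case.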

-----

But maybe we can refactor things while still skipping the loop, since even when the first call to "process an item" returns continue, the decoder's state is set such that the next call returns finished, right? That is true for most encodings but, of course, ISO-2022-JP is the exception.

<details>
Take the I/O queue of bytes « 0x1B, 0x24, end-of-queue » as the input to the ISO-2022-JP decoder. After processing the first two items, the state of the I/O queues and the decoder is:

> Input I/O queue: « end-of-queue »
> Output I/O queue: « end-of-queue »
> ISO-2022-JP decoder state: Escape
> ISO-2022-JP decoder output state: ASCII
> ISO-2022-JP lead: 0x24
> ISO-2022-JP output: false

The next call to "process an item" would pass end-of-queue as the item, which would cause the decoder's handler to prepend 0x24 and end-of-queue to the input I/O queue. (Prepending end-of-queue is in fact illegal, but in this case that end-of-queue item can't block access to further non-end-of-queue items, so let's just ignore that for now.) Assuming we're using the `"replacement"` error mode:

> Input I/O queue: « 0x24, end-of-queue, end-of-queue »
> Output I/O queue: « U+FFFD, end-of-queue »
> ISO-2022-JP decoder state: ASCII
> ISO-2022-JP decoder output state: ASCII
> ISO-2022-JP lead: 0x24
> ISO-2022-JP output: false

Calling it again gets us:

> Input I/O queue: « end-of-queue, end-of-queue »
> Output I/O queue: « U+FFFD, U+0024, end-of-queue »
> ISO-2022-JP decoder state: ASCII
> ISO-2022-JP decoder output state: ASCII
> ISO-2022-JP lead: 0x24
> ISO-2022-JP output: false

And it's only in the next call to "process an item" that it returns finished.
</details>

So if we only run the first iteration of the loop (or if in the loop we only call "process an item" with end-of-queue as input, assuming the decoder's handler hasn't modified the input I/O queue), the result we would get from decoding the byte sequence 0x1B 0x24 as ISO-2022-JP would be the string consisting of the single code point U+FFFD, rather than U+FFFD U+0024.
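The ISO-2022-JP case can be replayed with a small model. This is a sketch under stated assumptions, not the real decoder: `makeHandler`, `processItem` and `decode` are hypothetical names, only the ASCII, escape start and escape states are modelled, the error mode is always `"replacement"`, and the I/O queue is an array whose empty state stands for « end-of-queue » (so, unlike the trace above, end-of-queue is never prepended). The `loop` option switches between the looping flush and a single "process an item" call whose output is serialized even when the result is continue.

```javascript
const END = Symbol("end-of-queue");

// Hypothetical, heavily reduced ISO-2022-JP decoder handler.
function makeHandler() {
  let state = "ascii";
  let lead = 0x00;
  return (input, byte) => {
    if (state === "ascii") {
      if (byte === END) return "finished";
      if (byte === 0x1b) {
        state = "escape start";
        return "continue";
      }
      if (byte <= 0x7f && byte !== 0x0e && byte !== 0x0f) return [byte];
      return "error";
    }
    if (state === "escape start") {
      if (byte === 0x24 || byte === 0x28) {
        lead = byte;
        state = "escape";
        return "continue";
      }
      if (byte !== END) input.unshift(byte); // restore byte
      state = "ascii";
      return "error";
    }
    // state === "escape"
    const valid =
      (lead === 0x28 && [0x42, 0x4a, 0x49].includes(byte)) ||
      (lead === 0x24 && [0x40, 0x42].includes(byte));
    if (valid) throw new Error("not modelled: valid escape sequences");
    // Invalid escape: restore lead and byte, fall back to ASCII, report an error.
    if (byte !== END) input.unshift(byte);
    input.unshift(lead);
    state = "ascii";
    return "error";
  };
}

// "process an item" in "replacement" mode only.
function processItem(item, handler, input, output) {
  const result = handler(input, item);
  if (result === "finished") return "finished";
  if (Array.isArray(result)) output.push(...result);
  else if (result === "error") output.push(0xfffd);
  return "continue";
}

function decode(bytes, { loop }) {
  const handler = makeHandler();
  const input = [...bytes];
  const output = [];
  // Chunk phase: consume the bytes, but never process end-of-queue.
  while (input.length > 0) processItem(input.shift(), handler, input, output);
  // Flush phase: either loop until finished, or stop after one call.
  do {
    const item = input.length > 0 ? input.shift() : END;
    if (processItem(item, handler, input, output) === "finished") break;
  } while (loop);
  return String.fromCodePoint(...output);
}

const withLoop = decode([0x1b, 0x24], { loop: true }); // "\uFFFD$"
const singleIteration = decode([0x1b, 0x24], { loop: false }); // "\uFFFD"
console.log(JSON.stringify(withLoop), JSON.stringify(singleIteration));
```

With `loop: false`, the restored 0x24 is left sitting in the input queue and never decoded, which is where the missing U+0024 goes.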

-------------

The implementations of `TextDecoderStream` in Chromium and WebKit seem to agree with the behavior of `TextDecoder`, both in streaming and non-streaming mode. But Chromium seems to have bugs around decoding valid but cut-off byte sequences. For example, `new Uint8Array([0xC3, 0xA1, 0xF0, 0x9F, 0x92])` UTF-8 decodes in Chrome to `"á���"`, rather than to `"á�"`. And in the ISO-2022-JP case, `new Uint8Array([0x61, 0x1B, 0x24])` decodes in Chrome to `"a�"`, which is consistent with a single iteration of the loop, rather than the expected `"a�$"`. The former bug doesn't seem to be covered by any of the current WPT tests, but the latter _is_ covered by the "Error ESC" test in [`encoding/iso-2022-jp-decoder.any.js`](https://wpt.fyi/results/encoding/iso-2022-jp-decoder.any.html).
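For anyone who wants to check the UTF-8 case in their own runtime, a snippet along these lines should do. It assumes a global `TextDecoder` (browsers, Deno, recent Node.js); `toCodePoints` is a helper introduced here purely for display. The streaming half mirrors what `TextDecoderStream` is meant to do: one `decode(chunk, {stream: true})` per chunk, then a final argument-less `decode()`.

```javascript
// Bytes from the report: "á" (0xC3 0xA1) followed by the first three
// bytes of a four-byte UTF-8 sequence.
const bytes = new Uint8Array([0xc3, 0xa1, 0xf0, 0x9f, 0x92]);

// Non-streaming decode: the whole input is handled in one call.
const oneShot = new TextDecoder("utf-8").decode(bytes);

// Streaming decode: the per-chunk call...
const decoder = new TextDecoder("utf-8");
const streamed = decoder.decode(bytes, { stream: true });
// ...followed by the final flush.
const flushed = decoder.decode();

// Print code points so that the number of U+FFFD is easy to count.
const toCodePoints = (s) =>
  [...s].map(
    (c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );

console.log("one-shot: ", toCodePoints(oneShot).join(" "));
console.log("streaming:", toCodePoints(streamed + flushed).join(" "));
```

Per the expected result stated above, a conforming implementation should print U+00E1 followed by a single U+FFFD on both lines; three U+FFFD would indicate the Chromium behavior described here.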


Received on Friday, 4 June 2021 21:08:17 UTC