- From: Henri Sivonen <notifications@github.com>
- Date: Sun, 15 Nov 2015 03:55:51 -0800
- To: whatwg/encoding <encoding@noreply.github.com>
- Message-ID: <whatwg/encoding/issues/14/156807104@github.com>
I think requiring the caller to pass data in chunks of valid UTF-16 is a feature and not a bug. As you note, it allows us not to have a separate streaming mode on the encoder side (thanks to ISO-2022-JP not being supported on the `TextEncoder` side).

Additionally, accommodating strings that are not self-contained valid UTF-16 strings would be a step backwards in terms of steering the Web Platform in a direction that would allow browsers to use UTF-8 strings internally (except in the JS engine when a program manipulates a string by 16-bit units). Some years ago, when I argued for `document.write` to take 16-bit code units instead of valid UTF-16 strings, getting rid of UTF-16 as an internal representation seemed hopeless. However, Servo gives me hope that we might be able to fix the design error of using UTF-16 as the browser-internal memory representation and use UTF-8 in the future. The least we can do on the spec side is to avoid adding new places that expose the internal memory representation of Unicode strings.

Furthermore, I have recently worked on a decoder that tries to fill `char16_t` output buffers fully, even if that means an astral character gets split across a buffer boundary, and on an encoder that tries to work properly in the face of invalid input (as if unpaired surrogates had been replaced with U+FFFD in the input). As a result, I've come to especially appreciate Rust's notion of making UTF-8 validity guarantees part of the core notion of safety of the language itself. To the extent we are stuck with using UTF-16 as the browser-internal representation, I think we would benefit from enforcing UTF-16 validity at the boundary between the JS engine and the rest of the browser, so that code outside the JS engine can be written with the assumption that UTF-16 sequences are always valid (as opposed to sprinkling unpaired surrogate handling all over the code base).

For these reasons, I think we should close this as "won't fix".
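To illustrate the point about enforcing validity at the boundary, here is a minimal Rust sketch using only the standard library. The function names `sanitize_at_boundary` and `validate_at_boundary` are hypothetical and do not come from any browser code base; they only show how a chunk that ends in the middle of a surrogate pair stops being valid on its own, and how replacing unpaired surrogates with U+FFFD once at the boundary gives downstream code a string that is always valid.

```rust
/// Hypothetical boundary function: converts 16-bit code units (as a JS engine
/// might hand them over) into a Rust `String`, which is always valid UTF-8.
/// Unpaired surrogates are replaced with U+FFFD here, once, so code past this
/// point never has to handle them.
fn sanitize_at_boundary(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Hypothetical strict variant: returns `None` when the input is not a
/// self-contained valid UTF-16 string.
fn validate_at_boundary(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

fn main() {
    // "h" followed by U+1F600, which UTF-16 encodes as the pair D83D DE00.
    let whole: [u16; 3] = [0x0068, 0xD83D, 0xDE00];

    // The same code units split between the two halves of the surrogate pair.
    let first: [u16; 2] = [0x0068, 0xD83D];
    let second: [u16; 1] = [0xDE00];

    // A self-contained valid chunk converts without loss.
    assert_eq!(sanitize_at_boundary(&whole), "h\u{1F600}");

    // Each half of the split pair is an unpaired surrogate in its own chunk,
    // so each one is replaced with U+FFFD.
    assert_eq!(sanitize_at_boundary(&first), "h\u{FFFD}");
    assert_eq!(sanitize_at_boundary(&second), "\u{FFFD}");

    // The strict variant rejects the chunks that are not valid on their own.
    assert!(validate_at_boundary(&whole).is_some());
    assert!(validate_at_boundary(&first).is_none());
    assert!(validate_at_boundary(&second).is_none());
}
```

A caller that wants the astral character preserved has to split its input on a code point boundary before handing over each chunk, which is the requirement defended above.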
--- Reply to this email directly or view it on GitHub: https://github.com/whatwg/encoding/issues/14#issuecomment-156807104
Received on Sunday, 15 November 2015 11:56:18 UTC