Re: [whatwg/encoding] Consider adding TextEncoder.containsLoneSurrogates() static (#174) from Ingvar Stepanyan on 2019-04-08 (public-webapps-github@w3.org from April 2019)

From: Ingvar Stepanyan <notifications@github.com>
Date: Mon, 08 Apr 2019 09:06:14 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/174/480895738@github.com>

> To be clear, unpaired surrogates are invalid UTF-16, and so anything that generates them is (in my opinion) a bug.

JavaScript doesn't claim UTF-16 compatibility though: its strings are arbitrary sequences of 16-bit code units. So it's not really a bug but part of the language, and IMO such strings should be taken into account in any API that interacts with JS (especially encoders/decoders).

As mentioned above, I usually work with parsers in JavaScript. However, even without going into parsers that parse JavaScript itself, the simplest example can be created with just the built-in `JSON.parse` API:

```js
> JSON.parse(String.raw`"\uD800"`)
"\ud800"
```

Here the input is perfectly valid UTF-8 / UTF-16, but the output is not, so it has to be checked when passing the string to WASM or writing it to disk; otherwise you risk silent data loss that is not trivial to debug.
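To make the data loss concrete, here is a minimal sketch of what happens when that parsed string goes through today's `TextEncoder` (the variable names are just for illustration). The lone surrogate is replaced with U+FFFD during encoding and no error is raised:

```js
// TextEncoder and TextDecoder are globals in browsers and in recent Node.js.
const parsed = JSON.parse(String.raw`"\uD800"`); // a single lone high surrogate

// encode() converts the string to scalar values first, so the lone
// surrogate becomes U+FFFD (bytes EF BF BD) without any error.
const bytes = new TextEncoder().encode(parsed);
const roundTripped = new TextDecoder().decode(bytes);

console.log(Array.from(bytes, (b) => b.toString(16))); // [ 'ef', 'bf', 'bd' ]
console.log(roundTripped === parsed); // false: the original code unit is gone
console.log(roundTripped === "\uFFFD"); // true
```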

Hence my suggestion to enhance `TextEncoder` to check for lone surrogates automatically and throw an error before the string even crosses the boundary or any data is lost.
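For illustration, this is roughly what such a check looks like when done by hand in userland today. Both `containsLoneSurrogates` and `encodeStrict` are hypothetical helper names used only for this sketch, not an existing or proposed API shape:

```js
// Hypothetical helper: returns true if `str` has a surrogate code unit
// that is not part of a valid high/low pair.
function containsLoneSurrogates(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be immediately followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of the valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// Hypothetical wrapper: refuse to encode instead of silently substituting U+FFFD.
function encodeStrict(str) {
  if (containsLoneSurrogates(str)) {
    throw new TypeError("string contains a lone surrogate");
  }
  return new TextEncoder().encode(str);
}
```

The downside is that this walks the string once for the check and again for the encoding, which is why doing it inside the encoder itself would be preferable.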

https://github.com/whatwg/encoding/issues/174#issuecomment-480895738

Received on Monday, 8 April 2019 16:06:56 UTC