Re: [whatwg/encoding] Consider adding TextEncoder.containsLoneSurrogates() static (#174)

> JavaScript doesn't claim UTF-16 compatibility though, so it's not really a bug, but rather part of the language 

I'm well aware of that. It doesn't change the fact that breaking UTF-16 compatibility is a *really* bad idea, which is why I said that it is (in my opinion) a bug.

And as @hsivonen has said, Firefox's experience suggests that the web generally doesn't generate or interact with unpaired surrogates, since almost no JS API produces them.

So even though *technically* JS isn't UTF-16, in practice it is, because nobody actually generates unpaired surrogates.

> Here the input is perfectly valid UTF-8 / UTF-16

I think that's debatable.

Sure, from the perspective of the consumer, before they run `JSON.parse` it appears to be valid UTF-16.

But from the perspective of the producer, they had a string that was invalid UTF-16, and then they called `JSON.stringify` (or similar) on it and sent the result to the consumer.

So I would say that is a bug in the producer, since they should never have created an invalid string in the first place.
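
To make the producer/consumer scenario concrete, here is a minimal sketch. The variable names are made up for illustration, and the exact serialized form depends on the engine: engines that implement well-formed `JSON.stringify` emit an ASCII escape such as `"\ud83d"`, while older ones emit the raw code unit. Either way, `JSON.parse` hands the consumer the same invalid string the producer started with.

```typescript
// Producer side: a string holding a single unpaired high surrogate.
// (How it got there is exactly the open question in this thread.)
const producerString = "\uD83D";

// The producer serializes it before sending it across the boundary.
const wireText = JSON.stringify(producerString);

// Consumer side: parsing succeeds and yields the ill-formed string again.
const consumerString = JSON.parse(wireText) as string;

// The round trip preserves the lone surrogate rather than rejecting it.
const roundTrips = consumerString === producerString;
const stillLone =
  consumerString.length === 1 && consumerString.charCodeAt(0) === 0xd83d;

console.log({ roundTrips, stillLone });
```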

Basically, except in contrived examples, *somebody* messed up and generated an invalid string. And so it's their responsibility to fix that.

So I'm asking for non-contrived examples of where unpaired surrogates were generated.

> Hence my suggestion to enhance TextEncoder to check for lone surrogates automatically and throw the error before even crossing the boundary or losing the data.

Like I said earlier, I think that's a good idea, but it's orthogonal to `TextEncoder.containsLoneSurrogates`, so I think you should advocate for that in a new issue.
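
To illustrate why the two ideas are separate, here is a rough sketch. Both function names are hypothetical and neither is part of the Encoding Standard: `hasLoneSurrogates` stands in for the proposed static check, and `encodeOrThrow` stands in for the "throw before crossing the boundary" behavior. The first lines show what `TextEncoder` does today, which is to replace an unpaired surrogate with U+FFFD.

```typescript
// Current behavior: a lone surrogate is encoded as U+FFFD (bytes EF BF BD).
const lossyBytes = new TextEncoder().encode("\uD83D");

// Hypothetical stand-in for the proposed check: true if `s` contains a
// surrogate code unit that is not part of a valid high+low pair.
function hasLoneSurrogates(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: valid only if a low surrogate follows.
      // charCodeAt past the end returns NaN, so the comparison fails.
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of the valid pair
      } else {
        return true;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

// Hypothetical strict wrapper: refuse to encode instead of losing data.
function encodeOrThrow(s: string): Uint8Array {
  if (hasLoneSurrogates(s)) {
    throw new TypeError("string contains a lone surrogate");
  }
  return new TextEncoder().encode(s);
}
```

With the check alone the caller decides what to do with a bad string; with the wrapper the decision is baked into the encoding step.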

Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/174#issuecomment-480915431

Received on Monday, 8 April 2019 17:00:42 UTC