Re: [FileAPI, common] UTF-16 to UTF-8 conversion

From: Jonas Sicking <jonas@sicking.cc>
Date: Tue, 28 Feb 2012 15:26:51 +0100
Message-ID: <CA+c2ei8Sc530PH+5XSTQg1AwDXTQKExAYqbAqu1gEP-ag_75aQ@mail.gmail.com>
To: Simon Pieters <simonp@opera.com>
Cc: Arun Ranganathan <aranganathan@mozilla.com>, Glenn Maynard <glenn@zewt.org>, Eric U <ericu@google.com>, public-webapps@w3.org
On Tue, Feb 28, 2012 at 1:57 PM, Simon Pieters <simonp@opera.com> wrote:
>> My
>> preference would be to deal with them by encoding them to U+FFFD for
>> the same reason that we let the HTML parser do error recovery rather
>> than XML-style draconian error handling.
>
> I'm not really opposed to making APIs use U+FFFD instead of exception, but
> I'm not entirely convinced, either. If people use binary data in strings and
> want to use them in these APIs, U+FFFDing lone surrogates is going to
> "silently" scramble their data. Why is this better than throwing an
> exception?

I'm not so much worried that people will store binary data and then
attempt to send it as text. I'm more worried that people will do
things like cut a string into parts and send the parts separately, or
have bugs in some search'n'replace code that create invalid
surrogates, and then send the resulting strings over a WebSocket. The
error conditions would be very "intermittent", since they would depend
entirely on the data being processed (which could be user-provided)
and so might not reproduce easily for the developer.

I agree that it "scrambles" the data. But no more than the HTML
parser's error recovery does. And if an unexpected exception is
thrown, the likely result is data loss, which is not obviously better
than scrambling part of the data.

/ Jonas
Received on Tuesday, 28 February 2012 14:27:49 GMT