Re: [FileAPI, common] UTF-16 to UTF-8 conversion from Glenn Maynard on 2012-02-29 (public-webapps@w3.org from January to March 2012)

From: Glenn Maynard <glenn@zewt.org>
Date: Tue, 28 Feb 2012 18:07:05 -0600
To: Jonas Sicking <jonas@sicking.cc>
Cc: Simon Pieters <simonp@opera.com>, Arun Ranganathan <aranganathan@mozilla.com>, Eric U <ericu@google.com>, public-webapps@w3.org
Message-ID: <CABirCh_uK-MaqexN+q9wRqRTDHY6rj8Mzobhzx7Mw7SY9HMV-w@mail.gmail.com>

On Tue, Feb 28, 2012 at 12:11 AM, Simon Pieters <simonp@opera.com> wrote:

> I think WebSocket should do the same, for the same reason.
>
> Have you filed a bug?

(No, not until this conversation moves along a bit further.)

On Tue, Feb 28, 2012 at 8:26 AM, Jonas Sicking <jonas@sicking.cc> wrote:

> I agree that it "scrambles" the data. But no more than the HTML parser
> error recovery does. And if an unexpected exception is thrown then the
> result is likely data loss, which is not obviously better than
> scrambling part of the data.
>

I'd say it's weaker than "scrambles", actually, at least with
human-readable text.  Replacing one character with U+FFFD usually results
in an isolated glitch that a reader can recover from.  (I do this regularly
when reading the HTML spec, which uses characters not widely supported, in
particular "Steps in synchronous sections are marked with ⌛.")

Also, even if you're attentive to handling these errors, most of the time
you don't want to.  In my experience, it's very uncommon to want to
explicitly handle very rare errors like "the user pasted in an unpaired
surrogate".  There's rarely anything useful you can do, except to walk
through the string and change the unpaired surrogates to U+FFFD, so you can
move on.  I'd rather just get U+FFFD to begin with.
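
For concreteness, here is a rough sketch of the "walk through the string" cleanup described above. The function name replaceUnpairedSurrogates is hypothetical and not part of any spec or API under discussion; it only illustrates the kind of boilerplate a caller would otherwise have to write.

```typescript
// Hypothetical helper (not from any spec): returns a copy of `input` in which
// every unpaired UTF-16 surrogate code unit is replaced with U+FFFD.
function replaceUnpairedSurrogates(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: valid only when immediately followed by a low surrogate.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed pair: copy both code units and skip the second one.
        out += input.charAt(i) + input.charAt(i + 1);
        i++;
      } else {
        out += "\ufffd";
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      out += "\ufffd";
    } else {
      out += input.charAt(i);
    }
  }
  return out;
}
```

With this, "a\ud800b" becomes "a\ufffdb", while a well-formed pair such as "\ud83d\ude00" passes through unchanged. If the conversion did the replacement itself, nobody would need to carry a helper like this around.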

-- 
Glenn Maynard

Received on Wednesday, 29 February 2012 00:07:32 UTC