Re: [FileAPI, common] UTF-16 to UTF-8 conversion from Arun Ranganathan on 2012-02-29 (public-webapps@w3.org from January to March 2012)

From: Arun Ranganathan <aranganathan@mozilla.com>
Date: Tue, 28 Feb 2012 16:46:13 -0800 (PST)
To: Glenn Maynard <glenn@zewt.org>, Simon Pieters <simonp@opera.com>, Eric U <ericu@google.com>
Cc: public-webapps@w3.org, Jonas Sicking <jonas@sicking.cc>
Message-ID: <177945552.1215415.1330476373935.JavaMail.root@zimbra1.shared.sjc1.mozilla.com>

On Tue, Feb 28, 2012 at 12:11 AM, Simon Pieters <simonp@opera.com> wrote:

> > I think WebSocket should do the same, for the same reason.
> 

> > Have you filed a bug?
> 
> (No, not until this conversation moves along a bit further.)

> On Tue, Feb 28, 2012 at 8:26 AM, Jonas Sicking <jonas@sicking.cc>
> wrote:

> > I agree that it "scrambles" the data. But no more than the HTML
> > parser error recovery does. And if an unexpected exception is
> > thrown then the result is likely dataloss which is not obviously
> > better than scrambling part of the data.
> 

> I'd say it's weaker than "scrambles", actually, at least with
> human-readable text. Replacing one character with U+FFFD usually
> results in an isolated glitch that a reader can recover from. (I do
> this regularly when reading the HTML spec, which uses characters not
> widely supported, in particular "Steps in synchronous sections are
> marked with ⌛.")

> Also, even if you're attentive to handling these errors, most of the
> time you don't want to. In my experience, it's very uncommon to want
> to explicitly handle very rare errors like "the user pasted in an
> unpaired surrogate". There's rarely anything useful you can do,
> except to walk through the string and change the unpaired surrogates
> to U+FFFD, so you can move on. I'd rather just get U+FFFD to begin
> with.

OK, I've updated the Editor's Draft to reflect this. Essentially, I took Anne's advice: first convert the DOMString to a sequence of Unicode characters using the algorithm defined in WebIDL (namely this one: http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode).

This actually seems to take care of unmatched surrogates in the UTF-16 input: the conversion algorithm replaces each one with U+FFFD, and the resulting Unicode characters are then encoded as UTF-8, so we may have what we need here.
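
To make the two steps concrete, here is a rough sketch of how I read them, not normative text. The function names (toUnicodeScalars, encodeUtf8, domStringToUtf8) are made up for illustration and appear in neither spec. The first function follows the shape of the WebIDL conversion (unpaired surrogates become U+FFFD), and the second is a plain UTF-8 encoder over the resulting code points.

```typescript
// Step 1 (illustrative): turn UTF-16 code units into Unicode code points,
// replacing any unpaired surrogate with U+FFFD.
function toUnicodeScalars(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0xd800 || c > 0xdfff) {
      out.push(c); // not a surrogate: copy through
    } else if (c >= 0xdc00) {
      out.push(0xfffd); // low surrogate with no preceding high surrogate
    } else if (i === s.length - 1) {
      out.push(0xfffd); // high surrogate at end of string
    } else {
      const d = s.charCodeAt(i + 1);
      if (d >= 0xdc00 && d <= 0xdfff) {
        // valid pair: combine into one supplementary code point
        out.push(0x10000 + ((c - 0xd800) << 10) + (d - 0xdc00));
        i++;
      } else {
        out.push(0xfffd); // high surrogate not followed by a low surrogate
      }
    }
  }
  return out;
}

// Step 2 (illustrative): encode the code points as UTF-8 bytes.
// Input never contains surrogates, so every value is encodable.
function encodeUtf8(scalars: number[]): Uint8Array {
  const bytes: number[] = [];
  for (const cp of scalars) {
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return new Uint8Array(bytes);
}

// Both steps together, as they would apply to a DOMString Blob part.
function domStringToUtf8(s: string): Uint8Array {
  return encodeUtf8(toUnicodeScalars(s));
}
```

With this, a string such as "a\uD800b" comes out as the bytes 61 EF BF BD 62: the lone surrogate turns into the UTF-8 form of U+FFFD rather than throwing or producing invalid UTF-8.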

This is the 29th February Editor's Draft (ensure you shift-reload if necessary): 

http://dev.w3.org/2006/webapi/FileAPI/ 

I'd appreciate a review. If this passes muster, we may be one step further along the way to deprecating BlobBuilder, which only stipulated writing out as UTF-8 when the DOMString was appended to the Blob. 

-- A*

Received on Wednesday, 29 February 2012 00:46:42 UTC