Re: [FileAPI, common] UTF-16 to UTF-8 conversion

On Tue, Feb 28, 2012 at 12:11 AM, Simon Pieters <simonp@opera.com> wrote:

> > I think WebSocket should do the same, for the same reason.
> 

> > Have you filed a bug?
> 
> (No, not until this conversation moves along a bit further.)

> On Tue, Feb 28, 2012 at 8:26 AM, Jonas Sicking <jonas@sicking.cc> wrote:

> > I agree that it "scrambles" the data. But no more than the HTML
> > parser error recovery does. And if an unexpected exception is
> > thrown then the result is likely dataloss which is not obviously
> > better than scrambling part of the data.
> 

> I'd say it's weaker than "scrambles", actually, at least with
> human-readable text. Replacing one character with U+FFFD usually
> results in an isolated glitch that a reader can recover from. (I do
> this regularly when reading the HTML spec, which uses characters not
> widely supported, in particular "Steps in synchronous sections are
> marked with ?.")

> Also, even if you're attentive to handling these errors, most of the
> time you don't want to. In my experience, it's very uncommon to want
> to explicitly handle very rare errors like "the user pasted in an
> unpaired surrogate". There's rarely anything useful you can do,
> except to walk through the string and change the unpaired surrogates
> to U+FFFD, so you can move on. I'd rather just get U+FFFD to begin
> with.

OK, I've updated the Editor's Draft to reflect this. Essentially, I took Anne's advice: the DOMString is first converted to a sequence of Unicode characters using the algorithm defined in WebIDL (namely this one: http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode).

This seems to take care of unmatched surrogates in the UTF-16 input: the WebIDL conversion replaces each of them with U+FFFD, so encoding the resulting Unicode characters as UTF-8 is always well defined. So we may have what we need here.
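
To make the two steps concrete, here is a rough sketch of what I understand the conversion to be. This is an illustration only, not text from either draft: the function names toScalarValues and encodeUtf8 are made up for this example, and the first function is my reading of the WebIDL steps (an unpaired surrogate becomes U+FFFD).

```typescript
// Hypothetical helper: DOMString (UTF-16 code units) to Unicode scalar values.
// Assumption: unpaired surrogates are replaced with U+FFFD.
function toScalarValues(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0xd800 || c > 0xdfff) {
      out.push(c); // not a surrogate: copy through
    } else if (c <= 0xdbff && i + 1 < s.length) {
      const d = s.charCodeAt(i + 1);
      if (d >= 0xdc00 && d <= 0xdfff) {
        // well-formed pair: combine into one supplementary code point
        out.push(0x10000 + ((c - 0xd800) << 10) + (d - 0xdc00));
        i++;
      } else {
        out.push(0xfffd); // high surrogate not followed by a low one
      }
    } else {
      out.push(0xfffd); // lone low surrogate, or high surrogate at the end
    }
  }
  return out;
}

// Hypothetical helper: scalar values to UTF-8 bytes.
// The input never contains surrogates, so every value is encodable.
function encodeUtf8(scalars: number[]): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < scalars.length; i++) {
    const cp = scalars[i];
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}
```

With this, encodeUtf8(toScalarValues("a\uD800b")) gives the bytes 61 EF BF BD 62: the lone surrogate is written out as the UTF-8 form of U+FFFD and nothing throws.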

This is the 29th February Editor's Draft (ensure you shift-reload if necessary): 

http://dev.w3.org/2006/webapi/FileAPI/ 

I'd appreciate a review. If this passes muster, we may be one step further along the way to deprecating BlobBuilder, which only stipulated writing out as UTF-8 when the DOMString was appended to the Blob. 

-- A* 

Received on Wednesday, 29 February 2012 00:46:42 UTC