Re: [whatwg] StringEncoding open issues

On Fri, Aug 17, 2012 at 5:19 PM, Jonas Sicking <jonas@sicking.cc> wrote:

> On Fri, Aug 17, 2012 at 7:15 AM, Glenn Maynard <glenn@zewt.org> wrote:
> > On Fri, Aug 17, 2012 at 2:23 AM, Jonas Sicking <jonas@sicking.cc> wrote:
> >>
> >> >       - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or
> >> >       0xFE 0xFF then set current encoding to "utf-16" or "utf-16be"
> >> >       respectively and advance the stream past the BOM. The current
> >> >       encoding is used until the stream is reset.
> >> >       - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or
> >> >       0xEF 0xBB 0xBF then set current encoding to "utf-16", "utf-16be"
> >> >       or "utf-8" respectively and advance the stream past the BOM. The
> >> >       current encoding is used until the stream is reset.
> >>
> >> This doesn't sound right. The effect of the rules so far would be that
> >> if you create a decoder and specify "utf-16" as encoding, and the
> >> first bytes in the stream are 0xEF 0xBB 0xBF you'd silently switch to
> >> "utf-8" decoding.
> >
> > I think the scope of the "otherwise" is unclear, and this is meant to be
> > "otherwise (if encoding is not "utf-16")".
>
> Ah, that would make sense. It effectively means "if encoding is not set".
>
> / Jonas
>

I've attempted to distill the above into the spec in an algorithmic way:
http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder

English version: If you specify "utf-16" you get endian-agnostic UTF-16
decoding: a leading BOM selects the byte order and is consumed. Failing
that, if the BOM matches the encoding you specified, the BOM is consumed.
Failing *that*, you get whatever behavior falls out of the decode
algorithm (garbage, error, etc).
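
For anyone who prefers code to prose, here is a rough TypeScript sketch of
the three cases above. It is not the spec text and not the JS shim:
resolveBOM, the Resolved type and the BOMS table are names made up for
illustration, and "utf-16le" is assumed here as the name of the explicit
little-endian encoding. It also assumes the first bytes are all available
at once, so it ignores a BOM split across stream chunks.

```typescript
// Hypothetical helper for illustration; not part of the StringEncoding API.
type Resolved = { encoding: string; bomLength: number };

// BOM byte patterns and the encoding each one signals.
// "utf-16le" is an assumed name for explicit little-endian UTF-16.
const BOMS: Array<{ encoding: string; bytes: number[] }> = [
  { encoding: "utf-8", bytes: [0xef, 0xbb, 0xbf] },
  { encoding: "utf-16le", bytes: [0xff, 0xfe] },
  { encoding: "utf-16be", bytes: [0xfe, 0xff] },
];

// True if `input` begins with every byte of `prefix`.
function startsWith(input: Uint8Array, prefix: number[]): boolean {
  if (input.length < prefix.length) return false;
  return prefix.every((b, i) => input[i] === b);
}

function resolveBOM(requested: string, input: Uint8Array): Resolved {
  for (const bom of BOMS) {
    if (!startsWith(input, bom.bytes)) continue;
    // Case 1: "utf-16" is endian-agnostic, so either UTF-16 BOM is
    // accepted and picks the byte order. A UTF-8 BOM never qualifies.
    const agnostic = requested === "utf-16" && bom.encoding !== "utf-8";
    // Case 2: the BOM matches the requested encoding, so consume it.
    if (agnostic || requested === bom.encoding) {
      return { encoding: bom.encoding, bomLength: bom.bytes.length };
    }
  }
  // Case 3: no usable BOM. Nothing is consumed, the requested encoding
  // is kept, and the bytes go to the decoder as they are.
  return { encoding: requested, bomLength: 0 };
}
```

The point of the second case is that a BOM never silently overrides an
explicit choice: "utf-16" fed 0xEF 0xBB 0xBF stays "utf-16" with nothing
skipped, which addresses Jonas's objection quoted above.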

The JS shim has *not* been updated yet.

Only part of this edit has been live for the last few weeks - apologies to
the Moz folks who were trying to understand what the half-specified
internal useBOM flag was for. Any implementer feedback so far?

Received on Monday, 17 September 2012 21:13:35 UTC