Re: [whatwg] StringEncoding open issues from Joshua Bell on 2012-09-17 (public-whatwg-archive@w3.org from September 2012)

From: Joshua Bell <jsbell@chromium.org>
Date: Mon, 17 Sep 2012 14:50:46 -0700
To: Anne van Kesteren <annevk@annevk.nl>
Cc: WHAT Working Group <whatwg@lists.whatwg.org>
Message-ID: <CAD649j7+y1yUvov=7nzOTeR1Kc49MyGkEpPP76k6=+J=hw9-QA@mail.gmail.com>

On Mon, Sep 17, 2012 at 2:17 PM, Anne van Kesteren <annevk@annevk.nl> wrote:

> On Mon, Sep 17, 2012 at 11:13 PM, Joshua Bell <jsbell@chromium.org> wrote:
> > I've attempted to distill the above into the spec in an algorithmic way:
> > http://wiki.whatwg.org/wiki/StringEncoding#TextDecoder
> >
> > English version: If you specify "utf-16" you get endian-agnostic UTF-16
> > encoding support. Failing that, if your encoding matches your BOM it is
> > consumed. Failing *that*, you get whatever behavior falls out of the decode
> > algorithm (garbage, error, etc).
>
> Why would we want the API to work different from how it works in
> markup (with <meta charset> etc.)? Granted it's not super logical, but
> I don't really see why we should make it inconsistent and more
> complicated.
>

That's how the spec started out, so a recap of this thread would give you
the back-and-forth that led here. To summarize:

Having a BOM in the content take priority over the encoding selected
by the developer was not seen as desirable (see earlier in the thread), and
was considered a potential source of errors. Selecting the encoding via BOM
(in general, or to emulate <meta charset>, etc.) was seen as something that
could be done in user code if desired, but unexpected otherwise.
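
For illustration only, BOM-based selection in user code might look roughly
like the following Python sketch. The helper name sniff_encoding and its BOM
table are hypothetical and not part of the proposed API; the caller passes
whatever fallback encoding it would otherwise have used.

```python
import codecs

# Hypothetical BOM table: each entry pairs a byte order mark with the
# encoding it implies. Only UTF-8 and UTF-16 are covered in this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_encoding(data: bytes, fallback: str) -> str:
    """Return the encoding implied by a leading BOM, else the fallback."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return fallback
```

A caller that wants markup-like behavior would pass the result of
sniff_encoding() as the encoding when decoding, so the BOM wins over the
caller's default only because the caller opted in.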

Two desired behaviors remained: (1) a developer need for endian-agnostic
UTF-16 support in which the BOM specifies the byte order, similar to ICU's
handling, which distinguishes "utf-16" from "utf-16le", and (2) that a BOM
matching the selected encoding should be consumed and not appear in the
decoded data.
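
As a rough sketch of those two behaviors (not the spec text), here is the
logic in Python. The function name decode, the label table, the
little-endian default when "utf-16" input has no BOM, and the use of
replacement characters on errors are all assumptions made for this example.

```python
import codecs

# Hypothetical label table: label -> (Python codec name, matching BOM).
_ENCODINGS = {
    "utf-8": ("utf-8", codecs.BOM_UTF8),
    "utf-16le": ("utf-16-le", codecs.BOM_UTF16_LE),
    "utf-16be": ("utf-16-be", codecs.BOM_UTF16_BE),
}


def decode(data: bytes, label: str) -> str:
    label = label.lower()
    if label == "utf-16":
        # Behavior (1): endian-agnostic UTF-16, the BOM picks the byte order.
        for name in ("utf-16le", "utf-16be"):
            codec, bom = _ENCODINGS[name]
            if data.startswith(bom):
                return data[len(bom):].decode(codec, errors="replace")
        # Assumption: with no BOM present, fall back to little-endian.
        return data.decode("utf-16-le", errors="replace")
    # Unknown labels are passed straight through with no BOM to match.
    codec, bom = _ENCODINGS.get(label, (label, b""))
    if bom and data.startswith(bom):
        # Behavior (2): a BOM matching the selected encoding is consumed.
        data = data[len(bom):]
    # Otherwise: whatever falls out of the decoder (garbage, U+FFFD, etc.).
    return data.decode(codec, errors="replace")
```

Note that a mismatched BOM (say, a big-endian BOM with "utf-16le") is neither
consumed nor used to switch encodings; its bytes are simply decoded along
with the rest of the input.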

Received on Monday, 17 September 2012 21:51:15 UTC