Re: [whatwg] Encoding: API from Joshua Bell on 2012-10-11 (public-whatwg-archive@w3.org from October 2012)

From: Joshua Bell <jsbell@chromium.org>
Date: Thu, 11 Oct 2012 09:37:46 -0700
To: Anne van Kesteren <annevk@annevk.nl>
Cc: WHATWG <whatwg@whatwg.org>
Message-ID: <CAD649j5N9e29N3zvcyEW-ZAZD4nKDM=t4gzwtU0_qHs4GjyENA@mail.gmail.com>

On Wed, Oct 10, 2012 at 11:36 PM, Anne van Kesteren <annevk@annevk.nl> wrote:

> On Thu, Oct 11, 2012 at 6:09 AM, Anne van Kesteren <annevk@annevk.nl> wrote:
> > On Wed, Oct 10, 2012 at 7:28 PM, Joshua Bell <jsbell@chromium.org> wrote:
> >> Practically speaking, this would mean refactoring the combined spec so that
> >> the current BOM handling is defined for parsing web content outside of the
> >> API rather than requiring the API to hack around it.
> >
> > You would still get the hack because the API requires special
> > treatment for "utf-16". Given that per Unicode "utf-16le" and
> > "utf-16be" outlaw the BOM, maybe a good solution would be a flag to
> > disable BOM handling as seen by the decode algorithm? So the decoder
> > gets a disableBOM flag that defaults to false? That would only require
> > a special case for BOM handling on top of what there is today, which
> > seems a fair bit cleaner.
>
> The main problem with this is that you would get a leading BOM in
> utf-8 if the content includes that. An unlikely scenario, but maybe we
> want to take care of that. Another approach I thought about is that we
> have an "API decode" algorithm, which is very similar to
>
> http://encoding.spec.whatwg.org/#decode
>
> However, instead of setting the encoding, it checks if the leading
> bytes match, and if the encoding matches, and only then does it set
> the offset. So the BOM would be skipped for utf-8/utf-16 if it was a
> valid BOM, but a BOM invalid for the given encoding would never switch
> the encoding.
>
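
To make the "API decode" idea above concrete, here is a rough sketch of the
offset check as I read it: the leading bytes are compared only against the BOM
of the encoding that was asked for, and a match sets the offset without ever
switching the encoding. The helper name bomOffset, the BOM_BYTES table, and the
lowercase encoding names are assumptions made for illustration; none of them
come from the spec.

```javascript
// Hypothetical helper, not spec text: BOM byte sequences keyed by the
// (assumed lowercase) encoding name.
const BOM_BYTES = new Map([
  ["utf-8", [0xef, 0xbb, 0xbf]],
  ["utf-16le", [0xff, 0xfe]],
  ["utf-16be", [0xfe, 0xff]],
]);

// Returns the number of leading bytes to skip. The encoding is never
// changed: a BOM that belongs to a different encoding yields offset 0.
function bomOffset(bytes, encoding) {
  const bom = BOM_BYTES.get(encoding);
  if (!bom || bytes.length < bom.length) return 0;
  return bom.every((b, i) => bytes[i] === b) ? bom.length : 0;
}
```

With this shape, a UTF-16LE BOM in front of data declared as utf-16be is left
in place rather than flipping the decoder to little-endian.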

It sounds like there are several desirable behaviors:

1. ignore BOM handling entirely (BOM would be present in output, or fatal)
2. if matching BOM, consume; otherwise, ignore (mismatching BOM would be
present in output, or fatal)
3. switch encoding based on BOM (any of UTF-8, UTF-16LE, UTF-16BE)
4. switch encoding based on BOM if-and-only-if "UTF-16" explicitly
specified, and only to one of the UTF-16 variants

The current spec supports (2) and (4).

Perhaps we should embrace this, and add another option to TextDecoderOptions:

1. { bom: "ignore" }
2. { bom: "consume" } // default?
3. { bom: "detect" }

... and users who want #4 can use #3; at worst, if they're expecting
UTF-16XX data and get UTF-8 data with a BOM, it will not explode on them.
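
A minimal sketch of how the three modes might behave, assuming the "bom"
option as proposed here (it is not an existing TextDecoder option). The names
decodeWithBOMMode, sniffBOM, and KNOWN_BOMS are hypothetical, and encoding
names are assumed to be already lowercased. The sketch leans on the decoder's
ignoreBOM flag so that all BOM handling happens in the wrapper:

```javascript
// Hypothetical sketch of the proposed { bom: ... } modes.
const KNOWN_BOMS = [
  ["utf-8", [0xef, 0xbb, 0xbf]],
  ["utf-16le", [0xff, 0xfe]],
  ["utf-16be", [0xfe, 0xff]],
];

// Returns { encoding, length } for a recognized leading BOM, else null.
function sniffBOM(bytes) {
  for (const [encoding, bom] of KNOWN_BOMS) {
    if (bytes.length >= bom.length && bom.every((b, i) => bytes[i] === b)) {
      return { encoding, length: bom.length };
    }
  }
  return null;
}

function decodeWithBOMMode(bytes, encoding, bom = "consume") {
  const found = sniffBOM(bytes);
  let offset = 0;
  if (found && bom === "detect") {
    // #3: any recognized BOM overrides the requested encoding.
    encoding = found.encoding;
    offset = found.length;
  } else if (found && bom === "consume" && found.encoding === encoding) {
    // #2: skip the BOM only when it matches the requested encoding.
    offset = found.length;
  }
  // #1 ("ignore") falls through with offset 0, so the BOM stays in the output.
  // ignoreBOM: true keeps the built-in decoder from stripping a BOM itself.
  const decoder = new TextDecoder(encoding, { ignoreBOM: true });
  return decoder.decode(bytes.subarray(offset));
}
```

For example, with "detect" a caller who asked for utf-16le but received UTF-8
bytes with a BOM still gets readable text back, which is the fallback
described above for people who really wanted #4.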


> The behavior of the normal decode algorithm does not need to be
> exposed through the API I think, unless a use case comes up at some
> point.
>

That would be equivalent to #3, correct?

-- Josh

ObQuote: "Some days you just can't get rid of a BOM."

Received on Thursday, 11 October 2012 16:38:14 UTC