Re: [whatwg] Encoding: API from Anne van Kesteren on 2012-10-11 (public-whatwg-archive@w3.org from October 2012)

From: Anne van Kesteren <annevk@annevk.nl>
Date: Thu, 11 Oct 2012 08:36:36 +0200
To: Joshua Bell <jsbell@chromium.org>
Cc: WHATWG <whatwg@whatwg.org>
Message-ID: <CADnb78gOiYbw=1st+q3ghy3CBMjJ6dOLjqn4u=OSBvEkacpHnA@mail.gmail.com>

On Thu, Oct 11, 2012 at 6:09 AM, Anne van Kesteren <annevk@annevk.nl> wrote:
> On Wed, Oct 10, 2012 at 7:28 PM, Joshua Bell <jsbell@chromium.org> wrote:
>> Practically speaking, this would mean refactoring the combined spec so that
>> the current BOM handling is defined for parsing web content outside of the
>> API rather than requiring the API to hack around it.
>
> You would still get the hack because the API requires special
> treatment for "utf-16". Given that per Unicode "utf-16le" and
> "utf-16be" outlaw the BOM, maybe a good solution would be a flag to
> disable BOM handling as seen by the decode algorithm? So the decoder
> gets a disableBOM flag that defaults to false? That would only require
> a special case for BOM handling on top of what there is today, which
> seems a fair bit cleaner.

The main problem with this is that, with BOM handling disabled, you
would get a leading BOM in the decoded utf-8 output if the content
includes one. An unlikely scenario, but maybe we want to take care of
it. Another approach I thought about is to have an "API decode"
algorithm, which is very similar to
http://encoding.spec.whatwg.org/#decode

However, instead of setting the encoding, it checks whether the
leading bytes match a BOM and whether that BOM's encoding matches the
given encoding, and only then does it set the offset. So the BOM would
be skipped for utf-8/utf-16 if it was a valid BOM for that encoding,
but a BOM that does not match the given encoding would never switch
the encoding.
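
To make the idea concrete, here is a minimal sketch in Python. It is
not spec text: the names api_decode, BOMS and PY_CODEC are made up for
illustration, Python's built-in codecs stand in for the spec's
decoders, and the plain "utf-16" label is left out because it needs
its own treatment.

```python
# Hypothetical sketch of an "API decode" step; not taken from the spec.

# BOM byte sequences for the encodings that have one.
BOMS = {
    "utf-8": b"\xef\xbb\xbf",
    "utf-16le": b"\xff\xfe",
    "utf-16be": b"\xfe\xff",
}

# Map the labels above to Python codec names that do NOT strip a BOM.
PY_CODEC = {
    "utf-8": "utf-8",
    "utf-16le": "utf-16-le",
    "utf-16be": "utf-16-be",
}


def api_decode(data: bytes, encoding: str) -> str:
    """Decode `data` with `encoding`, skipping only a matching BOM.

    Unlike the normal decode algorithm, a BOM never changes the
    encoding: it only moves the start offset, and only when it is the
    BOM of the encoding the caller asked for.
    """
    bom = BOMS.get(encoding)
    offset = len(bom) if bom is not None and data.startswith(bom) else 0
    codec = PY_CODEC.get(encoding, encoding)
    # Malformed input becomes U+FFFD rather than raising an error.
    return data[offset:].decode(codec, errors="replace")
```

With this, api_decode(b"\xef\xbb\xbfhi", "utf-8") gives "hi", while a
utf-16be BOM passed with "utf-16le" is not skipped and does not switch
the encoding; its two bytes are simply decoded as utf-16le.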

I think the behavior of the normal decode algorithm does not need to
be exposed through the API, unless a use case comes up at some point.

-- 
http://annevankesteren.nl/

Received on Thursday, 11 October 2012 06:37:09 UTC