- From: Anne van Kesteren <annevk@annevk.nl>
- Date: Thu, 11 Oct 2012 08:36:36 +0200
- To: Joshua Bell <jsbell@chromium.org>
- Cc: WHATWG <whatwg@whatwg.org>
On Thu, Oct 11, 2012 at 6:09 AM, Anne van Kesteren <annevk@annevk.nl> wrote:
> On Wed, Oct 10, 2012 at 7:28 PM, Joshua Bell <jsbell@chromium.org> wrote:
>> Practically speaking, this would mean refactoring the combined spec so that
>> the current BOM handling is defined for parsing web content outside of the
>> API rather than requiring the API to hack around it.
>
> You would still get the hack because the API requires special
> treatment for "utf-16". Given that per Unicode "utf-16le" and
> "utf-16be" outlaw the BOM, maybe a good solution would be a flag to
> disable BOM handling as seen by the decode algorithm? So the decoder
> gets a disableBOM flag that defaults to false? That would only require
> a special case for BOM handling on top of what there is today, which
> seems a fair bit cleaner.

The main problem with this is that you would get a leading BOM in
utf-8 if the content includes that. An unlikely scenario, but maybe we
want to take care of that.

Another approach I thought about is that we have an "API decode"
algorithm, which is very similar to
http://encoding.spec.whatwg.org/#decode However, instead of setting
the encoding, it checks if the leading bytes match, and if the
encoding matches, and only then does it set the offset. So the BOM
would be skipped for utf-8/utf-16 if it was a valid BOM, but a BOM
invalid for the given encoding would never switch the encoding.

The behavior of the normal decode algorithm does not need to be
exposed through the API I think, unless a use case comes up at some
point.


-- 
http://annevankesteren.nl/
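
A rough sketch of the "API decode" idea described above, in Python rather than spec prose. The function name api_decode, the BOMS table, and the use of Python's codecs with errors="replace" as a stand-in for the decoder's error handling are all assumptions made for illustration; none of this is text from the Encoding spec. The point it tries to show is only that a BOM sets the offset when it matches the requested encoding and never changes which encoding is used.

```python
# Hypothetical sketch: skip a leading BOM only when it belongs to the
# encoding the caller asked for. The BOM never switches the encoding.

# Requested label -> (Python codec name, BOM bytes for that encoding).
BOMS = {
    "utf-8": ("utf-8", b"\xef\xbb\xbf"),
    "utf-16le": ("utf-16-le", b"\xff\xfe"),
    "utf-16be": ("utf-16-be", b"\xfe\xff"),
}


def api_decode(data: bytes, encoding: str) -> str:
    """Decode `data` as `encoding`, dropping a matching leading BOM."""
    codec, bom = BOMS[encoding]
    # Set the offset past the BOM only if the leading bytes match the BOM
    # of the requested encoding; otherwise start at byte 0.
    offset = len(bom) if data.startswith(bom) else 0
    # errors="replace" stands in for emitting U+FFFD on invalid input.
    return data[offset:].decode(codec, errors="replace")
```

Under these assumptions, api_decode(b"\xef\xbb\xbfhi", "utf-8") drops the BOM, while a UTF-16 BOM passed with "utf-8" is not skipped and is simply decoded as (invalid) UTF-8 bytes.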
Received on Thursday, 11 October 2012 06:37:09 UTC