- From: Joshua Bell <jsbell@chromium.org>
- Date: Thu, 11 Oct 2012 09:37:46 -0700
- To: Anne van Kesteren <annevk@annevk.nl>
- Cc: WHATWG <whatwg@whatwg.org>
On Wed, Oct 10, 2012 at 11:36 PM, Anne van Kesteren <annevk@annevk.nl> wrote:

> On Thu, Oct 11, 2012 at 6:09 AM, Anne van Kesteren <annevk@annevk.nl> wrote:
> > On Wed, Oct 10, 2012 at 7:28 PM, Joshua Bell <jsbell@chromium.org> wrote:
> >> Practically speaking, this would mean refactoring the combined spec so that
> >> the current BOM handling is defined for parsing web content outside of the
> >> API rather than requiring the API to hack around it.
> >
> > You would still get the hack because the API requires special
> > treatment for "utf-16". Given that per Unicode "utf-16le" and
> > "utf-16be" outlaw the BOM, maybe a good solution would be a flag to
> > disable BOM handling as seen by the decode algorithm? So the decoder
> > gets a disableBOM flag that defaults to false? That would only require
> > a special case for BOM handling on top of what there is today, which
> > seems a fair bit cleaner.
>
> The main problem with this is that you would get a leading BOM in
> utf-8 if the content includes that. An unlikely scenario, but maybe we
> want to take care of that. Another approach I thought about is that we
> have an "API decode" algorithm, which is very similar to
>
> http://encoding.spec.whatwg.org/#decode
>
> However, instead of setting the encoding, it checks if the leading
> bytes match, and if the encoding matches, and only then does it set
> the offset. So the BOM would be skipped for utf-8/utf-16 if it was a
> valid BOM, but a BOM invalid for the given encoding would never switch
> the encoding.
>

It sounds like there are several desirable behaviors:

1. ignore BOM handling entirely (BOM would be present in output, or fatal)
2. if matching BOM, consume; otherwise, ignore (mismatching BOM would be
   present in output, or fatal)
3. switch encoding based on BOM (any of UTF-8, UTF-16LE, UTF-16BE)
4. switch encoding based on BOM if-and-only-if "UTF-16" explicitly
   specified, and only to one of the UTF-16 variants

Current spec supports (2) and (4).

Perhaps we should embrace this, and add another option to
TextDecoderOptions:

1. { bom: "ignore" }
2. { bom: "consume" } // default?
3. { bom: "detect" }

... and users who want #4 can use #3 and at worst if they're expecting
UTF-16XX data and get UTF-8 data with a BOM it will not explode on them.

> The behavior of the normal decode algorithm does not need to be
> exposed through the API I think, unless a use case comes up at some
> point.
>

That would be equivalent to #3, correct?

-- Josh

ObQuote: "Some days you just can't get rid of a BOM."
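
For illustration, the three proposed `bom` modes could be sketched roughly as below. This is only a sketch of the proposal in this message, not spec text or a shipped API: the names `resolveBom`, `BomMode` and `Encoding` are hypothetical, the set of encodings is limited to the three named in behavior #3, and the "or fatal" variants are left out. The function only decides which encoding to use and how many leading bytes to skip; it does not decode anything.

```typescript
// Hypothetical names throughout; a sketch of the proposed option, not an API.
type Encoding = "utf-8" | "utf-16le" | "utf-16be";
type BomMode = "ignore" | "consume" | "detect";

// Byte order marks for the three encodings named in behavior #3.
const BOMS: ReadonlyArray<{ encoding: Encoding; bytes: number[] }> = [
  { encoding: "utf-8", bytes: [0xef, 0xbb, 0xbf] },
  { encoding: "utf-16le", bytes: [0xff, 0xfe] },
  { encoding: "utf-16be", bytes: [0xfe, 0xff] },
];

function startsWith(input: Uint8Array, prefix: number[]): boolean {
  return input.length >= prefix.length && prefix.every((b, i) => input[i] === b);
}

// Returns the encoding to decode with and the number of leading bytes to skip.
function resolveBom(
  input: Uint8Array,
  encoding: Encoding,
  mode: BomMode
): { encoding: Encoding; offset: number } {
  // "ignore": no BOM handling at all, so any BOM stays in the input.
  if (mode === "ignore") return { encoding, offset: 0 };

  for (const bom of BOMS) {
    if (!startsWith(input, bom.bytes)) continue;
    // "detect": any recognized BOM switches the encoding and is skipped.
    // "consume": the BOM is skipped only if it matches the requested encoding.
    if (mode === "detect" || bom.encoding === encoding) {
      return { encoding: bom.encoding, offset: bom.bytes.length };
    }
  }

  // No BOM, or (in "consume" mode) a mismatching one: leave the input alone.
  return { encoding, offset: 0 };
}
```

A caller would then decode `input.subarray(offset)` with the returned encoding. With `"consume"`, a UTF-8 BOM in front of data declared as UTF-16LE yields offset 0 and no encoding switch, which is the "would never switch the encoding" behavior described in the quoted text.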
Received on Thursday, 11 October 2012 16:38:14 UTC