- From: Jonas Sicking <jonas@sicking.cc>
- Date: Fri, 17 Aug 2012 00:23:18 -0700
- To: Joshua Bell <jsbell@chromium.org>
- Cc: WHAT Working Group <whatwg@lists.whatwg.org>
On Tue, Aug 14, 2012 at 10:34 AM, Joshua Bell <jsbell@chromium.org> wrote:
> On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard <glenn@zewt.org> wrote:
>
>> I agree with Jonas that encoding should just use a replacement character
>> (U+FFFD for Unicode encodings, '?' otherwise), and that we should put
>> off other modes (e.g. exceptions and user-specified replacement
>> characters) until there's a clear need.
>>
>> My intuition is that encoding DOMString to UTF-16 should never have
>> errors; if there are dangling surrogates, pass them through unchanged.
>> There's no point in using a placeholder that says "an error occurred
>> here" when the error can be passed through in exactly the same form
>> (not possible with e.g. DOMString->SJIS). I don't feel strongly about
>> this only because outputting UTF-16 is so rare to begin with.
>>
>> On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell@chromium.org> wrote:
>>
>> > - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and
>> > includes the byte order mark (the encoding-specific serialization of
>> > U+FEFF).
>>
>> This rarely detects the wrong type, but that doesn't mean it's not the
>> wrong answer. If my input is meant to be UTF-8, and someone hands me
>> BOM-marked UTF-16, I want it to fail in the same way it would if
>> someone passed in SJIS. I don't want it silently translated.
>>
>> On the other hand, it probably does make sense for UTF-16 to switch to
>> UTF-16BE, since that's by definition the original purpose of the BOM.
>>
>> The convention iconv uses, which I think is a useful one, is that
>> decoding from "UTF-16" means "try to figure out the encoding from the
>> BOM, if any", while "UTF-16LE" and "UTF-16BE" mean "always use this
>> exact encoding".
>
> Let me take a crack at making this into an algorithm:
>
> In the TextDecoder constructor:
>
> - If encoding is not specified, set an internal useBOM flag.
> - If encoding is specified and is a case-insensitive match for "utf-16",
>   set an internal useBOM flag.
>
> NOTE: This means if "utf-8", "utf-16le" or "utf-16be" is explicitly
> specified, the flag is not set.
>
> When decode() is called:
>
> - If useBOM is set and the stream offset is 0, then:
>   - If there are not enough bytes to test for a BOM, return without
>     emitting anything. (NOTE: if not streaming, an EOF byte would be
>     present in the stream, which would be a negative match for a BOM.)
>   - If encoding is "utf-16" and the first bytes match 0xFF 0xFE or
>     0xFE 0xFF, then set the current encoding to "utf-16" or "utf-16be"
>     respectively and advance the stream past the BOM. The current
>     encoding is used until the stream is reset.
>   - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF
>     0xBB 0xBF, then set the current encoding to "utf-16", "utf-16be" or
>     "utf-8" respectively and advance the stream past the BOM. The
>     current encoding is used until the stream is reset.

This doesn't sound right. The effect of the rules so far would be that if
you create a decoder and specify "utf-16" as the encoding, and the first
bytes in the stream are 0xEF 0xBB 0xBF, you'd silently switch to "utf-8"
decoding.

/ Jonas
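For concreteness, here is a rough JavaScript transliteration of the steps quoted above. The names (makeDecoderState, sniffBOM, useBOM) are illustrative, not from any spec, and the else-chain deliberately mirrors the proposal as written, so it also reproduces the surprise Jonas flags: a decoder created with "utf-16" silently switches to "utf-8" when the stream starts with 0xEF 0xBB 0xBF.

```javascript
// Hypothetical sketch of the proposed constructor behavior.
function makeDecoderState(encoding) {
  const label = encoding === undefined ? undefined : encoding.toLowerCase();
  return {
    encoding: label === undefined ? "utf-8" : label,
    // useBOM is set when no encoding is given, or when "utf-16" is given
    // case-insensitively; "utf-8", "utf-16le", "utf-16be" leave it unset.
    useBOM: label === undefined || label === "utf-16",
    offset: 0, // stream offset
  };
}

// Hypothetical sketch of the BOM-sniffing step of decode().
function sniffBOM(state, bytes) {
  if (!state.useBOM || state.offset !== 0) return;
  if (bytes.length < 3) return; // not enough bytes yet to test for a BOM
  const [b0, b1, b2] = bytes;
  if (state.encoding === "utf-16" && b0 === 0xFF && b1 === 0xFE) {
    state.offset = 2; // LE BOM: keep "utf-16"
  } else if (state.encoding === "utf-16" && b0 === 0xFE && b1 === 0xFF) {
    state.encoding = "utf-16be";
    state.offset = 2;
  } else if (b0 === 0xFF && b1 === 0xFE) {
    state.encoding = "utf-16";
    state.offset = 2;
  } else if (b0 === 0xFE && b1 === 0xFF) {
    state.encoding = "utf-16be";
    state.offset = 2;
  } else if (b0 === 0xEF && b1 === 0xBB && b2 === 0xBF) {
    // NOTE: this branch also fires when "utf-16" was requested,
    // which is exactly the problem described above.
    state.encoding = "utf-8";
    state.offset = 3;
  }
  state.useBOM = false; // the BOM check only happens once per stream
}
```

For example, `sniffBOM(makeDecoderState("utf-16"), [0xEF, 0xBB, 0xBF])` leaves the state with `encoding === "utf-8"`, the silent switch Jonas objects to.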
Received on Friday, 17 August 2012 07:25:20 UTC