- From: Joshua Bell <jsbell@chromium.org>
- Date: Tue, 14 Aug 2012 10:34:51 -0700
- To: WHAT Working Group <whatwg@lists.whatwg.org>
On Mon, Aug 6, 2012 at 5:06 PM, Glenn Maynard <glenn@zewt.org> wrote:

> I agree with Jonas that encoding should just use a replacement character
> (U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
> other modes (e.g. exceptions and user-specified replacement characters)
> until there's a clear need.
>
> My intuition is that encoding DOMString to UTF-16 should never have
> errors; if there are dangling surrogates, pass them through unchanged.
> There's no point in using a placeholder that says "an error occurred
> here" when the error can be passed through in exactly the same form (not
> possible with e.g. DOMString->SJIS). I don't feel strongly about this,
> only because outputting UTF-16 is so rare to begin with.
>
> On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell@chromium.org> wrote:
>
> > - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and
> > includes the byte order mark (the encoding-specific serialization of
> > U+FEFF).
>
> This rarely detects the wrong type, but that doesn't mean it's not the
> wrong answer. If my input is meant to be UTF-8, and someone hands me
> BOM-marked UTF-16, I want it to fail in the same way it would if someone
> passed in SJIS. I don't want it silently translated.
>
> On the other hand, it probably does make sense for UTF-16 to switch to
> UTF-16BE, since that's by definition the original purpose of the BOM.
>
> The convention iconv uses, which I think is a useful one, is that
> decoding from "UTF-16" means "try to figure out the encoding from the
> BOM, if any", and "UTF-16LE" and "UTF-16BE" mean "always use this exact
> encoding".

Let me take a crack at making this into an algorithm.

In the TextDecoder constructor:

- If encoding is not specified, set an internal useBOM flag.
- If encoding is specified and is a case-insensitive match for "utf-16",
  set an internal useBOM flag. NOTE: This means that if "utf-8",
  "utf-16le" or "utf-16be" is explicitly specified, the flag is not set.

When decode() is called:

- If useBOM is set and the stream offset is 0, then:
  - If there are not enough bytes to test for a BOM, return without
    emitting anything. (NOTE: if not streaming, an EOF byte would be
    present in the stream, which would be a negative match for a BOM.)
  - If the encoding is "utf-16" and the first bytes match 0xFF 0xFE or
    0xFE 0xFF, set the current encoding to "utf-16" or "utf-16be"
    respectively and advance the stream past the BOM. The current encoding
    is used until the stream is reset.
  - Otherwise, if the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB
    0xBF, set the current encoding to "utf-16", "utf-16be" or "utf-8"
    respectively and advance the stream past the BOM. The current encoding
    is used until the stream is reset.
- Otherwise, if useBOM is not set and the stream offset is 0, then if the
  encoding is "utf-8", "utf-16" or "utf-16be":
  - If the first bytes match 0xFF 0xFE, 0xFE 0xFF, or 0xEF 0xBB 0xBF, let
    the detected encoding be "utf-16", "utf-16be" or "utf-8" respectively.
    If the detected encoding matches the object's encoding, advance the
    stream past the BOM. Otherwise, if the fatal flag is set, throw an
    "EncodingError" DOMException. Otherwise, the decoding algorithm
    proceeds.
  - If there are not enough bytes to test for a BOM, return without
    emitting anything. (NOTE: if not streaming, an EOF byte would be
    inserted, which would be a negative match for a BOM.)

Working the "current encoding" switcheroo into the spec will require some
refactoring, so I'm trying to get consensus here first.
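For concreteness, here is a rough, hypothetical sketch of the above in
JavaScript. The names (sniffBOM, handleBOM, useBOM, currentEncoding,
offset) are invented for illustration; this is not spec text or a tested
implementation:

  // Test a byte sequence against a list of BOM candidates. Returns
  // {encoding, length} for a complete match, null for a definite
  // non-match, or "pending" when more bytes could still complete a BOM.
  function sniffBOM(bytes, candidates) {
    var pending = false;
    for (var i = 0; i < candidates.length; i++) {
      var prefix = candidates[i].prefix;
      var n = Math.min(prefix.length, bytes.length);
      var matches = true;
      for (var j = 0; j < n; j++) {
        if (prefix[j] !== bytes[j]) { matches = false; break; }
      }
      if (matches) {
        if (n === prefix.length)
          return { encoding: candidates[i].encoding, length: n };
        pending = true; // partial prefix match; need more bytes to decide
      }
    }
    return pending ? "pending" : null;
  }

  var UTF16_BOMS = [
    { prefix: [0xFF, 0xFE], encoding: "utf-16" },   // i.e. UTF-16LE
    { prefix: [0xFE, 0xFF], encoding: "utf-16be" }
  ];
  var ALL_BOMS = UTF16_BOMS.concat([
    { prefix: [0xEF, 0xBB, 0xBF], encoding: "utf-8" }
  ]);

  // Called by decode() when the stream offset is 0.
  function handleBOM(decoder, bytes, streaming) {
    if (!decoder.useBOM &&
        ["utf-8", "utf-16", "utf-16be"].indexOf(decoder.encoding) === -1)
      return; // other encodings never sniff for a BOM
    // TextDecoder("utf-16") only switches on the two UTF-16 BOMs; an
    // unspecified encoding sniffs all three.
    var candidates = (decoder.useBOM && decoder.encoding === "utf-16")
        ? UTF16_BOMS : ALL_BOMS;
    var result = sniffBOM(bytes, candidates);
    if (result === "pending" && streaming)
      return "need-more-bytes"; // emit nothing until we can decide
    if (result === null || result === "pending")
      return; // no BOM; at EOF a partial prefix is just data
    if (decoder.useBOM) {
      decoder.currentEncoding = result.encoding; // the "switcheroo"
      decoder.offset += result.length;           // consume the BOM
    } else if (result.encoding === decoder.encoding) {
      decoder.offset += result.length;           // matching BOM: consume
    } else if (decoder.fatal) {
      throw new DOMException("BOM mismatch", "EncodingError");
    }
    // mismatching BOM, non-fatal: leave the bytes to decode as-is
  }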
In English:

- Create a decoder with TextDecoder() and, if present, a BOM will be
  respected (and consumed); otherwise it defaults to UTF-8.
- Create a decoder with TextDecoder("utf-16") and either a UTF-16LE or
  UTF-16BE BOM will be respected (and consumed); otherwise it defaults to
  UTF-16LE (which may decode garbage if a UTF-8 BOM or other non-UTF-16
  data is present).
- Create a decoder with TextDecoder("utf-8", {fatal:true}),
  TextDecoder("utf-16le", {fatal:true}) or TextDecoder("utf-16be",
  {fatal:true}) and a matching BOM will be consumed; a mismatching BOM
  will throw an EncodingError.
- Create a decoder with TextDecoder("utf-8"), TextDecoder("utf-16le") or
  TextDecoder("utf-16be") and a matching BOM will be consumed; a
  mismatching BOM will be blithely decoded (probably giving you
  replacement characters) without throwing.

(A usage sketch at the end of this message illustrates these four cases.)

> > * If one of the UTF encodings is specified AND the BOM matches then
> > the leading BOM character (U+FEFF) MUST NOT be emitted in the output
> > character sequence (i.e. it is silently consumed)
>
> It's a little weird that
>
>   data = readFile("user-supplied-file.txt"); // shortcutting for brevity
>   var s = new TextDecoder("utf-16").decode(data); // or utf-8
>   s = s.replace("a", "b");
>   var data2 = new TextEncoder("utf-16").encode(s);
>   writeFile("user-supplied-file.txt", data2);
>
> causes the BOM to be quietly stripped away. Normally if you're modifying
> a file, you want to pass through the BOM (or lack thereof) untouched.
>
> One way to deal with this could be:
>
>   var decoder = new TextDecoder("utf-16");
>   var s = decoder.decode(data);
>   s = s.replace("a", "b");
>   var data2 = new TextEncoder(decoder.encoding).encode(s);
>
> where decoder.encoding is e.g. "UTF-16LE-BOM" if a BOM was present, thus
> preserving both the BOM and (for UTF-16) endianness. I don't actually
> like this, though, because I don't like the idea of decoder.encoding
> changing after the decoder has already been constructed.
>
> I think I agree with just stripping it, and people who want to preserve
> BOMs on write-through can jump through the hoops manually (which aren't
> terribly hard).

This gets easier if we restrict encoding to UTF-8, which typically doesn't
include a BOM. But it's looking like there's enough desire to keep UTF-16
encoding at the moment. Agreed: just strip it for now.

> Another issue is "new TextDecoder('ascii').encoding" (and ISO-8859-1)
> giving .encoding = "windows-1252". That's strange, even when you know
> why it's happening.
>
> Is there any reason to expose the actual "primary" names? It's not clear
> that the "name" column in the Encoding spec is even intended to be
> exposed to APIs; they look more like labels for specs to refer to
> internally. (Anne?) If there's no pressing reason to expose this, I'd
> suggest that the .encoding attribute simply return the name that was
> passed to the constructor.
>
> It's still not ideal (it's weird that asking for ASCII gives you
> something other than ASCII in the first place), but it at least seems a
> bit less strange. The "nice" fix would be to implement actual ASCII,
> ISO-8859-1, ISO-8859-9, etc. charsets, but that just means extra
> implementation work (and some charset proliferation) without use cases.

Leaning towards simply dropping the attribute. Does anyone advocate for
keeping it?
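To make the four "In English" cases above concrete, here is the promised
usage sketch. These are hypothetical calls; the comments show the outcomes
this proposal implies, not what any implementation does today:

  // A UTF-8 BOM and a UTF-16LE BOM, each followed by "A" in that encoding.
  var utf8Bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]);
  var utf16leBytes = new Uint8Array([0xFF, 0xFE, 0x41, 0x00]);

  new TextDecoder().decode(utf8Bytes);             // "A" (BOM consumed)
  new TextDecoder().decode(utf16leBytes);          // "A" (switches to UTF-16LE)

  new TextDecoder("utf-16").decode(utf16leBytes);  // "A" (BOM consumed)
  new TextDecoder("utf-16").decode(utf8Bytes);     // garbage: the UTF-8 BOM
                                                   // decodes as UTF-16LE units

  new TextDecoder("utf-8", {fatal: true}).decode(utf8Bytes);    // "A"
  new TextDecoder("utf-8", {fatal: true}).decode(utf16leBytes); // throws
                                                                // EncodingError

  new TextDecoder("utf-8").decode(utf16leBytes);   // "\uFFFD\uFFFDA\u0000",
                                                   // no exception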