- From: Glenn Maynard <glenn@zewt.org>
- Date: Tue, 13 Mar 2012 19:19:24 -0500
Using Views instead of specifying the offset and length sounds good.

On Tue, Mar 13, 2012 at 6:28 PM, Ian Hickson <ian at hixie.ch> wrote:

> - What's the use case for supporting anything but UTF-8?

Other Unicode encodings may be useful, to decode existing file formats containing (most likely at a minimum) UTF-16. I don't feel strongly about that, though; we're stuck with UTF-16 as an internal representation in the platform, but that doesn't necessarily mean we need to support it as a transfer encoding.

For non-Unicode legacy encodings, I think that even if use cases exist, they should be given more than the usual amount of scrutiny before being supported.

On Tue, Mar 13, 2012 at 6:38 PM, Tab Atkins Jr. <jackalmage at gmail.com> wrote:

> Python throws errors by default, but both functions have an additional
> argument specifying an alternate strategy. In particular,
> bytes.decode can either drop the invalid bytes, replace them with a
> replacement char (which I agree should be U+FFFD), or replace them
> with XML entities; str.encode can choose to drop characters the
> encoding doesn't support.

Supporting throwing is okay if it's really wanted, but the default should be replacement. It reduces fatal errors to (usually) non-fatal replacement, for obscure cases that people generally don't test. It's a much more sane default failure mode.

As another option, never throw, but allow returning the number of conversion errors:

    results = encode("abc\uD800def", outputView, "UTF-8");

where results.inputConsumed is the number of words (UTF-16 code units) consumed from the input string, results.outputWritten is the number of UTF-8 bytes written, and results.errors is 1.

That also allows block-by-block conversion; for example, to convert as many complete characters as possible into a fixed-size buffer for transmission, then start again at the next unencoded character.

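To make the never-throwing variant concrete, here is a minimal sketch of what such an encode() might look like. The result field names (inputConsumed, outputWritten, errors) are the ones suggested above; everything else is an assumption for illustration: outputView is taken to be a Uint8Array, only "UTF-8" is handled, and an unsupported encoding name simply throws.

```javascript
// Hypothetical sketch, not a proposed implementation.
// Unpaired surrogates are replaced with U+FFFD and counted in `errors`.
function encode(str, outputView, encoding) {
  if (encoding !== "UTF-8") {
    throw new RangeError("this sketch only handles UTF-8");
  }
  let inputConsumed = 0;
  let outputWritten = 0;
  let errors = 0;

  while (inputConsumed < str.length) {
    const unit = str.charCodeAt(inputConsumed);
    let codePoint = unit;
    let unitsUsed = 1;
    let invalid = false;

    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: valid only if a low surrogate follows.
      const next = inputConsumed + 1 < str.length
        ? str.charCodeAt(inputConsumed + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        codePoint = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        unitsUsed = 2;
      } else {
        invalid = true;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      invalid = true; // low surrogate with no preceding high surrogate
    }
    if (invalid) codePoint = 0xFFFD;

    const size = codePoint < 0x80 ? 1
      : codePoint < 0x800 ? 2
      : codePoint < 0x10000 ? 3 : 4;
    // Stop on a character boundary if the next character does not fit.
    if (outputWritten + size > outputView.length) break;

    let o = outputWritten;
    if (size === 1) {
      outputView[o++] = codePoint;
    } else {
      const lead = [0, 0, 0xC0, 0xE0, 0xF0][size];
      outputView[o++] = lead | (codePoint >> (6 * (size - 1)));
      for (let shift = 6 * (size - 2); shift >= 0; shift -= 6) {
        outputView[o++] = 0x80 | ((codePoint >> shift) & 0x3F);
      }
    }

    if (invalid) errors++;
    inputConsumed += unitsUsed;
    outputWritten += size;
  }
  return { inputConsumed, outputWritten, errors };
}

// Block-by-block conversion into a fixed 4-byte buffer.
const input = "abc\uD800def";
const block = new Uint8Array(4);
const blocks = [];
let offset = 0;
while (offset < input.length) {
  const r = encode(input.slice(offset), block, "UTF-8");
  if (r.inputConsumed === 0) break; // buffer too small for the next character
  blocks.push(Array.from(block.subarray(0, r.outputWritten)));
  offset += r.inputConsumed;
}
```

Because the function stops before a character that does not fit, each block ends on a character boundary and the caller resumes from inputConsumed without tracking any partial state.
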
One more idea, while I'm brainstorming: if outputView is null, allocate an ArrayBuffer of the necessary size, storing it in results.output. That eliminates the need for a separate length pass, without bloating the API with another overload.

On Tue, Mar 13, 2012 at 6:50 PM, Joshua Bell <jsbell at chromium.org> wrote:

> (Cue a strong "nooooooo!" from Anne.)

(Count me in on that, too. Heuristics bad.)

> Ignoring the issue of invalid code points, the length calculations for
> non-UTF-8 encodings are trivial. (And with the suggestion that UTF-16 not
> be sanitized, that case is trivially 2x the JS string length.)

UTF-16 "sanitization" (replacing mismatched surrogates with U+FFFD) doesn't change the size of the output, actually.

--
Glenn Maynard
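
The remark that UTF-16 sanitization does not change the output size can be checked with a small sketch. The function name sanitizeUtf16 is hypothetical; the idea is only that each unpaired surrogate is one 16-bit code unit and U+FFFD is also one code unit, so the string length (and hence the 2x byte length) is unchanged.

```javascript
// Hypothetical helper: replace unpaired surrogates with U+FFFD,
// one code unit in, one code unit out.
function sanitizeUtf16(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed pair: copy both code units through untouched.
        out += str[i] + str[i + 1];
        i++;
        continue;
      }
    }
    out += (isHigh || isLow) ? "\uFFFD" : str[i];
  }
  return out;
}

const original = "abc\uD800def";
const sanitized = sanitizeUtf16(original);
// Same number of code units, so the UTF-16 byte length is 2x either way.
const sameLength = sanitized.length === original.length;
const utf16ByteLength = sanitized.length * 2;
```
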
Received on Tuesday, 13 March 2012 17:19:24 UTC