- From: Tab Atkins Jr. <jackalmage@gmail.com>
- Date: Wed, 4 Sep 2013 11:38:44 -0700
- To: Anne van Kesteren <annevk@annevk.nl>
- Cc: Yui NARUSE <yui.naruse@gmail.com>, Masatoshi Kimura <VYV03354@nifty.ne.jp>, Jungshik Shin (신정식, 申政湜) <jungshik@google.com>, Joshua Bell <jsbell@google.com>, WHATWG <whatwg@whatwg.org>
On Wed, Sep 4, 2013 at 4:36 AM, Anne van Kesteren <annevk@annevk.nl> wrote:
> The way the utf-8, utf-16be, and utf-16le encoders are written is that
> they accept code points (not code units). If the code points are in
> the surrogate range, they raise an error.
>
> That seems problematic. Encoders for utf-8, utf-16be, and utf-16le
> are assumed to be safe, because you typically forget about lone
> surrogates.
>
> The API deals with this by having the [EnsureUTF16] flag which
> converts lone surrogates into U+FFFD. So by the time code points hit
> the encoder they're no longer in the lone surrogate range.
>
> Gecko however has not implemented this for utf-16be and utf-16le, but
> has for utf-8. (Or maybe the utf-8 encoder is better.) For now I'll
> assume this is a bug in Gecko.
>
>
> I can see several options for potentially improving this setup, but I
> need some feedback before going there:
>
> 1. Require Unicode scalar value input for encoders, and guarantee it
> as decoder output.
> 2. Change the utf-8, utf-16be, and utf-16le encoders to emit the byte
> sequence for U+FFFD rather than raise an error for input in the lone
> surrogate range. This would simplify the API and other callers of the
> utf-8, utf-16be, and utf-16le encoders, as they no longer need to worry
> about them terminating with failure.
> 3. Move towards defining the entire platform in terms of 16-bit code
> units and forget about the nicer theoretical model of Unicode scalar
> values.

I prefer option 2 - CSS is now defined to do the same thing when parsed (nulls, lone surrogates, and out-of-range code points are all converted to U+FFFD).

~TJ
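(Option 2 amounts to scrubbing lone surrogates to U+FFFD before handing code points to the encoder. A minimal Python sketch of that behavior for utf-8; the function name is illustrative, not from the spec:)

```python
def encode_utf8_scrubbed(s: str) -> bytes:
    # Replace any code point in the surrogate range (U+D800..U+DFFF)
    # with U+FFFD, then encode. After scrubbing, every code point is a
    # Unicode scalar value, so the encode step can no longer fail on
    # lone surrogates - the "never terminates with failure" property
    # option 2 asks for.
    scrubbed = ''.join(
        '\ufffd' if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in s
    )
    return scrubbed.encode('utf-8')

# A lone surrogate becomes the UTF-8 byte sequence for U+FFFD (EF BF BD):
# encode_utf8_scrubbed('a\ud800b') == b'a\xef\xbf\xbdb'
```

The same scrub would apply unchanged before the utf-16be and utf-16le encoders; only the final `.encode(...)` codec differs.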
Received on Wednesday, 4 September 2013 18:39:28 UTC