- From: Anne van Kesteren <annevk@annevk.nl>
- Date: Wed, 4 Sep 2013 12:36:15 +0100
- To: Joshua Bell <jsbell@google.com>, Jungshik Shin (신정식, 申政湜) <jungshik@google.com>, Masatoshi Kimura <VYV03354@nifty.ne.jp>, Yui NARUSE <yui.naruse@gmail.com>
- Cc: WHATWG <whatwg@whatwg.org>
The way the utf-8, utf-16be, and utf-16le encoders are written is that they accept code points (not code units). If the code points are in the surrogate range, they raise an error. That seems problematic: the utf-8, utf-16be, and utf-16le encoders are typically assumed to be safe, because one tends to forget about lone surrogates.

The API deals with this by having the [EnsureUTF16] flag, which converts lone surrogates into U+FFFD, so by the time code points reach the encoder they are no longer in the lone surrogate range. Gecko, however, has implemented this for utf-8 but not for utf-16be and utf-16le. (Or maybe its utf-8 encoder is better.) For now I'll assume this is a bug in Gecko.

I can see several options for potentially improving this setup, but I need some feedback before going there:

1. Require Unicode scalar value input for encoders, and guarantee it as decoder output.

2. Change the utf-8, utf-16be, and utf-16le encoders to emit the byte sequence for U+FFFD rather than raise an error for input in the lone surrogate range. This would simplify the API and other callers of these encoders, as they would no longer need to worry about the encoder terminating with failure.

3. Move towards defining the entire platform in terms of 16-bit code units and forget about the nicer theoretical model of Unicode scalar values.

--
http://annevankesteren.nl/
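A minimal sketch of the lone-surrogate handling discussed above, assuming a plain walk over 16-bit code units (TypeScript; `toScalarValues` is an illustrative name, not an API from the Encoding Standard): valid surrogate pairs are combined into code points, and lone surrogates become U+FFFD, so a downstream encoder only ever sees Unicode scalar values.

```ts
// Convert a sequence of 16-bit code units into Unicode scalar values,
// replacing lone surrogates with U+FFFD (the [EnsureUTF16] behavior
// the email describes). Illustrative sketch, not spec text.
function toScalarValues(units: Uint16Array): number[] {
  const out: number[] = [];
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    if (u >= 0xd800 && u <= 0xdbff) {
      // Lead surrogate: check whether a trail surrogate follows.
      const next = i + 1 < units.length ? units[i + 1] : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Valid pair: combine into a single supplementary code point.
        out.push(0x10000 + ((u - 0xd800) << 10) + (next - 0xdc00));
        i++;
      } else {
        out.push(0xfffd); // lone lead surrogate
      }
    } else if (u >= 0xdc00 && u <= 0xdfff) {
      out.push(0xfffd); // lone trail surrogate
    } else {
      out.push(u);
    }
  }
  return out;
}
```

With a step like this in front of the encoder, the utf-8, utf-16be, and utf-16le encoders would never see input in the lone surrogate range and so could never terminate with failure, which is the effect option 2 achieves by moving the replacement into the encoders themselves.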
Received on Wednesday, 4 September 2013 11:36:41 UTC