[whatwg] Encoding: lone surrogates and utf-8, utf-16be, and utf-16le encoders from Anne van Kesteren on 2013-09-04 (public-whatwg-archive@w3.org from September 2013)

From: Anne van Kesteren <annevk@annevk.nl>
Date: Wed, 4 Sep 2013 12:36:15 +0100
To: Joshua Bell <jsbell@google.com>, Jungshik Shin (신정식, 申政湜) <jungshik@google.com>, Masatoshi Kimura <VYV03354@nifty.ne.jp>, Yui NARUSE <yui.naruse@gmail.com>
Cc: WHATWG <whatwg@whatwg.org>
Message-ID: <CADnb78h1uDpShv=JFEbX9hh=A47vr6_5=qeTbyWuKOstmG9RJA@mail.gmail.com>

As currently written, the utf-8, utf-16be, and utf-16le encoders
accept code points (not code units). If a code point is in the
surrogate range, they raise an error.
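
To make that behavior concrete, here is a minimal sketch of a utf-8
encoder that takes code points and fails on surrogates. The function
name and the choice of ValueError are assumptions for illustration,
not the specification's wording.

```python
def utf8_encode_strict(code_points):
    """Encode an iterable of integer code points as UTF-8 bytes.

    Hypothetical helper: raises ValueError for any code point in the
    surrogate range (U+D800 to U+DFFF) or outside U+0000..U+10FFFF.
    """
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"not a code point: {cp!r}")
        if 0xD800 <= cp <= 0xDFFF:
            # This is the error case discussed above.
            raise ValueError(f"surrogate U+{cp:04X} cannot be encoded")
        if cp < 0x80:
            out.append(cp)                          # 1 byte
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))            # 2 bytes
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out.append(0xE0 | (cp >> 12))           # 3 bytes
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xF0 | (cp >> 18))           # 4 bytes
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)
```

Calling utf8_encode_strict([0x41, 0xD800]) raises, which is exactly
the failure a caller has to be prepared for.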

That seems problematic. Encoders for utf-8, utf-16be, and utf-16le
are assumed to be safe (that is, to never fail), because you
typically forget about lone surrogates.

The API deals with this by having the [EnsureUTF16] flag which
converts lone surrogates into U+FFFD. So by the time code points hit
the encoder they're no longer in the lone surrogate range.
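
A rough sketch of that conversion step, assuming the input is a list
of 16-bit code units. The function name is hypothetical and this is
not the text of the [EnsureUTF16] definition, only an illustration of
the idea: paired surrogates become one code point, unpaired ones
become U+FFFD.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def code_units_to_scalar_values(units):
    """Convert 16-bit code units to Unicode scalar values.

    Hypothetical helper: well-formed surrogate pairs are combined,
    lone surrogates are replaced with U+FFFD.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: only valid if a trail surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                trail = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no preceding lead.
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out
```

After this step no value in the output is in the surrogate range, so
an encoder that errors on surrogates is never handed one.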

Gecko, however, has not implemented this for utf-16be and utf-16le,
but has for utf-8. (Or maybe the utf-8 encoder is better.) For now
I'll assume this is a bug in Gecko.


I can see several options for potentially improving this setup, but I
need some feedback before going there:

1. Require Unicode scalar value input for encoders, and guarantee it
as decoder output.
2. Change the utf-8, utf-16be, and utf-16le encoders to emit the byte
sequence for U+FFFD rather than raise an error for input in the lone
surrogate range. This would simplify the API and other callers of the
utf-8, utf-16be, and utf-16le encoders, as they would no longer need
to worry about the encoders terminating with failure.
3. Move towards defining the entire platform in terms of 16-bit code
units and forget about the nicer theoretical model of Unicode scalar
values.
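
For illustration, option 2 might look roughly like this for the
utf-16be and utf-16le encoders. This is a sketch only: the function
name and the big_endian parameter are assumptions, and input is taken
to be integer code points in the range U+0000 to U+10FFFF.

```python
def utf16_encode_lenient(code_points, big_endian=True):
    """Encode code points as UTF-16 bytes without ever failing.

    Hypothetical helper for option 2: a code point in the surrogate
    range is encoded as U+FFFD instead of raising an error.
    """
    order = "big" if big_endian else "little"
    out = bytearray()
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            cp = 0xFFFD  # substitute instead of terminating with failure
        if cp < 0x10000:
            out += cp.to_bytes(2, order)
        else:
            # Supplementary code point: split into a surrogate pair.
            v = cp - 0x10000
            out += (0xD800 | (v >> 10)).to_bytes(2, order)
            out += (0xDC00 | (v & 0x3FF)).to_bytes(2, order)
    return bytes(out)
```

With this shape the caller needs no error path: every sequence of
code points yields some byte sequence, at the cost of silently
replacing lone surrogates.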


-- 
http://annevankesteren.nl/

Received on Wednesday, 4 September 2013 11:36:41 UTC