- From: Glenn Maynard <glenn@zewt.org>
- Date: Wed, 21 Mar 2012 08:54:35 -0500
On Wed, Mar 21, 2012 at 3:27 AM, Jonas Sicking <jonas at sicking.cc> wrote:

> 1) Create an API which forces consumers to do state handling. Probably
> leading to people creating wrappers which essentially implement option
> 3

It's not the same. Please look at how ISO-2022 works: the stream has
*long-lived* state, with escape sequences that change the meaning of
later code sequences in the stream. For example, you have to remember
whether GR is encoding G1, G2 or G3. This can't be stored merely by
remembering the next input byte you have to start at.

As Yui said, the sort of state UTF-8 has isn't what people mean when we
talk about "stateful encodings".

On Wed, Mar 21, 2012 at 3:34 AM, NARUSE, Yui <naruse at airemix.jp> wrote:

> For streaming conversion, it needs state even if the encoding is stateless.
> When the given partial input is finished at the middle of a character
> like "\xE3\x81\x82\xC2", the conversion consumes 4 bytes, output one
> character
> "\u3042", and remember the partial bytes "\xC2". This bytes is the state.

You don't need to do that. You can simply convert as many output
codepoints as can be *completely* converted. In this example, you'd
consume 3 bytes and output one codepoint. You don't consume data that
you can't immediately convert, so you don't have to buffer anything.

(We don't have to do it that way, of course; just pointing out that you
don't *need* special state for streaming encodings like UTF-8.)

> Anyway, they need error if the byte sequence is invalid for the encoding.

Errors were discussed previously: by default errors output U+FFFD (or
another replacement character, for encoding unsupported characters to
non-Unicode encodings), and we may have an option to turn it into an
exception.

-- 
Glenn Maynard
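
The "convert only what can be completely converted" approach described
in the message above can be sketched roughly as follows. This is a
minimal illustration in Python, not code from the thread: the helper
names complete_prefix_len and decode_available are hypothetical, and
replacing invalid bytes with U+FFFD is assumed to match the default
error handling mentioned above.

```python
from typing import Tuple


def complete_prefix_len(buf: bytes) -> int:
    """Length of the longest prefix of buf that does not end mid-character."""
    n = len(buf)
    # A UTF-8 sequence is at most 4 bytes, so the lead byte of an
    # unfinished trailing sequence is within the last 3 bytes.
    for back in range(1, min(3, n) + 1):
        b = buf[n - back]
        if b & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for its lead byte
        if b < 0x80 or b >= 0xF8:
            need = 1  # ASCII, or an invalid lead byte the decoder will replace
        elif b >= 0xF0:
            need = 4
        elif b >= 0xE0:
            need = 3
        else:
            need = 2
        # Hold the trailing sequence back only if it is still incomplete.
        return n - back if need > back else n
    return n


def decode_available(buf: bytes) -> Tuple[str, int]:
    """Decode every complete character in buf; return (text, bytes_consumed)."""
    used = complete_prefix_len(buf)
    # Invalid sequences become U+FFFD instead of raising (assumed default).
    return buf[:used].decode("utf-8", errors="replace"), used


if __name__ == "__main__":
    # The example from the thread: 3 bytes consumed, one codepoint output.
    print(decode_available(b"\xE3\x81\x82\xC2"))  # ('あ', 3)
```

In this sketch the decoder itself keeps nothing between calls: the
caller holds on to the unconsumed bytes (here "\xC2") and presents them
again at the start of the next chunk. Bytes still unconsumed when the
stream ends would be a truncated sequence and would need to be handled
as an error.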
Received on Wednesday, 21 March 2012 06:54:35 UTC