Re: [whatwg/encoding] TextEncoder#encode - write to existing Uint8Array (#69) from Henri Sivonen on 2018-11-01 (public-webapps-github@w3.org from November 2018)

From: Henri Sivonen <notifications@github.com>
Date: Thu, 01 Nov 2018 04:28:01 -0700
To: whatwg/encoding <encoding@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <whatwg/encoding/issues/69/435012394@github.com>

Sorry about the delay.

> I'd be unhappy to have to do that for all encodings, though, so let's be careful if we ever get a request for decoding to `ArrayBuffer`s.

After thinking about this more, I've come up with a way to fill output buffers as much as logically possible that adds complexity only in a wrapper for encoding_rs and not in the internals of the encoding_rs converters, so I withdraw my previous concern about filling a caller-provided output buffer as much as possible in the decode case. (Basically: speculatively try the last code units into a temporary buffer using a clone of the encoding_rs converter's internal state, discard the clone if it outputs too much, and promote the clone into the main converter state if the output still fits into the caller-provided buffer.)

So I'd be OK with us exposing an encoding_rs-like streaming API that, unlike the encoding_rs API, fills buffers as much as logically possible without splitting a Unicode scalar value.
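The clone-and-promote idea above can be sketched roughly as follows. This is a minimal illustration, not encoding_rs code: `ToyDecoder`, `Status` and `decode_filling` are hypothetical names, and the toy converter (UTF-16LE bytes to UTF-8, BMP only) merely imitates a converter that stops early because it wants worst-case output space before consuming input.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    InputEmpty,
    OutputFull,
}

/// Hypothetical toy converter: UTF-16LE bytes to UTF-8, BMP only.
/// Surrogate code units are simply replaced with U+FFFD.
#[derive(Clone, Debug, Default, PartialEq)]
struct ToyDecoder {
    /// First byte of a code unit whose second byte has not arrived yet.
    pending: Option<u8>,
}

impl ToyDecoder {
    /// Conservative inner converter: refuses to consume input unless the
    /// worst case (3 bytes of UTF-8) is free. Returns (status, read, written).
    fn decode(&mut self, src: &[u8], dst: &mut [u8]) -> (Status, usize, usize) {
        let (mut read, mut written) = (0, 0);
        while read < src.len() {
            if dst.len() - written < 3 {
                return (Status::OutputFull, read, written);
            }
            let b = src[read];
            read += 1;
            match self.pending.take() {
                None => self.pending = Some(b),
                Some(lo) => {
                    let unit = u16::from_le_bytes([lo, b]);
                    let c = char::from_u32(u32::from(unit)).unwrap_or('\u{FFFD}');
                    written += c.encode_utf8(&mut dst[written..]).len();
                }
            }
        }
        (Status::InputEmpty, read, written)
    }
}

/// Wrapper that fills `dst` as far as possible without splitting a scalar value.
fn decode_filling(dec: &mut ToyDecoder, src: &[u8], dst: &mut [u8]) -> (Status, usize, usize) {
    let (mut status, mut read, mut written) = dec.decode(src, dst);
    while status == Status::OutputFull {
        // Speculate on a clone of the state, writing into a temporary buffer.
        let mut trial = dec.clone();
        let mut tmp = [0u8; 3];
        let (mut trial_read, mut produced) = (0, 0);
        while produced == 0 && read + trial_read < src.len() {
            let i = read + trial_read;
            let (_, r, w) = trial.decode(&src[i..i + 1], &mut tmp);
            trial_read += r;
            produced = w;
        }
        if produced > dst.len() - written {
            break; // Too much output: discard the clone.
        }
        // The output fits: copy it over and promote the clone to the main state.
        dst[written..written + produced].copy_from_slice(&tmp[..produced]);
        written += produced;
        read += trial_read;
        *dec = trial;
        if read == src.len() {
            status = Status::InputEmpty;
        }
    }
    (status, read, written)
}
```

With a 4-byte output buffer and the input "aéa", the inner converter alone stops after "aé" with one byte still free, while the wrapper also fits the trailing "a". The inner converter is untouched; all the extra logic lives in the wrapper.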

The encoding_rs streaming API takes the converter state as `self`/`this` plus three other arguments: an input buffer to read from, an output buffer to write to, and a boolean indicating whether EOF occurs immediately after the input buffer is exhausted. It returns a status, how many code units were read, and how many code units were written. The status can be "input empty" or "output full".

If performing replacement, the return tuple also has a boolean indicating whether replacements were performed. When not performing replacement, the status has a third state that encapsulates the error (when encoding, the unmappable scalar value; when decoding, the length of the illegal byte sequence and how many bytes in the past the illegal sequence was). Identifying the erroneous byte sequence probably shouldn't be done on the Web, because other back ends will have a hard time getting it right.

This streaming API never allocates on the heap.

* [Decode to UTF-8 without replacement](https://docs.rs/encoding_rs/0.8.10/encoding_rs/struct.Decoder.html#method.decode_to_utf8_without_replacement)
* [Decode to UTF-8 with replacement](https://docs.rs/encoding_rs/0.8.10/encoding_rs/struct.Decoder.html#method.decode_to_utf8)
* [Decode to UTF-16 without replacement](https://docs.rs/encoding_rs/0.8.10/encoding_rs/struct.Decoder.html#method.decode_to_utf16_without_replacement)
* [Decode to UTF-16 with replacement](https://docs.rs/encoding_rs/0.8.10/encoding_rs/struct.Decoder.html#method.decode_to_utf16)
* [Encode from valid UTF-8 without replacement](https://docs.rs/encoding_rs/0.8.10/encoding_rs/struct.Encoder.html#method.encode_from_utf8_without_replacement)
* [Encode from valid UTF-8 with replacement](https://docs.rs/encoding_rs/0.8.10/encoding_rs/struct.Encoder.html#method.encode_from_utf8)
* [Encode from potentially-invalid UTF-16 with unpaired surrogates silently replaced with the REPLACEMENT CHARACTER without replacement of unmappables](https://docs.rs/encoding_rs/0.8.10/encoding_rs/struct.Encoder.html#method.encode_from_utf16_without_replacement)
* [Encode from potentially-invalid UTF-16 with unpaired surrogates silently replaced with the REPLACEMENT CHARACTER with replacement of unmappables](https://docs.rs/encoding_rs/0.8.10/encoding_rs/struct.Encoder.html#method.encode_from_utf16)
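To make the call shape described above concrete, here is a self-contained sketch of the two flavors. The types are hypothetical stand-ins that only mirror the shape of the linked methods (state as `self`, then input, output and `last`; a status plus read/written counts); `ToyAsciiDecoder` is an invented ASCII-only converter, and layering the replacing variant on top of the non-replacing one is an assumption made for the illustration.

```rust
#[derive(Debug, PartialEq)]
enum CoderResult {
    InputEmpty,
    OutputFull,
}

#[derive(Debug, PartialEq)]
enum DecoderResult {
    InputEmpty,
    OutputFull,
    /// (length of the illegal sequence, bytes consumed after it)
    Malformed(u8, u8),
}

/// Hypothetical ASCII-only decoder; bytes >= 0x80 are treated as malformed.
struct ToyAsciiDecoder;

impl ToyAsciiDecoder {
    /// Without replacement: returns (status, read, written).
    /// `_last` is unused because ASCII has no pending state at EOF.
    fn decode_without_replacement(
        &mut self,
        src: &[u8],
        dst: &mut [u8],
        _last: bool,
    ) -> (DecoderResult, usize, usize) {
        let (mut read, mut written) = (0, 0);
        while read < src.len() {
            // Conservative: keep room for a 3-byte U+FFFD before consuming input.
            if dst.len() - written < 3 {
                return (DecoderResult::OutputFull, read, written);
            }
            let b = src[read];
            read += 1;
            if b >= 0x80 {
                return (DecoderResult::Malformed(1, 0), read, written);
            }
            dst[written] = b;
            written += 1;
        }
        (DecoderResult::InputEmpty, read, written)
    }

    /// With replacement: returns (status, read, written, had_replacements).
    fn decode(&mut self, src: &[u8], dst: &mut [u8], last: bool) -> (CoderResult, usize, usize, bool) {
        let (mut read, mut written, mut replaced) = (0, 0, false);
        loop {
            let (result, r, w) =
                self.decode_without_replacement(&src[read..], &mut dst[written..], last);
            read += r;
            written += w;
            match result {
                DecoderResult::InputEmpty => {
                    return (CoderResult::InputEmpty, read, written, replaced)
                }
                DecoderResult::OutputFull => {
                    return (CoderResult::OutputFull, read, written, replaced)
                }
                DecoderResult::Malformed(_, _) => {
                    // Space for U+FFFD was reserved before the bad byte was read.
                    dst[written..written + 3].copy_from_slice("\u{FFFD}".as_bytes());
                    written += 3;
                    replaced = true;
                }
            }
        }
    }
}
```

Neither method allocates: all output goes into the caller's slice, and the caller decides what to do on "output full". Note that this toy, like the API it imitates, can report "output full" while a few bytes are still free.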

If providing an incremental API, experience with the uconv and encoding_rs APIs (which differ on this design point) strongly indicates that doing the EOF signaling the encoding_rs way is superior to either not signaling it (bogus API) or having separate API surface for it (additional API surface that can produce output).
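As a rough illustration of why the EOF flag matters, consider a toy UTF-16LE-to-UTF-8 decoder (BMP only) that can be left holding half a code unit. `ToyUtf16LeDecoder` and `Status` are hypothetical names for this sketch, not encoding_rs types; only the idea of passing `last` on the ordinary decode call comes from the description above.

```rust
#[derive(Debug, PartialEq)]
enum Status {
    InputEmpty,
    OutputFull,
}

/// Hypothetical toy decoder: UTF-16LE bytes to UTF-8, BMP only.
#[derive(Default)]
struct ToyUtf16LeDecoder {
    /// First byte of a code unit whose second byte has not arrived yet.
    pending: Option<u8>,
}

impl ToyUtf16LeDecoder {
    /// Returns (status, read, written, had_replacements). `last` tells the
    /// decoder that EOF follows immediately after `src`.
    fn decode(&mut self, src: &[u8], dst: &mut [u8], last: bool) -> (Status, usize, usize, bool) {
        let (mut read, mut written, mut replaced) = (0, 0, false);
        while read < src.len() {
            if dst.len() - written < 3 {
                return (Status::OutputFull, read, written, replaced);
            }
            let b = src[read];
            read += 1;
            if let Some(lo) = self.pending.take() {
                let unit = u16::from_le_bytes([lo, b]);
                match char::from_u32(u32::from(unit)) {
                    Some(c) => written += c.encode_utf8(&mut dst[written..]).len(),
                    None => {
                        // Surrogates are out of scope for this toy: replace them.
                        replaced = true;
                        written += '\u{FFFD}'.encode_utf8(&mut dst[written..]).len();
                    }
                }
            } else {
                self.pending = Some(b);
            }
        }
        if last && self.pending.is_some() {
            // A dangling byte at EOF is an error; it surfaces through the same call.
            if dst.len() - written < 3 {
                return (Status::OutputFull, read, written, replaced);
            }
            self.pending = None;
            replaced = true;
            written += '\u{FFFD}'.encode_utf8(&mut dst[written..]).len();
        }
        (Status::InputEmpty, read, written, replaced)
    }
}
```

If the caller never passes `last = true`, the dangling byte stays in the state and the truncation goes unreported. A separate finishing method would need its own output buffer and status handling, which is the extra API surface mentioned above.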

-- 
Reply to this email directly or view it on GitHub:
https://github.com/whatwg/encoding/issues/69#issuecomment-435012394

Received on Thursday, 1 November 2018 11:28:22 UTC