Re: [whatwg/encoding] New TextDecoder.decode() API needed for Wasm multithreading (#172)

> `TextDecoder.decode()` does not work with SharedArrayBuffer

At first glance, being able to use `TextDecoder.decode()` directly from wasm memory and `TextEncoder.encodeInto()` directly from and to wasm memory seems like something we should want even when wasm memory is a `SharedArrayBuffer`. However, I don't understand the Gecko implementation implications of `[AllowShared]`. What does `[AllowShared]` actually do, and what requirements does it impose on the DOM-side C++/Rust code that operates on the buffer?
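
Concretely, here's the current behavior (a minimal sketch; per WebIDL, buffer source arguments without `[AllowShared]` reject views backed by a `SharedArrayBuffer` with a `TypeError`):

```js
// Shared wasm memory is backed by a SharedArrayBuffer.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 1, shared: true });
const view = new Uint8Array(memory.buffer, 0, 16);

// Without [AllowShared] on the IDL argument, this throws a TypeError
// rather than decoding the bytes.
new TextDecoder().decode(view);
```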

> The semantics of `TextDecoder.decode()` are to always convert the whole input view that is passed to the function. This means that if there exists a null byte `\0` in the middle of the view, the generated JS string will have a null UTF-16 code point in it in the middle. I.e. decoding will not stop when the first null byte is found, but continues on from there.

The semantics of `TextDecoder.decode()` make sense for Rust strings in wasm and C++ `std::string` in wasm. Only C strings in wasm have this problem. In Gecko (and, last I checked, in Blink and WebKit, too, but it's better if Blink and WebKit developers confirm) the decoders are designed to work with inputs that have explicit length and don't have any scanning for a zero byte built into the conversion loop. (The Gecko-internal legacy APIs that appear to accept C strings actually first run `strlen()` and then run the conversion step with pointer and length.)
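
To illustrate the quoted semantics with a minimal example:

```js
// A zero byte in the middle of the input does not terminate decoding;
// it comes out as U+0000 in the resulting string.
const bytes = new Uint8Array([0x41, 0x00, 0x42]); // "A", NUL, "B"
const s = new TextDecoder().decode(bytes);
console.log(s.length);        // 3
console.log(s.charCodeAt(1)); // 0
```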

For non-UTF-8 encodings, I'm definitely not going to bake optional zero termination into the conversion loop, so there'd be a DOM-side prescan if the API got an option to look for a zero terminator. For UTF-8, I'm _very_ reluctant to add yet another UTF-8 to UTF-16 converter with zero termination baked into the conversion loop, but it might be possible to persuade me to add a separate UTF-8 to UTF-16 converter that interleaves zero-termination checking into the conversion if there's data to show that it would be a big win for Firefox's wasm story for wasm code that uses C strings. It _won't_ be baked in as a flag on the primary UTF-8 to UTF-16 converter, though. (On both SSE2 and aarch64, checking a 16-byte vector for ASCII is one vector instruction followed by one ALU comparison that can be branched on. Adding a check for a zero byte in the vector would complicate the hottest loop, which works fine for C++ and Rust callers as-is.)
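
As a rough JS-level analogue of what such a prescan amounts to (not the engine internals; `memory` and `ptr` are hypothetical, and this assumes non-shared memory for the moment):

```js
// Find the terminator first, then hand the converter an explicit length.
const heap = new Uint8Array(memory.buffer);
const end = heap.indexOf(0, ptr); // linear scan for the zero byte
if (end === -1) throw new Error("unterminated C string");
const str = new TextDecoder("windows-1252").decode(heap.subarray(ptr, end));
```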

I think C in wasm should run `strlen()` on the wasm side, hopefully with the wasm-to-native compiler using SSE4.2, if available, to vectorize the operation, and should surface a pointer (start index) and a length to the JS glue code layer, just like C++ (if using `std::string` or similar as opposed to C strings) and Rust would.
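
Something like this, as a hedged sketch of that glue code (`strlen` and `memory` are assumed exports of a hypothetical module):

```js
const utf8 = new TextDecoder("utf-8");

// Decode a C string at `ptr` in wasm linear memory. strlen() runs on the
// wasm side, so JS only ever sees an explicit (pointer, length) pair.
function cStringToJS(instance, ptr) {
  const len = instance.exports.strlen(ptr);
  const bytes = new Uint8Array(instance.exports.memory.buffer, ptr, len);
  return utf8.decode(bytes);
}
```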

> We'd need to make sure the decoding algorithms play nicely with it, and there is no ambiguity about a null in the middle of a multibyte sequence. (I'd assume this is designed into the encodings, but don't actually know.)

UTF-16BE and UTF-16LE are not compatible with C strings. The other encodings defined in the Encoding Standard are.
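
A small illustration of why: in the UTF-16 variants, every ASCII character encodes with a zero byte, so NUL-terminated scanning cannot delimit the string:

```js
// "AB" as UTF-16LE is 41 00 42 00; zero bytes occur inside valid text,
// so a C strlen() would report length 1 for this 4-byte string.
const utf16le = new Uint8Array([0x41, 0x00, 0x42, 0x00]);
console.log(new TextDecoder("utf-16le").decode(utf16le)); // "AB"
```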

https://github.com/whatwg/encoding/issues/172#issuecomment-462245138