[whatwg/encoding] New TextDecoder.decode() API needed for Wasm multithreading (#172)

In conjunction with WebAssembly and multithreading, there is a need for a new `TextDecoder.decode()` API for converting, e.g., UTF-8 encoded strings in a typed array to a JavaScript string.

Currently, to convert a string in the WebAssembly heap to a JS string, one can do

```js
var textDecoder = new TextDecoder("utf8");
var wasmHeap = new Uint8Array(...); // coming from Wasm instantiation
var pointerToUtf8EncodedStringInHeap = 0x421341; // A UTF-8 encoded C string residing on the heap
// Scan forward to find the terminating null byte
var stringNullByteIndex = pointerToUtf8EncodedStringInHeap;
while (wasmHeap[stringNullByteIndex] != 0) ++stringNullByteIndex;
var jsString = textDecoder.decode(wasmHeap.subarray(pointerToUtf8EncodedStringInHeap, stringNullByteIndex));
```

There are three shortcomings with this API that are a problem for Emscripten/asm.js/WebAssembly use cases:

1. `TextDecoder.decode()` does not work with SharedArrayBuffer, so the above fails if `wasmHeap` views a SharedArrayBuffer in a multithreaded WebAssembly program.

2. `TextDecoder.decode()` needs a TypedArrayView, and it always converts the whole view. As a result, one has to call `wasmHeap.subarray()` on the large wasm heap to generate a small view that only encompasses the portion of the memory that contains the string.

3. The semantics of `TextDecoder.decode()` are to always convert the whole input view that is passed to the function. This means that if there is a null byte `\0` in the middle of the view, the generated JS string will have a null UTF-16 code unit in the middle of it. That is, decoding does not stop when the first null byte is found, but continues on from there. As a consequence, in order to use the API from a WebAssembly program that deals with null-terminated C strings, JavaScript or Wasm code must first scan the whole string to find the first null byte. This is harmful for performance when dealing with long strings. It would be better to have an API where the decode size that is specified is a `maxBytesToRead` style of size instead of an exact size. That way JS/WebAssembly would not need to pre-scan each string to find out how long it actually is, improving performance.
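
As a stopgap for shortcoming 1, code can copy the bytes out of shared memory before decoding. The following is a minimal sketch, not an official recipe: `decodeFromHeap` is a hypothetical helper name, and it relies on `TypedArray.prototype.slice()` allocating a fresh, non-shared `ArrayBuffer`, whereas `subarray()` keeps viewing the original buffer. The copy is exactly the kind of temporary garbage that a shared-memory-aware API would avoid.

```javascript
const utf8Decoder = new TextDecoder("utf-8");

// Hypothetical helper (not part of any spec): decode bytes [start, end) of a heap view.
function decodeFromHeap(heap, start, end) {
  const isShared =
    typeof SharedArrayBuffer !== "undefined" &&
    heap.buffer instanceof SharedArrayBuffer;
  // slice() copies into a new non-shared buffer; subarray() is a zero-copy view.
  const bytes = isShared ? heap.slice(start, end) : heap.subarray(start, end);
  return utf8Decoder.decode(bytes);
}
```

The caller still has to know `end` up front, so this does nothing for shortcomings 2 and 3.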

This kind of code often occurs in compiled C programs, which already provide max sizes of their buffers to guard against buffer overflows in C code. That is,

```c
char str[256] = ...;
UTF8ToString(str, sizeof(str)); // Convert C string to JS string, but provide a max cap for the buffer that cannot be exceeded
```

or 

```c
size_t len = 256;
char *str = malloc(len);
UTF8ToString(str, len);
```

Having to do an O(N) scan just to figure out a bound that guards against buffer overflow would not be ideal.

It would be good to have a new function on `TextDecoder`, e.g. 

```js
TextDecoder.decodeRange(ArrayBuffer|SharedArrayBuffer|TypedArrayView, startIndex, [optional: maxElementsToRead]);
```

which would allow reading from SharedArrayBuffers, take in a `startIndex` into the array, and optionally the max number of elements to read. This parallels what was done in WebGL 2 with the advent of WebAssembly and multithreading: all entry points in WebGL 2 dealing with typed arrays gained a new variant that takes in SharedArrayBuffers and does not produce temporary garbage: https://www.khronos.org/registry/webgl/specs/latest/2.0/#3.7
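
To make the intended semantics concrete, here is a rough user-land approximation. Everything in it is an assumption for illustration: `decodeRangeSketch` is a hypothetical name, the "stop at the first null byte or at the cap" behaviour is my reading of the proposal above, and unlike a native implementation this sketch still has to pre-scan for the terminator and copy out of shared memory.

```javascript
const decoder = new TextDecoder("utf-8");

// Hypothetical stand-in for the proposed decodeRange(); not an existing API.
function decodeRangeSketch(view, startIndex, maxElementsToRead) {
  const cap = maxElementsToRead === undefined
    ? view.length
    : Math.min(view.length, startIndex + maxElementsToRead);
  // Bounded scan: stop at the first null byte, or at the cap if none is found.
  let end = startIndex;
  while (end < cap && view[end] !== 0) ++end;
  const isShared =
    typeof SharedArrayBuffer !== "undefined" &&
    view.buffer instanceof SharedArrayBuffer;
  // Copy only when the memory is shared; otherwise use a zero-copy view.
  const bytes = isShared ? view.slice(startIndex, end) : view.subarray(startIndex, end);
  return decoder.decode(bytes);
}
```

A native `decodeRange()` could fold the terminator check into the decoding loop itself, which is where the performance gain would come from.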

Also, in the case of WebGL, all API entry points were retroactively re-specced to allow SharedArrayBuffers and views over them in addition to regular typed arrays and views. It would be nice for that to happen with `decode()` as well, although if so, it should probably happen at exactly the same time as the new `decodeRange()` function is added, so that code can feature test via the presence of `decodeRange()` whether the old `decode()` function has been improved or not.
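
Such a feature test could be as small as the following sketch. Note that `decodeRange` exists only as a proposal in this issue, and `hasDecodeRange` is a hypothetical helper name; the optional parameter is there purely so the check can be exercised against a stand-in object.

```javascript
// Hypothetical feature test: decodeRange() is only proposed here, not specified.
function hasDecodeRange(proto = TextDecoder.prototype) {
  return typeof proto.decodeRange === "function";
}

// If the two changes ship together, the result would also tell callers
// whether decode() accepts views over a SharedArrayBuffer.
const canDecodeSharedMemory = hasDecodeRange();
```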

With such a new `decodeRange()` function, the JS code at the very top would transform to

```js
var textDecoder = new TextDecoder("utf8");
var wasmHeap = new Uint8Array(...); // coming from Wasm instantiation
var pointerToUtf8EncodedStringInHeap = 0x421341; // A UTF-8 encoded C string residing on the heap
var jsString = textDecoder.decodeRange(wasmHeap, pointerToUtf8EncodedStringInHeap);
```

which would work with multithreading, improve performance, and reduce code size.

The reason why this is somewhat important is that with Emscripten and WebAssembly, marshalling strings across the wasm and JS language barrier is really common; it is something most Emscripten-compiled applications do. In the absence of multithreading-capable text marshalling, applications that need string marshalling have to resort to a manual JS-side implementation that loops and appends `String.fromCharCode()` results character by character:

https://github.com/emscripten-core/emscripten/blob/c2b3c49f71ab98fbd9ff829d6cbd30445b56a93e/src/runtime_strings.js#L98

It would be good for that code to be able to go away.
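
For readers who do not want to follow the link, the general shape of such a manual fallback looks roughly like this. It is a simplified sketch written for this discussion, not the Emscripten code itself: `utf8ToStringManual` is a hypothetical name, and the function assumes well-formed UTF-8 and performs no validation.

```javascript
// Simplified manual UTF-8 decoder; works on views over shared or non-shared memory.
function utf8ToStringManual(heap, idx, maxBytesToRead = Infinity) {
  const end = Math.min(heap.length, idx + maxBytesToRead);
  let str = "";
  while (idx < end) {
    const b0 = heap[idx++];
    if (b0 === 0) break; // C string terminator
    if (b0 < 0x80) { // 1-byte sequence (ASCII)
      str += String.fromCharCode(b0);
      continue;
    }
    const b1 = heap[idx++] & 0x3f;
    if ((b0 & 0xe0) === 0xc0) { // 2-byte sequence
      str += String.fromCharCode(((b0 & 0x1f) << 6) | b1);
      continue;
    }
    const b2 = heap[idx++] & 0x3f;
    let cp;
    if ((b0 & 0xf0) === 0xe0) { // 3-byte sequence
      cp = ((b0 & 0x0f) << 12) | (b1 << 6) | b2;
    } else { // 4-byte sequence
      cp = ((b0 & 0x07) << 18) | (b1 << 12) | (b2 << 6) | (heap[idx++] & 0x3f);
    }
    if (cp < 0x10000) {
      str += String.fromCharCode(cp);
    } else { // encode as a UTF-16 surrogate pair
      cp -= 0x10000;
      str += String.fromCharCode(0xd800 | (cp >> 10), 0xdc00 | (cp & 0x3ff));
    }
  }
  return str;
}
```

Building the string one code unit at a time like this is what makes the fallback slow compared to a native decoder.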

CC @kripken, @dschuff, @lars-t-hansen, @binji, @lukewagner, @titzer, @bnjbvr, @aheejin, who have been working on WebAssembly multithreading.


https://github.com/whatwg/encoding/issues/172

Received on Wednesday, 6 February 2019 21:10:56 UTC