Re: [whatwg/encoding] Fast byteLength() (Issue #333)

WebReflection left a comment (whatwg/encoding#333)

After 2 weeks of prototyping, benchmarks and research, I've noticed that every single library that wants to transform *JS* strings into *UTF-8* buffers resorts to some sort of workaround to avoid the native `TextEncoder` and `TextDecoder` APIs, as these are, in fact, "*deadly slow*" compared to home-made, possibly error-prone, convoluted, in-house *JS* solutions. Here are some examples:

  * **[cbor-x](https://github.com/kriszyp/cbor-x/blob/master/encode.js#L264-L448)** claims to be one of the fastest binary serializers, and you can see ~200 LOC (without comments) just to convert strings into UTF-8 compatible buffers ... all that bloat to avoid *TextEncoder*, plus hacks for NodeJS (where NodeJS, Bun and, I believe, Deno too each have their own fast way to convert JS strings to UTF-8) ... see the sketch after this list
  * **[MessagePack](https://www.appspector.com/blog/how-to-improve-messagepack-javascript-parsing-speed-by-2-6-times)** which is still way slower than *cbor-x* despite all the hidden tricks that are necessary, once again, to avoid native/builtin APIs due to their bad performance
  * **[msgpackr](https://github.com/kriszyp/msgpackr/blob/master/pack.js#L247-L350)** which at least copies and pastes some of that logic from *cbor-x*, but then again it's duplicated code due to the lack of a better primitive
  * **[avsc](https://github.com/mtth/avsc/blob/master/lib/utils.js#L433-L495)** dances around the conversion with all sorts of comments/hacks plus a preference for NodeJS *Buffer* utilities, which are indeed way faster (couldn't we have some of those on the Web too?)
  * **BSON** and others are just super slow compared to the previous libraries ... while **JSON** is not an answer because it still needs to be serialized as a buffer anyway + it's not extensible like other protocols are + it can't deal with circular/cyclic references
  * **structuredClone** does not provide a way to export a binary representation of its data ... NodeJS does that internally, and IIRC Bun and others can do that too, yet on the Web we have no way to create such a buffer and store it in a *SharedArrayBuffer* (as an example) + it's not extensible
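
To make the pattern concrete, this is roughly the shape of the hand-rolled encoder each of those libraries ends up shipping (a minimal sketch under my own assumptions, not their actual code), next to the builtin one-liner they are all trying to avoid:

```js
// Hand-rolled ASCII fast path + fallback: the shape of what cbor-x,
// msgpackr and avsc each re-implement (minimal sketch, not their code).
function writeUTF8(str, target, offset = 0) {
  let i = 0;
  for (; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code > 0x7f) break;        // non-ASCII: bail out of the fast path
    target[offset + i] = code;     // ASCII fast path: one byte per char
  }
  if (i < str.length) {
    // Slow path: the real libraries inline the surrogate-pair handling here,
    // which is where the ~200 LOC go; this sketch just defers to the platform.
    const { written } = new TextEncoder().encodeInto(
      str.slice(i),
      target.subarray(offset + i)
    );
    return i + written;
  }
  return i;
}

// ...versus the builtin these libraries work so hard to avoid:
const bytes = new TextEncoder().encode('some JS string');
```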

Accordingly, I wonder why it is that everyone wants to speak binary UTF-8 (file contents, page contents, post contents, Atomics and binary content) yet there is no fast way to do it at the `String.prototype.toUTF8Buffer()` level, so that all the repeated effort across these libraries could just disappear from the Web and we could also communicate better across programming languages.
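
Just to illustrate the shape of the primitive I have in mind (the name `toUTF8Buffer` is purely hypothetical, nothing like it is specified today):

```js
const str = 'déjà vu 👀';

// today: allocate-and-copy through TextEncoder
const viaEncoder = new TextEncoder().encode(str);   // Uint8Array

// desired: an engine-level fast path, conceptually something like
// const buffer = str.toUTF8Buffer();                // hypothetical, not specified
// const byteLength = buffer.byteLength;             // fast byteLength as a consequence
```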

Sure, libraries with extra capabilities (extensions to encode/decode accordingly) would still exist, most of them are based on RFC standards after all, but at least dropping the cumulative code and effort spent bypassing native APIs would help everyone: those libraries, their users, the network (smaller code) and developers (less to maintain, fewer surprises).

**In summary**: couldn't we escalate this `byteLength` issue further, so that to get the right length you have to create that buffer (length and data are not decoupled), *but* such a conversion exists with raw performance in mind, just like NodeJS, Bun or Deno do when converting UTF-16 strings to UTF-8? The `byteLength` story, at that point, would just be a nice consequence of such an API.
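
For reference, this is the kind of NodeJS fast path I mean; these `Buffer` calls exist today (and Bun ships the same API), they are just not available on the Web:

```js
import { Buffer } from 'node:buffer';

const str = 'déjà vu 👀';

// fast native UTF-16 → UTF-8 conversion
const buf = Buffer.from(str, 'utf8');
console.log(buf.byteLength);                 // UTF-8 byte length of the data

// byte length without materializing the buffer at all
console.log(Buffer.byteLength(str, 'utf8'));
```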

Alternatively: can anyone please provide some context around the fact that `TextEncoder` and `TextDecoder` are so slow that nobody wants to use them when performance in binary serialization matters? (see all the links already posted)

Thank you!
