The (not so) great base-encoding debate of 2020 (was: Re: Question on use of base64 vs base64url in modern specifications) from Manu Sporny on 2020-04-28 (public-credentials@w3.org from April 2020)

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Tue, 28 Apr 2020 00:24:45 -0400
To: public-credentials@w3.org
Message-ID: <a4b1572e-3df4-550b-f5db-dabb24920990@digitalbazaar.com>
> Given that *none* of the options mentioned below (Base58 & its 
> variants, Bech32, multihash, etc.) are standardized by any
> recognized SDO –  nor are any of them even on an active standards
> track - why would you use them?

For the same reason you use any technology before it becomes a standard:

It's measurably better than the status quo, there are key communities
adopting the technology, and there is a group of people who are
committed to making it a standard. :)

Let's look at some data, which I generated based on the discussion in
this thread. The data below shows what a base64, base64url, base58, and
bech32 encoding of a value looks like for random byte values of 4, 8,
16, and 32 bytes. They are, in general, in ascending order by size. Each
line specifies how much bigger the encoding is based on the baseline
size. Each grouping has an associated analysis, because this isn't just
about human readability, it's also about developer copyability,
filesystem filename encoding, and encoding size. With that in mind,
let's begin...

In general, these things hold true for all of the tests:

* You cannot double-click to select and copy base64 and base64url
   values, something developers need to do often, which makes them
   bad choices for DIDs. I know that I copy/paste DIDs while
   developing quite a bit and am always annoyed by the DID Methods
   that make this difficult.
* Base64 is unsafe for filenames, and DIDs are often written to
   filenames.
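
The two points above can be checked with a rough sketch. It assumes
that double-click selection grabs one run of "word" characters
(letters, digits, underscore) and stops at anything else, which is how
many terminals and editors behave but is not universal. The helper
names are hypothetical; the sample strings are the 32-byte values from
the data in this message.

```python
import base64
import re

# Assumption: a double click selects a run of word characters only.
WORD = re.compile(r"\w+", re.ASCII)

def double_click_safe(text: str) -> bool:
    """True if the whole string is a single run of word characters."""
    return WORD.fullmatch(text) is not None

def filename_safe(text: str) -> bool:
    """True if the string contains no path separator."""
    return "/" not in text

samples = {
    "base64url": "i1kbaCq6eZEYWqCKLzL3Aafv-pegrR-O1y3sRJLKd14",
    "base58": "ANxUehLobX2wPMyyiZp834KgvZXvg7hHiBK6GeZvgG1T",
    "base64": "i1kbaCq6eZEYWqCKLzL3Aafv+pegrR+O1y3sRJLKd14=",
}
for name, value in samples.items():
    print(f"{name:9}: double-click safe = {double_click_safe(value)}")

# Standard base64 can emit "/", which base64url replaces with "_".
worst = b"\xff\xff\xff"
print(base64.b64encode(worst).decode(), filename_safe(base64.b64encode(worst).decode()))
print(base64.urlsafe_b64encode(worst).decode())
```

Only the base58 sample passes the double-click check. The base64url
sample trips on "-" and the base64 sample on "+" and "=".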

4 random bytes
base64url:  Fd-j-A baseline
base58   :  ZRrnb 17% smaller
base64   :  Fd+j+A== 33% larger
bech32   :  1zh0687q7xwhau 133% larger

One of the first things that pops out above is that base58 encoding is
actually *more efficient* than base64 (because of base64 padding), and
even more efficient than base64url without padding (because base58 has
some nice bit packing characteristics for small values).
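
The 4-byte sample can be reproduced with a short sketch. It assumes
the Bitcoin base58 alphabet, which the message does not name
explicitly, and the `b58encode` helper is illustrative rather than any
particular library's API.

```python
import base64

# Assumption: the Bitcoin base58 alphabet (no 0, O, I, or l).
B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(data: bytes) -> str:
    """Encode bytes as base58; each leading zero byte becomes a '1'."""
    num = int.from_bytes(data, "big")
    out = ""
    while num > 0:
        num, rem = divmod(num, 58)
        out = B58[rem] + out
    zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * zeros + out

# Decode the base64url sample, restoring the stripped padding first.
raw = base64.urlsafe_b64decode("Fd-j-A" + "==")
print(raw.hex())                             # 15dfa3f8
print(b58encode(raw))                        # ZRrnb
print(base64.b64encode(raw).decode())        # Fd+j+A==

# Base58 length depends on the value: the largest 4-byte value needs 6.
print(len(b58encode(b"\xff\xff\xff\xff")))   # 6
```

Note that the 5-character result holds for this particular value. A
4-byte value of 58^5 (656,356,768) or more takes 6 base58 characters,
the same as unpadded base64url.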

8 random bytes
base64url:  cbaupa7qfVo baseline
base58   :  L2AXzqFbepH 0% larger
base64   :  cbaupa7qfVo= 9% larger
bech32   :  1wxm2afdwaf745vh2ud8 81% larger

For 8-byte values, base64url and base58 are equivalent from a storage
efficiency standpoint.

16 random bytes
base64url:  CyTZwJimleWCJxlmaMNvJw baseline
base58   :  2NpD3dQYuV6ZaxMCDzsq4S 0% larger
base64   :  CyTZwJimleWCJxlmaMNvJw== 9% larger
bech32   :  1pvjdnsyc5627tq38r9nx3sm0yu866x99 50% larger

For 16-byte values, the storage efficiency still holds for base58,
making it equivalent in size to base64url. Note that base58 will always
use unambiguous characters, but more importantly, it will always be
copy-pasteable, whereas base64url will be copyable sometimes, and
other times a double click will result in a bad copy/paste (because of
a breaking character in the base64url value). This has bitten me many
times while copy-pasting an AWS client secret: scripts fail and minutes
(sometimes hours) are wasted because of a base64url encoding issue. It
has been a constant source of frustration over the years.

32 random bytes
base64url:  i1kbaCq6eZEYWqCKLzL3Aafv-pegrR-O1y3sRJLKd14 baseline
base58   :  ANxUehLobX2wPMyyiZp834KgvZXvg7hHiBK6GeZvgG1T 2% larger
base64   :  i1kbaCq6eZEYWqCKLzL3Aafv+pegrR+O1y3sRJLKd14= 2% larger
bech32   :  13dv3k6p2hfuezxz65z9z7vhhqxn7l75h5zk3lrkh9hkyfyk2wa0qpd3upn 37% larger

The "advantage" of base64url starts to shine through once we hit 32
bytes, with a 2% encoding benefit over base58... which is the trade off
for an inconsistently copyable string of characters that developers find
themselves copying often during development.
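
The size relationship across all four groups can be sketched by
computing the longest possible encoding for each input size. The
function names are hypothetical, and leading zero bytes are ignored
for simplicity.

```python
import math

def base64url_len(n: int) -> int:
    """Characters needed for n bytes in unpadded base64url: ceil(4n/3)."""
    return -(-4 * n // 3)

def base58_max_len(n: int) -> int:
    """Smallest k with 58**k >= 256**n: the longest base58 string for n bytes."""
    k = 0
    while 58 ** k < 256 ** n:
        k += 1
    return k

for n in (4, 8, 16, 32):
    print(f"{n:2} bytes: base64url {base64url_len(n)}, base58 up to {base58_max_len(n)}")

# For long inputs the base58/base64url size ratio tends to log(64)/log(58).
print(round(math.log(64) / math.log(58), 3))   # 1.024
```

This gives 6, 11, 22, and 44 characters as the base58 upper bound
against 6, 11, 22, and 43 for base64url, so the gap stays in the low
single-digit percent range as values grow.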

As for the benefits of bech32, I honestly don't see it... yes, there is
error correction, but once you get to 32 bytes, you've added close to
40% overhead... doesn't seem worth it to me unless you know a human
being is going to be reading the value and something bad is going to
happen if they get it wrong (payment going to wrong address, for example).
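
The bech32 overhead follows from its layout, which a small sketch can
make concrete. It assumes an empty human-readable part, as the sample
strings (which start directly with the "1" separator) suggest, plus
5 bits per data character and a 6-character checksum. The function
names are hypothetical.

```python
def bech32_len(n: int, hrp: str = "") -> int:
    """Length of a bech32 string for n bytes: hrp + '1' + data + checksum."""
    data_chars = -(-8 * n // 5)   # ceil(8n/5), 5 bits per character
    return len(hrp) + 1 + data_chars + 6

def base64url_len(n: int) -> int:
    """Characters needed for n bytes in unpadded base64url: ceil(4n/3)."""
    return -(-4 * n // 3)

for n in (4, 8, 16, 32):
    overhead = bech32_len(n) / base64url_len(n) - 1
    print(f"{n:2} bytes: {bech32_len(n)} chars, {overhead:.1%} over base64url")
```

The computed lengths (14, 20, 33, and 59) match the sample strings.
The fixed 7 characters of separator and checksum dominate for small
values, and even for very long values the 5-bit alphabet keeps bech32
about 20% larger than base64url.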

So, the priorities that I've heard most often are:

1. Ease of copy/paste for developers.
2. Encodes directly as a filename on a file system.
3. Size efficiency.
4. Human readability.

Is this an esoteric discussion? Absolutely... but it goes to the heart
of why developers feel strongly about this particular choice. They live
and breathe how this stuff is encoded and it has a direct impact on their
productivity and the correctness of the programs that they write and run.

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
Founder/CEO - Digital Bazaar, Inc.
blog: Veres One Decentralized Identifier Blockchain Launches
https://tinyurl.com/veres-one-launches
Received on Tuesday, 28 April 2020 04:25:02 UTC