Re: Libraries assuming iso-8859-1 (was: Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis) from Poul-Henning Kamp on 2023-05-28 (ietf-http-wg@w3.org from April to June 2023)

From: Poul-Henning Kamp <phk@phk.freebsd.dk>
Date: Sun, 28 May 2023 07:28:11 +0000
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
cc: Mark Nottingham <mnot@mnot.net>, Roy Fielding <fielding@gbiv.com>, Tommy Pauly <tpauly@apple.com>, HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <202305280728.34S7SBAv092547@critter.freebsd.dk>

--------
Martin J. Dürst writes:

Adding base64 encoding to the table:

>                              Legacy  UTF-8   proposed  expansion  base64  b64expansion
> ASCII                        1       1       1         1          1.33    1.33
> Latin+Accents, e.g. Polish   1       ~1.5    ~2        2          ~2      ~2
> Arabic/Cyrillic/...          1       2       6         6          2.67    2.67
> Indic scripts,...            1       3       9         9          4       4
> Chinese/Japanese/...         2       3       9         4.5        4       2
>
> So some text in an Indic or South Asian script gets expanded by a factor
> of 9 when compared to a legacy single-byte encoding.

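Base64 does not penalize non-Western languages nearly as much: it adds
a flat 4/3 on top of the UTF-8 size, so the worst case against a legacy
encoding is a factor of 4 rather than 9.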
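
For anyone who wants to check the figures in the table, here is a minimal sketch. It assumes the "proposed" column means percent-encoding, where every UTF-8 byte outside printable ASCII (plus '%' and '"') is written as three bytes; that is my reading of the 3x ratio in the table, not something stated in this message. The function name encoded_sizes and the sample strings are made up for illustration.

```python
import base64


def encoded_sizes(text: str) -> dict:
    """Return the character count and the byte counts of three encodings."""
    utf8 = text.encode("utf-8")
    # Assumed model of the "proposed" column: printable ASCII stays as one
    # byte, everything else (plus '"' and '%') is written as %XX.
    percent = sum(
        1 if 0x20 <= b <= 0x7E and b not in (0x22, 0x25) else 3
        for b in utf8
    )
    # Base64 emits 4 output bytes per 3 input bytes, padded to a multiple of 4.
    b64 = len(base64.b64encode(utf8))
    return {"chars": len(text), "utf8": len(utf8), "percent": percent, "base64": b64}


# Illustrative samples only; each has a UTF-8 length divisible by 3 so that
# base64 padding does not blur the ratios.
SAMPLES = {
    "ASCII": "hello!",
    "Cyrillic": "Привет",
    "Devanagari": "नमस्ते",
    "Japanese": "日本語",
}

if __name__ == "__main__":
    print(f"{'script':<12}{'UTF-8':>8}{'percent':>9}{'base64':>8}")
    for label, text in SAMPLES.items():
        s = encoded_sizes(text)
        n = s["chars"]
        # Bytes per character; for a single-byte legacy encoding this is
        # also the expansion factor.
        print(f"{label:<12}{s['utf8'] / n:>8.2f}{s['percent'] / n:>9.2f}{s['base64'] / n:>8.2f}")
```

The per-character figures are the expansion relative to a single-byte legacy encoding. For the Japanese sample, divide by 2 to compare against a two-byte legacy encoding. Short strings whose UTF-8 length is not a multiple of 3 will come out slightly above 4/3 for base64 because of padding.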

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

Received on Sunday, 28 May 2023 07:28:24 UTC