Re: UTF-8 NFC vs NFD compression, French sample (was: Updated Binary Optimized Header Encoding Draft)

Hi Frederic,

On Thu, Nov 15, 2012 at 10:58:44AM +0100, Frédéric Kayser wrote:
> Hello Martin,
> I have a short French text sample here, it's a small extract from « Le tour du monde en quatre-vingts jours » ("Around the World in Eighty Days") by Jules Verne.
> 
> The bzip2 compressed version of the NFD encoded text is smaller by 4 bytes.
> Using gzip it looks like a draw but in fact the Deflate stream itself is 4 bits shorter.
> On the other hand, when using xz (lzma2), NFC gives a better result.
> 
> 2553 tdm80j-french-utf8-nfc.txt
> 2625 tdm80j-french-utf8-nfd.txt
> 
> 1312 tdm80j-french-utf8-nfc.txt.bz2
> 1308 tdm80j-french-utf8-nfd.txt.bz2
> 
> 1352 tdm80j-french-utf8-nfc.txt.gz
> 1352 tdm80j-french-utf8-nfd.txt.gz
> 
> defdb -s tdm80j-french-utf8-nfc.txt.gz
> 10671 bits
> 
> defdb -s tdm80j-french-utf8-nfd.txt.gz
> 10667 bits
> 
> Compressed files are enclosed in the zip archive attached to this email.
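
For anyone who wants to repeat this kind of comparison on their own text,
here is a rough sketch. It assumes Python 3 and its standard bz2, zlib and
lzma modules rather than the command-line tools used above, so the byte
counts will not match the figures quoted (zlib.compress emits a zlib
wrapper, not a gzip file). The sample string and the compressed_sizes
helper are made up for illustration; the string is not the Verne extract.

```python
import bz2
import lzma
import unicodedata
import zlib


def compressed_sizes(text):
    """Return raw and compressed byte counts for the NFC and NFD forms of text."""
    results = {}
    for form in ("NFC", "NFD"):
        # Normalize first, then encode, so both forms are compared as UTF-8 bytes.
        data = unicodedata.normalize(form, text).encode("utf-8")
        results[form] = {
            "raw": len(data),
            "bz2": len(bz2.compress(data, 9)),
            "deflate": len(zlib.compress(data, 9)),  # zlib wrapper, not gzip
            "xz": len(lzma.compress(data)),          # default .xz container
        }
    return results


# Placeholder French sentence (an assumption, not the attached sample).
sample = "Voilà un été très agréable : le café est prêt, la fenêtre est fermée."

if __name__ == "__main__":
    for form, sizes in compressed_sizes(sample).items():
        print(form, sizes)
```

On a text this short the container overhead dominates, so it is only
useful on a sample of at least a few kilobytes.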

Do not forget that the most important thing for HTTP is not the compression
ratio but the compression speed. If you need a whole datacenter to
compress 1000 streams, nobody will use it. If compression induces
delays, it will not be used either. If you look around, you'll see that
HTTP compression engines currently compress at gzip-1 to get the best
compression speed available for HTTP. And I agree with your comment in a
previous mail that gzip is totally outdated. I'd like to have much faster
compression algorithms such as LZ4, FastLZ, etc., which are 10-100 times
faster than gzip for around the same compression ratio as gzip-1.
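
To make the speed versus ratio trade-off concrete, here is a minimal
sketch, assuming Python 3 and its standard zlib module, that times Deflate
at a few compression levels. The measure helper and the payload are
hypothetical, and the payload is far more repetitive than real traffic, so
treat the output as an illustration of the method only. LZ4 and FastLZ are
left out because they are not in the standard library.

```python
import time
import zlib


def measure(data, level, repeats=20):
    """Return (compressed size, average seconds per call) for one zlib level."""
    start = time.perf_counter()
    for _ in range(repeats):
        out = zlib.compress(data, level)
    elapsed = (time.perf_counter() - start) / repeats
    return len(out), elapsed


# Made-up payload: a repeated request header block (an assumption).
payload = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: example.org\r\n"
    b"Accept-Encoding: gzip\r\n\r\n"
) * 2000

if __name__ == "__main__":
    for level in (1, 6, 9):
        size, secs = measure(payload, level)
        print(f"level {level}: {size} bytes, {secs * 1e6:.0f} us per call")
```

Timings depend heavily on the machine and the zlib build, so compare
levels only against each other on the same box.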

Cheers,
Willy

Received on Thursday, 15 November 2012 10:29:23 UTC