UTF-8 NFC vs NFD compression, French sample (was: Updated Binary Optimized Header Encoding Draft)

Hello Martin,
I have a short French text sample here: a small extract from « Le tour du monde en quatre-vingts jours » ("Around the World in Eighty Days") by Jules Verne.

The bzip2-compressed version of the NFD-encoded text is smaller by 4 bytes.
Using gzip, the file sizes are identical, so it looks like a draw, but the Deflate stream itself is in fact 4 bits shorter for NFD.
On the other hand, when using xz (LZMA2), NFC gives a better result.

2553 tdm80j-french-utf8-nfc.txt
2625 tdm80j-french-utf8-nfd.txt

1312 tdm80j-french-utf8-nfc.txt.bz2
1308 tdm80j-french-utf8-nfd.txt.bz2

1352 tdm80j-french-utf8-nfc.txt.gz
1352 tdm80j-french-utf8-nfd.txt.gz

defdb -s tdm80j-french-utf8-nfc.txt.gz
10671 bits

defdb -s tdm80j-french-utf8-nfd.txt.gz
10667 bits
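
For anyone who wants to repeat this kind of comparison on other text, here is a minimal Python sketch. It is not how the figures above were produced (those come from the bzip2, gzip and xz command-line tools and from defdb); it only uses Python's standard library, so the zlib figure carries a zlib wrapper instead of a gzip header and the byte counts may differ slightly from the ones listed. The function name compressed_sizes and the sample sentence are placeholders of mine, not part of the attached files.

```python
import bz2
import lzma
import unicodedata
import zlib


def compressed_sizes(text):
    """Return {form: {codec: size in bytes}} for the NFC and NFD forms of text."""
    sizes = {}
    for form in ("NFC", "NFD"):
        # Normalize first, then encode, so both variants hold the same text.
        data = unicodedata.normalize(form, text).encode("utf-8")
        sizes[form] = {
            "raw": len(data),
            "bz2": len(bz2.compress(data, 9)),
            "zlib": len(zlib.compress(data, 9)),      # Deflate with a zlib wrapper
            "xz": len(lzma.compress(data, preset=9)),  # .xz container, LZMA2
        }
    return sizes


if __name__ == "__main__":
    # Placeholder sentence, not the Jules Verne extract; replace it with
    # the contents of a real file for a meaningful comparison.
    sample = "Voilà un été très éloigné."
    for form, result in compressed_sizes(sample).items():
        print(form, result)
```

On a text as short as the placeholder the container overhead dominates, so the comparison only becomes meaningful on a few kilobytes or more.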

Compressed files are enclosed in the zip archive attached to this email.
Frédéric Kayser

On 15 Nov 2012, at 06:00, Martin J. Dürst wrote:

> Hello Frédéric,
> On 2012/11/15 9:17, Frédéric Kayser wrote:
>> NFC is used most of the time but NFD could lead to smaller compressed results
> Do you have any data to back this up, or is this just a guess?
> NFC (Composed) is by definition always at least as short as NFD (Decomposed). It is probably not impossible to construct data for which NFD compresses better than NFC (*), but I'm having a very hard time imagining that NFD would compress better than NFC for a wide range of data.


> P.S. (*): One line worth trying may be a language where there is a strong correlation between a vowel and the previous consonant, and also a strong correlation between the accent (e.g. tone mark) on the vowel and following consonants.
> Because NFD treats the vowel and the accent as separate characters, both of these correlations can be picked up independently. On the other hand, in the case of NFC, the vowel and the accent are just one precomposed character, which may make it more difficult to pick up the correlations quickly.
> But this is highly speculative, and depends quite a bit on the length of the input and the size of the tables used, ...

Received on Thursday, 15 November 2012 09:59:14 UTC