Re: Updated Binary Optimized Header Encoding Draft

Hello Frédéric,

On 2012/11/15 9:17, Frédéric Kayser wrote:
> Hi,
> could you clarify some points regarding UTF-8 encoding?
>
> "The next bit (E) indicates, when set, that the header field value contains UTF-8 encoded character content."
>
> - is a BOM allowed?

A BOM does not make any sense at all as an encoding signature here (we 
already know it's UTF-8). In the (hopefully very rare) cases where one is 
present, it should be taken as part of the data (i.e. a ZERO WIDTH 
NO-BREAK SPACE, U+FEFF).

The "UTF-8 BOM" (it's actually not a BOM, because there's no need to 
distinguish byte orders in UTF-8) is used quite a bit as an encoding 
signature for whole files. For this use case, there are vastly differing 
opinions, from "extremely useful" (more on the Windows side) to "very 
bothersome" (more on the Linux side).

But for fields in protocols and formats, it's totally unnecessary. I don't 
know of anybody who is using or pushing it, and as far as I know, the 
danger of it slipping in is fortunately extremely low.
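If one does slip in, a receiver would simply decode it along with the rest 
of the value (a minimal Python sketch; the field value bytes are made up):

    # Decode a UTF-8 header field value without stripping U+FEFF.
    # A leading EF BB BF is not treated as a signature here; it simply
    # decodes to U+FEFF (ZERO WIDTH NO-BREAK SPACE) and stays in the data.
    raw = b"\xef\xbb\xbfcaf\xc3\xa9"   # hypothetical field value bytes
    value = raw.decode("utf-8")        # note: "utf-8", not "utf-8-sig"
    assert value[0] == "\ufeff"        # the "BOM" survives as data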


> - are there restrictions concerning Unicode Normalization Forms,

There shouldn't be, because in some cases, it may be necessary to allow 
both forms. Imagine a service that converts data from one of these forms 
to the other.

On the other hand, recommending that NFC be used when there's a choice 
is a good idea (it will mostly happen even without this recommendation, 
but the recommendation will address questions like the ones in your email).
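Where an application does have the choice, applying that recommendation is 
essentially a one-liner (a minimal Python sketch; the value shown is 
hypothetical):

    import unicodedata

    # Normalize a header field value to NFC before encoding, when the
    # application is free to choose the form.
    value = "cafe\u0301"   # hypothetical input in NFD
    encoded = unicodedata.normalize("NFC", value).encode("utf-8")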


> NFC is used most of the time but NFD could lead to smaller compressed results

Do you have any data to back this up, or is this just a guess?

NFC (Composed) is by definition always at least as short as NFD 
(Decomposed). It is probably not impossible to construct data for which 
NFD compresses better than NFC (*), but I'm having a very hard time 
imagining that NFD would compress better than NFC for a wide range of data.
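If somebody wants to gather such data, a quick first experiment could look 
like the following (a Python sketch using only the standard library; the 
sample string is a placeholder, not real header data):

    import unicodedata, zlib

    # Compare UTF-8 length and deflate-compressed length of the same
    # text in NFC and NFD. Real header field values would be needed
    # for a meaningful comparison; this string is just a placeholder.
    sample = "Théâtre Übergrößenträger Đà Lạt " * 100

    for form in ("NFC", "NFD"):
        data = unicodedata.normalize(form, sample).encode("utf-8")
        print(form, "raw:", len(data),
              "deflated:", len(zlib.compress(data, 9)))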


> And since UTF-8 is used why stick to generic zlib/deflate for compression?
> UTF-8 encoding has some inherent characteristics http://en.wikipedia.org/wiki/UTF-8#Description

(see also http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf, a 
paper of mine from 1997 showing how this allows easy heuristic 
distinction between UTF-8 and other encodings)
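The gist of that heuristic, very roughly (this is a sketch of the idea, not 
the procedure from the paper), is that bytes which happen to be well-formed 
UTF-8 almost never come from another encoding:

    def looks_like_utf8(data: bytes) -> bool:
        # A simple validation pass is already a strong heuristic:
        # legacy-encoded text is very unlikely to also be well-formed UTF-8.
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False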

> A compression algorithm aware of those would be more efficient than deflate; the Huffman encoding in deflate is context-unaware (order 0), unlike PPM (Prediction by Partial Matching) based algorithms.
>
> By today's standards zlib/deflate is totally outdated: the search window is limited to 32 KB (can you imagine how ridiculous this is in today's PNG files? look at what Google did in WebP lossless), it's dog slow to compress/decompress compared to LZ4, and its compressed size is far from on par with LZMA or even bzip2. OK, running an LZMA decoder is probably not the best thing to do on power- and memory-limited smartphones, but I wouldn't mind moving away from zlib/deflate towards something closer to the compressed size vs. compression time Pareto frontier.

Do you have any particular algorithm in mind? Ideally something that is 
well established, and works not only for UTF-8. Otherwise, we need two 
separate algorithms, one for UTF-8 and one for the rest of the data.
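As a starting point for such a comparison, the codecs already in the Python 
standard library at least show the size trade-offs (a sketch; the header 
block is invented, and speed and memory use are not measured here):

    import zlib, bz2, lzma

    # Compare compressed sizes of an invented header block with three
    # widely available codecs. Size is only one axis; speed and memory
    # matter at least as much for header compression.
    headers = (
        b"GET /index.html HTTP/1.1\r\n"
        b"Host: example.com\r\n"
        b"Accept-Language: fr, ja;q=0.8\r\n"
        b"User-Agent: ExampleBrowser/1.0\r\n"
    ) * 20

    print("deflate:", len(zlib.compress(headers, 9)))
    print("bzip2:  ", len(bz2.compress(headers, 9)))
    print("lzma:   ", len(lzma.compress(headers)))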


Regards,   Martin.


P.S. (*): One line of investigation worth trying may be a language where 
there is a strong correlation between a vowel and the preceding consonant, 
and also a strong correlation between the accent (e.g. tone mark) on the 
vowel and the following consonants.

Because NFD treats the vowel and the accent as separate characters, 
both of these correlations can be picked up independently. On the other 
hand, in the case of NFC, the vowel and the accent are just one 
precomposed character, which may make it more difficult to pick up the 
correlations quickly.

But this is highly speculative, and depends quite a bit on the length of 
the input, the size of the tables used, and so on.

Received on Thursday, 15 November 2012 05:00:45 UTC