- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Thu, 15 Nov 2012 14:00:09 +0900
- To: Frédéric Kayser <f.kayser@free.fr>
- CC: ietf-http-wg@w3.org, James M Snell <jasnell@gmail.com>
Hello Frédéric,

On 2012/11/15 9:17, Frédéric Kayser wrote:
> Hi,
> could you precise some points regarding UTF-8 encoding ?
>
> "The next bit (E) indicates, when set, that the header field value contains UTF-8 encoded character content."
>
> - is a BOM allowed?

It does not make any sense at all as an encoding signature (we already know it's UTF-8). In the (hopefully very rare) cases where it's present, it should be taken as part of the data (i.e. a ZERO WIDTH NO-BREAK SPACE).

The "UTF-8 BOM" (it's actually not a BOM, because there's no need to distinguish byte orders in UTF-8) is used quite a bit as an encoding signature for whole files. For this use case, there are vastly differing opinions, from "extremely useful" (more on the Windows side) to "very bothering" (more on the Linux side). But for fields in protocols and formats, it's totally unnecessary. I don't know of anybody who's using or pushing it, and as far as I know, the danger that it slips in is fortunately extremely low.

> - are there restrictions concerning Unicode Normalizations Forms,

There shouldn't be, because in some cases, it may be necessary to allow both forms. Imagine a service that converts data from one of these forms to the other. On the other hand, recommending that NFC be used when there's a choice is a good idea (it will mostly happen even without this recommendation, but the recommendation will address questions like the ones in your email).

> NFC is used most of the time but NFD could lead to smaller compressed results

Do you have any data to back this up, or is this just a guess? NFC (Composed) is by definition always at least as short as NFD (Decomposed). It is probably not impossible to construct data for which NFD compresses better than NFC (*), but I'm having a very hard time imagining that NFD would compress better than NFC for a wide range of data.

> And since UTF-8 is used why stick to generic zlib/deflate for compression?
> UTF-8 encoding has some inherent characteristics http://en.wikipedia.org/wiki/UTF-8#Description

(see also http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf, a paper of mine from 1997 showing how this allows easy heuristic distinction between UTF-8 and other encodings)

> A compression algorithm aware of those would be more efficient than deflate, Huffman encoding in deflate is context unaware (order 0) contrary to PPM (Prediction by partial matching) based algorithms.
>
> By today standards zlib/deflate is totally outdated: search window limited to 32k Bytes (can you imagine how ridiculous this is when used in nowadays PNG files, look what Google did in WebP lossless), it's dog slow to compress/decompress compared to LZ4, compressed size is far from being on par with LZMA or even bzip2, OK running an LZMA decoder is probably not the best thing to do on power and memory limited smartphones, but I wouldn't mind moving away from zlib/deflate for something closer to the compressed size vs. compression time Pareto frontier.

Do you have any particular algorithm in mind? Ideally something that is well established, and works not only for UTF-8. Otherwise, we need two separate algorithms, one for UTF-8 and one for the rest of the data.

Regards, Martin.

P.S. (*): One line worth trying may be a language where there is a strong correlation between a vowel and the previous consonant, and also a strong correlation between the accent (e.g. tone mark) on the vowel and following consonants. Because NFD treats the vowel and the accent as separate characters, both of these correlations can be picked up independently. On the other hand, in the case of NFC, the vowel and the accent are just one precomposed character, which may make it more difficult to pick up the correlations quickly. But this is highly speculative, and depends quite a bit on the length of the input and the size of the tables used, ...
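The NFC versus NFD question above can be checked empirically rather than guessed at. The following Python sketch is only an illustration and not part of the original exchange: the function name `normalization_sizes` and the Vietnamese sample string are hypothetical choices, and zlib at level 9 merely stands in for whatever deflate settings a real implementation would use. It reports the size of the same text in both normalization forms, before and after compression.

```python
import unicodedata
import zlib


def normalization_sizes(text: str) -> dict:
    """Return code-point, UTF-8 byte, and deflate sizes for NFC and NFD.

    Hypothetical helper for illustration; the name is not from any spec.
    """
    sizes = {}
    for form in ("NFC", "NFD"):
        normalized = unicodedata.normalize(form, text)
        encoded = normalized.encode("utf-8")
        sizes[form] = {
            "code_points": len(normalized),
            "utf8_bytes": len(encoded),
            # zlib level 9 is an assumption; other levels may rank differently.
            "deflate_bytes": len(zlib.compress(encoded, 9)),
        }
    return sizes


# Assumed sample: a tone-marked language, as suggested in the P.S.
# Repeating one sentence is NOT representative of real header data.
SAMPLE = "Tiếng Việt là ngôn ngữ có thanh điệu. " * 20

if __name__ == "__main__":
    for form, size in normalization_sizes(SAMPLE).items():
        print(form, size)
```

For this sample the NFC form has fewer code points and fewer UTF-8 bytes than the NFD form. Which form compresses better depends on the input and on the compressor settings, so the compressed sizes should be measured on realistic data rather than inferred from this toy string.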
Received on Thursday, 15 November 2012 05:00:45 UTC