Re: Strange advice re BOM and UTF-8

At 04:11 06/12/07, Addison Phillips wrote:
>
>Ira wrote:
>
> > (a) it's useless as a signature (a small fragment of
> >     UTF-8 can be reliably auto-detected without BOM);
>
>Actually, that's not quite true. A UTF-8 detector can indeed validate whether some block of bytes would be valid as UTF-8 (or not). However, *small* runs of text using certain non-UTF-8 encodings (notably multibyte encodings such as EUC) can fool this test. Encodings that can mimic UTF-8 are as statistically unlikely to do so over a run longer than a few characters as other encodings. However, fields such as a person's given or family name in the languages commonly encoded by these encodings can fool a UTF-8 detector with pretty high frequency, if you blindly rely on the bit pattern, since the text tends to be short anyway.

Agreed. But then again, single fields don't usually occur as independent
files, and in other contexts (protocol fields, database fields) you
certainly don't want a BOM.
(see also http://unicode.org/unicode/faq/utf_bom.html#27)
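
To make the short-field problem concrete, here is a minimal sketch in
Python. It assumes EUC-JP as the competing encoding and relies on
Python's built-in "utf-8" and "euc_jp" codecs; the function names are
made up for this illustration. It enumerates two-byte sequences that are
a legal EUC-JP character and, at the same time, a legal UTF-8 sequence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def euc_jp_lookalikes():
    """Yield two-byte sequences that decode both as EUC-JP and as UTF-8."""
    # UTF-8 two-byte lead bytes are C2..DF; continuation bytes are 80..BF.
    # EUC-JP two-byte characters use A1..FE for both bytes, so the overlap
    # for the second byte is A1..BF.
    for lead in range(0xC2, 0xE0):
        for trail in range(0xA1, 0xC0):
            pair = bytes([lead, trail])
            try:
                pair.decode("euc_jp")
            except UnicodeDecodeError:
                continue  # not an assigned EUC-JP character
            if looks_like_utf8(pair):
                yield pair


if __name__ == "__main__":
    pairs = list(euc_jp_lookalikes())
    print(len(pairs), "two-byte sequences are valid in both encodings")
    # A two-character "name" built from such pairs still passes the check.
    name = pairs[0] + pairs[-1]
    print(name.decode("euc_jp"), looks_like_utf8(name))
```

A field made up only of such characters passes a pure validity check, so
for very short strings the byte pattern alone is weak evidence.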

For some more information on UTF-8 detection, see the presentation that
first brought up this idea, at
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
(there is a typo in the title, as well as some mistakes in the
byte patterns on page 5, and it doesn't take into account the
prohibition of overlong encodings and reduction to the 20.5-bit
codespace (up to U+10FFFF), because these weren't around at that time).
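
For comparison, here is a rough sketch in Python of a byte-pattern check
that does take the later restrictions into account. It is not taken from
the presentation; the ranges follow the table of well-formed UTF-8 byte
sequences in the Unicode Standard, which also excludes surrogates, and
the function name is made up for this illustration.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Check bytes against the well-formed UTF-8 byte-sequence ranges."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # ASCII, one byte
            i += 1
            continue
        # need = number of trailing bytes; lo..hi = range allowed for the
        # first trailing byte (the remaining ones are always 80..BF).
        if 0xC2 <= b <= 0xDF:              # C0 and C1 would be overlong
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:                    # A0..BF excludes overlong forms
            need, lo, hi = 2, 0xA0, 0xBF
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:                    # 80..9F excludes surrogates
            need, lo, hi = 2, 0x80, 0x9F
        elif b == 0xF0:                    # 90..BF excludes overlong forms
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:                    # 80..8F caps at U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:                              # 80..C1 and F5..FF never lead
            return False
        if i + need >= n:                  # sequence is cut off
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True


if __name__ == "__main__":
    print(is_well_formed_utf8("Dürst".encode("utf-8")))   # valid text
    print(is_well_formed_utf8(b"\xC0\xAF"))               # overlong "/"
    print(is_well_formed_utf8(b"\xF4\x90\x80\x80"))       # above U+10FFFF
```

A stricter check like this rejects more non-UTF-8 input, but it does not
remove the short-field ambiguity discussed above.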


Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Thursday, 7 December 2006 03:55:42 UTC