Re: Strange advice re BOM and UTF-8

Ira wrote:

 > (a) it's useless as a signature (a small fragment of
 >     UTF-8 can be reliably auto-detected without BOM);

Actually, that's not quite true. A UTF-8 detector can indeed check 
whether some block of bytes is valid UTF-8. However, *small* runs of 
text in certain non-UTF-8 encodings (notably multibyte encodings such 
as EUC) can fool this test. Over a run longer than a few characters, 
encodings that can mimic UTF-8 are statistically as unlikely to pass as 
any other encoding. But short fields, such as a person's given or 
family name in the languages commonly written in these encodings, can 
fool a UTF-8 detector with pretty high frequency if you rely blindly on 
the bit pattern, because the text is so short.
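
To make that concrete, here is a rough sketch in Python. The helper name 
looks_like_utf8 and the sample byte strings are my own illustrations, not 
taken from any particular detector, and EUC-KR is used only as one 
example of an EUC encoding.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (bit-pattern check only)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Illustrative sample: these two bytes are a single Hangul syllable in
# EUC-KR, and they are also a well-formed two-byte UTF-8 sequence (U+0123).
ambiguous = b"\xc4\xa3"

# A slightly longer EUC-KR run: the word "한글" (two syllables, four bytes).
longer = "한글".encode("euc_kr")

print(looks_like_utf8(ambiguous))  # True: the short run passes the check
print(looks_like_utf8(longer))     # False: the longer run is rejected
```

The check itself is sound. The trouble is only that a one- or 
two-character field gives it too few bytes to rule out the lookalikes.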

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.

Received on Wednesday, 6 December 2006 19:11:44 UTC