Ira wrote:

> (a) it's useless as a signature (a small fragment of
> UTF-8 can be reliably auto-detected without BOM);

Actually, that's not quite true. A UTF-8 detector can indeed check whether a block of bytes is valid UTF-8. However, *small* runs of text in certain non-UTF-8 encodings (notably multibyte encodings such as EUC) can fool this test.

Encodings that can mimic UTF-8 are, like other encodings, statistically unlikely to do so over a run longer than a few characters. However, short fields such as a person's given or family name, in the languages commonly written in these encodings, can fool a UTF-8 detector with fairly high frequency if you rely blindly on the bit pattern, because the text is so short.

Addison

--
Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.

Received on Wednesday, 6 December 2006 19:11:47 GMT
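To illustrate the point made in the message above, here is a minimal Python sketch. It is not from the original post. It assumes EUC-KR as the example EUC encoding, Python's built-in strict `utf-8` decoder as the "detector", and a uniform spread over the Hangul code rows (real names are not uniformly distributed, so the ratio is only indicative). The helper names `is_valid_utf8` and `hangul_euc_kr_codes` are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Naive detector: accept the bytes if they decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def hangul_euc_kr_codes():
    """Yield every 2-byte EUC-KR code in the Hangul rows.

    Assumption: lead bytes 0xB0-0xC8 and trail bytes 0xA1-0xFE,
    i.e. the Hangul syllable block of KS X 1001.
    """
    for lead in range(0xB0, 0xC9):
        for trail in range(0xA1, 0xFF):
            yield bytes([lead, trail])


codes = list(hangul_euc_kr_codes())
# Single Hangul characters whose EUC-KR bytes also pass as valid UTF-8.
lookalikes = [code for code in codes if is_valid_utf8(code)]

# One concrete ambiguous byte pair: a Hangul syllable in EUC-KR,
# but also a well-formed two-byte UTF-8 sequence (U+0170).
sample = b"\xc5\xb0"

if __name__ == "__main__":
    ratio = len(lookalikes) / len(codes)
    print(f"{len(lookalikes)} of {len(codes)} single characters pass as UTF-8")
    for n in (1, 2, 3, 8):
        # Chance that n independent, uniformly chosen characters all pass.
        print(f"run of {n}: {ratio ** n:.6f}")
    print(repr(sample.decode("euc_kr")), "vs", repr(sample.decode("utf-8")))
```

Under these assumptions, 217 of the 2350 codes pass validation on their own, roughly 9%. That chance falls off quickly as the run gets longer, which matches the observation that short fields such as names are where a purely bit-pattern-based detector is most easily fooled.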