- From: Addison Phillips <addison@yahoo-inc.com>
- Date: Wed, 06 Dec 2006 11:11:15 -0800
- To: "McDonald, Ira" <imcdonald@sharplabs.com>
- CC: "'Richard Ishida'" <ishida@w3.org>, "'Chris Lilley'" <chris@w3.org>, www-validator@w3.org, www-international@w3.org

Ira wrote:

> (a) it's useless as a signature (a small fragment of
> UTF-8 can be reliably auto-detected without BOM);

Actually, that's not quite true. A UTF-8 detector can indeed check whether some block of bytes is valid as UTF-8 (or not). However, *small* runs of text in certain non-UTF-8 encodings (notably multibyte encodings such as EUC) can fool this test.

Over a run longer than a few characters, encodings that can mimic UTF-8 are statistically as unlikely to do so as any other encoding. However, fields such as a person's given or family name, in the languages commonly written in these encodings, can fool a UTF-8 detector with pretty high frequency if you blindly rely on the bit pattern, since the text tends to be short anyway.

Addison

--
Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.
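
A minimal sketch of the short-run problem, not taken from the original message: it assumes Python's built-in `utf-8` and `euc_kr` codecs, and both the helper name `looks_like_utf8` and the sample bytes are chosen purely for illustration. It shows a two-byte sequence that is one Hangul syllable in EUC-KR yet also passes a bit-pattern validity check as UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Bit-pattern test only: True if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Illustrative sample: 0xC7 is a valid UTF-8 two-byte lead byte and 0xA5 is a
# valid continuation byte, but the same pair is also a legal EUC-KR character.
sample = b"\xc7\xa5"

as_euc_kr = sample.decode("euc_kr")  # one Hangul syllable
as_utf8 = sample.decode("utf-8")     # U+01E5, an unrelated Latin letter

print(looks_like_utf8(sample))       # the validity test is fooled
print(repr(as_euc_kr), repr(as_utf8))

# Not every EUC-KR character collides: a lead byte in 0x80-0xBF is a bare
# continuation byte in UTF-8, so this pair is rejected by the same test.
print(looks_like_utf8(b"\xb0\xa1"))
```

With only one or two characters in a field, a collision like the first case is enough to make the whole field validate as UTF-8; each additional character makes an accidental pass less likely.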
Received on Wednesday, 6 December 2006 19:11:44 UTC