
Re: Strange advice re BOM and UTF-8

From: Addison Phillips <addison@yahoo-inc.com>
Date: Wed, 06 Dec 2006 11:11:15 -0800
Message-ID: <457715D3.1050407@yahoo-inc.com>
To: "McDonald, Ira" <imcdonald@sharplabs.com>
CC: "'Richard Ishida'" <ishida@w3.org>, "'Chris Lilley'" <chris@w3.org>, www-validator@w3.org, www-international@w3.org

Ira wrote:

 > (a) it's useless as a signature (a small fragment of
 >     UTF-8 can be reliably auto-detected without BOM);

Actually, that's not quite true. A UTF-8 detector can indeed validate 
whether some block of bytes is well-formed UTF-8 (or not). However, 
*small* runs of text in certain non-UTF-8 encodings (notably 
multibyte encodings such as EUC) can fool this test. Over a run longer 
than a few characters, encodings that can mimic UTF-8 are statistically 
as unlikely to pass the test as any other encoding. But short fields, 
such as a person's given or family name in the languages commonly 
written in these encodings, can fool a UTF-8 detector with pretty high 
frequency if you rely blindly on the bit pattern.
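
To make this concrete, here is a minimal Python sketch. It is only an illustration under assumptions: EUC-JP stands in for "EUC" generally, the helper names looks_like_utf8 and euc_jp_lookalikes are made up for this example, and a strict decode is used as the "bit pattern" test. It is not meant to describe how any particular detector works.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (a bit-pattern check only)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def euc_jp_lookalikes() -> list:
    """Collect single two-byte EUC-JP characters whose bytes are also valid UTF-8."""
    found = []
    for lead in range(0xA1, 0xFF):
        for trail in range(0xA1, 0xFF):
            raw = bytes([lead, trail])
            try:
                char = raw.decode("euc_jp")
            except UnicodeDecodeError:
                continue  # this byte pair is not assigned in EUC-JP
            if looks_like_utf8(raw):
                found.append(char)
    return found


# The same two bytes are one kanji in EUC-JP and "é" in UTF-8.
sample = b"\xc3\xa9"
print(looks_like_utf8(sample))        # does the bit-pattern check accept it?
print(len(sample.decode("euc_jp")))   # number of EUC-JP characters in the sample
print(len(euc_jp_lookalikes()))       # how many single characters pass the check
```

A one- or two-character field built from such characters passes the check even though it is not UTF-8. Each additional character has to land in the same narrow byte ranges, which is why longer runs rarely survive.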


Addison Phillips
Globalization Architect -- Yahoo! Inc.

Internationalization is an architecture.
It is not a feature.
Received on Wednesday, 6 December 2006 19:11:47 UTC
