- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Thu, 07 Dec 2006 10:22:37 +0900
- To: Addison Phillips <addison@yahoo-inc.com>, "McDonald, Ira" <imcdonald@sharplabs.com>
- Cc: "'Richard Ishida'" <ishida@w3.org>, "'Chris Lilley'" <chris@w3.org>, www-validator@w3.org, www-international@w3.org
At 04:11 06/12/07, Addison Phillips wrote:
>
>Ira wrote:
>
> > (a) it's useless as a signature (a small fragment of
> > UTF-8 can be reliably auto-detected without BOM);
>
>Actually, that's not quite true. A UTF-8 detector can indeed validate
>whether some block of bytes would be valid as UTF-8 (or not). However,
>*small* runs of text using certain non-UTF-8 encodings (notably
>multibyte encodings such as EUC) can fool this test. Encodings that
>can mimic UTF-8 are as statistically unlikely to do so over a run
>longer than a few characters as other encodings. However, fields such
>as a person's given or family name in the languages commonly encoded
>by these encodings can fool a UTF-8 detector with pretty high
>frequency, if you blindly rely on the bit pattern, since the text
>tends to be short anyway.

Agreed. But then again, single fields don't usually occur as
independent files, and in other contexts (protocol fields,
database fields) you certainly don't want a BOM.
(see also http://unicode.org/unicode/faq/utf_bom.html#27)

For some more information on UTF-8 detection, see the presentation
that first brought up this idea, at
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
(there is a typo in the title, as well as some mistakes in the
byte patterns on page 5, and it doesn't take into account the
prohibition of overlong encodings and the reduction to the
20.5-bit codespace, because these weren't around at that time).

Regards,    Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
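
As a rough illustration of the short-field problem discussed above,
here is a minimal Python sketch. It is not taken from the presentation;
the function name looks_like_utf8 is made up for this example, and the
byte range 0xA1..0xFE is only an approximation of the EUC two-byte
range (not every pair in it is an assigned character). Strict decoding
in Python rejects overlong forms and anything beyond U+10FFFF, so it
reflects the current rules rather than the 1997-era ones.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 under strict decoding."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: both bytes of an EUC two-byte character lie in 0xA1..0xFE.
euc_range = range(0xA1, 0xFF)

# Count the byte pairs in that range that a validator would accept as UTF-8.
ambiguous = sum(
    looks_like_utf8(bytes([lead, trail]))
    for lead in euc_range
    for trail in euc_range
)
total = len(euc_range) ** 2

print(f"{ambiguous} of {total} pairs also validate as UTF-8")
```

Under these assumptions roughly one pair in ten passes the validity
check, which is why a one- or two-character field can easily be
misdetected, while the chance that every pair in a longer run passes
drops off quickly.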
Received on Thursday, 7 December 2006 03:55:42 UTC