Re: Strange advice re BOM and UTF-8

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Wed, 6 Dec 2006 17:33:50 +0200 (EET)
To: "Eric J. Bowman" <ericbowman@msbx.net>
cc: Chris Lilley <chris@w3.org>, www-validator@w3.org, www-international@w3.org
Message-ID: <Pine.GSO.4.64.0612061726190.9370@mustatilhi.cs.tut.fi>

On Wed, 6 Dec 2006, Eric J. Bowman wrote:

> The case in point is Macromedia HomeSite, which is still widely used by
> working web developers but is not Unicode compliant.  Opening and saving XML
> documents in HomeSite will lead to multiple BOMs -- the first one may be
> standards-compliant but the rest are unsightly!

"Multiple BOMs" is not an error, and doesn't even exist. The character 
U+FEFF is to be interpreted as BOM only at the start of a file or data 
stream. Otherwise, it has the semantics suggested by its Unicode name, 
ZERO WIDTH NO-BREAK SPACE. Such usage is not recommended in the standard;
we are supposed to use U+2060 WORD JOINER instead. (Here on Earth, 
however, U+FEFF seems to be better supported than U+2060.) Yet, such usage 
is standards-conforming, and conforming software must not simply remove 
"the second BOM" when it gets data that starts with U+FEFF U+FEFF. (It 
may make an informed decision to ignore the latter code point but only 
because it decides to ignore a leading zero-width no-break space.)
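
To illustrate the point, here is a minimal Python sketch of a decoder that treats only the very first U+FEFF as a BOM and leaves any following U+FEFF in the data. The function name decode_utf8_keep_inner_feff and the sample input are made up for this example; it is one possible reading of the rule above, not a reference implementation.

```python
BOM = "\ufeff"  # U+FEFF: a BOM only when it is the first code point


def decode_utf8_keep_inner_feff(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping at most one leading U+FEFF.

    Hypothetical helper: any later U+FEFF is kept, because there it is
    a zero width no-break space and part of the content.
    """
    text = data.decode("utf-8")
    if text.startswith(BOM):
        text = text[1:]  # remove the signature, and nothing more
    return text


# Sample data that starts with U+FEFF U+FEFF, as in the case discussed.
raw = (BOM + BOM + "<doc/>").encode("utf-8")
decoded = decode_utf8_keep_inner_feff(raw)
```

Python's built-in "utf-8-sig" codec behaves the same way on this input: it strips one leading signature and keeps the second U+FEFF as data.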

Of course, generating several U+FEFF at the start of a file is a bad idea 
and may confuse software that purports to support Unicode but doesn't.

Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Wednesday, 6 December 2006 15:34:08 UTC
