New FAQ: Removing UTF-8 BOM from Deborah Cawkwell on 2003-11-05 (public-i18n-geo@w3.org from November 2003)

From: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
Date: Wed, 5 Nov 2003 11:55:55 -0000
To: <public-i18n-geo@w3.org>
Message-ID: <418B7E44473AC34488C9E730D09FF3CF0127E9DE@bbcxue204.bu.bbc.co.uk>

Comments on the draft FAQ below are welcomed.
Thanks to Tex & Richard for previous comments/help, though they have not seen this latest version, so any errors are mine.

Thanks

Deborah

--------------------------------------------------------

QUESTION

When I'm using UTF-8 encoding, how do I remove the extra line that sometimes appears at the top of my server include or web page?

BACKGROUND

A particular sequence of bytes is inserted at the start of a file by some applications to indicate that the text contained in the file is Unicode. This sequence is known as a Byte Order Mark (BOM). Some applications - such as a text editor or a browser - will interpret the BOM as an extra line in the file and will render it accordingly; others will display it as the three characters ï»¿.

The BOM is the encoded Unicode Scalar Value (unique reference to a character defined within the Unicode repertoire) U+FEFF, corresponding to the Unicode character ZERO WIDTH NO-BREAK SPACE (ZWNBSP), whose sequence of byte values is judged to be unlikely to occur at the beginning of any 'normal' text file. Additionally, using U+FEFF for its 'original' purpose as a word joiner in new data is deprecated as of [Unicode3.2] in favour of U+2060 WORD JOINER. Therefore newer documents should only use U+FEFF as a BOM.

In UTF-16 and UTF-32 encodings, the BOM is needed to ensure correct interpretation of the file's contents when the byte order is not declared in some other way, because each code unit is stored as more than one byte (two bytes in UTF-16, four in UTF-32). The order in which these bytes are stored in the file is significant; the BOM indicates this order.
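
As a small illustration of why the byte order matters, the following Python sketch encodes the same character in both UTF-16 byte orders and shows what U+FEFF looks like in each. The sample character is an arbitrary choice made for this example; it does not come from the FAQ.

```python
import codecs

text = "A"  # arbitrary sample character (U+0041), chosen for illustration

# The same character serialised in the two possible byte orders.
big_endian = text.encode("utf-16-be")     # bytes 00 41
little_endian = text.encode("utf-16-le")  # bytes 41 00

# U+FEFF written in each byte order gives the two UTF-16 BOMs.
bom_be = "\ufeff".encode("utf-16-be")     # bytes FE FF
bom_le = "\ufeff".encode("utf-16-le")     # bytes FF FE

print(big_endian.hex(), little_endian.hex())
print(bom_be.hex(), bom_le.hex())

# A decoder that is given the generic "utf-16" label reads the BOM,
# picks the matching byte order, and drops the BOM from the result.
decoded = (bom_le + little_endian).decode("utf-16")
print(decoded)
```

Without the BOM (or an explicit label such as UTF-16LE), a reader has no reliable way to tell 00 41 from 41 00.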

In UTF-8 encodings, the presence of the BOM is not essential because, unlike UTF-16 or UTF-32, UTF-8 is defined as a sequence of single bytes, so there is no byte order to indicate. Each byte of the data contained in a UTF-8 file is also self-describing: its leading bits indicate whether it represents a whole character in itself, whether it begins a sequence that needs a number (1-3) of further bytes to derive an encoded character, or whether it is one of those further bytes.
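
The way each UTF-8 byte announces its own role can be sketched in Python as follows. The function name classify_utf8_byte is made up for this example, and the sketch only inspects the leading bits; it does not check the further validity rules of UTF-8 (for instance overlong forms).

```python
def classify_utf8_byte(byte: int) -> str:
    """Describe the role of one byte in a UTF-8 stream from its leading bits."""
    if (byte & 0b1000_0000) == 0b0000_0000:
        return "single-byte character"        # 0xxxxxxx
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "continuation byte"            # 10xxxxxx
    if (byte & 0b1110_0000) == 0b1100_0000:
        return "lead byte, 1 further byte"    # 110xxxxx
    if (byte & 0b1111_0000) == 0b1110_0000:
        return "lead byte, 2 further bytes"   # 1110xxxx
    if (byte & 0b1111_1000) == 0b1111_0000:
        return "lead byte, 3 further bytes"   # 11110xxx
    return "not used in UTF-8"

# Classify the three bytes that make up U+FEFF in UTF-8.
for b in "\ufeff".encode("utf-8"):
    print(f"{b:02X}: {classify_utf8_byte(b)}")
```

Applied to the UTF-8 form of U+FEFF, this reports EF as a lead byte followed by two continuation bytes, BB and BF.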

ANSWER

The HTTP header charset declaration, or HTML charset declaration (in the absence of the HTTP header charset declaration, which takes precedence), should normally be used to indicate the encoding, so the BOM is not needed for that purpose. Therefore, if your UTF-8 encoded web page displays an unwanted blank line at the top, and you have an editor that displays the BOM as visible characters as described above, you should remove from the beginning of the file the three characters displayed as ï»¿.

Alternatively, if you have a binary editor capable of displaying the hexadecimal byte values in the file, you should remove the three bytes at the beginning of the file whose values are displayed as EF BB BF.
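
If no binary editor is to hand, the same check can be sketched in a few lines of Python. The helper name has_utf8_bom and the sample data are invented for this example; bytes.hex() with a separator argument assumes Python 3.8 or later.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

# Invented sample: a BOM followed by the start of an HTML page.
sample = codecs.BOM_UTF8 + "<html>".encode("utf-8")

# Show the first three bytes the way a hex editor would.
first_three = sample[:3].hex(" ").upper()
print(first_three)
print(has_utf8_bom(sample))
```

To inspect a real file, the bytes would be read with open(path, "rb") rather than built in the script.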

Alternatively, you can use a Perl script to remove the characters. [Richard - what extra benefit do you gain from this, rather than simply deleting.]
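
A script is mainly of use when many files need the same treatment. The FAQ mentions Perl; the sketch below uses Python purely for illustration and is not the script referred to above. The function names are invented for this example, and it assumes the files are small enough to read into memory.

```python
import codecs
from pathlib import Path

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def strip_bom_from_file(path: str) -> bool:
    """Rewrite the file without its BOM. Return True if the file was changed."""
    file_path = Path(path)
    original = file_path.read_bytes()
    stripped = strip_utf8_bom(original)
    if stripped == original:
        return False  # no BOM found, leave the file untouched
    file_path.write_bytes(stripped)
    return True
```

Only a BOM at the very start of the file is removed; any U+FEFF later in the text is left alone. It would be prudent to keep a backup copy before running something like this over a set of files.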

REFERENCES

http://www.unicode.org/unicode/faq/utf_bom.html
http://www.dpawson.co.uk/xsl/sect2/N7702.html#d8771e16
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp
Unicode: A Primer, Tony Graham, M&T Books (IDG Books Worldwide), 2000
(http://www.mulberrytech.com/unicode/primer/)

Received on Wednesday, 5 November 2003 06:55:57 UTC