UTF-8 BOM FAQ

Hi All

Apologies that the FAQ is in text.

I have slanted the answer differently following feedback.
- Following the conference call, I have not yet identified which browsers display the extra line when encountering the UTF-8 BOM.
- Re removing the BOM, we have found no problem re-opening the file (in Notepad, which is the only text editor I know that displays readable text, eg Persian).
- I feel strongly about including the UTF-8 table because it clarifies so much (for me anyway).

For the next three days, I will be out of the office.

For those on the conference call (& anyone else), a bit more localised information about UK Bonfire night and Guy Fawkes: http://www.bbc.co.uk/dna/h2g2/A199488.

Deborah

------------------------------------

FAQ: Unexpected blank lines or characters with UTF-8 encoding
question - background - answer - by the way - useful links

Question
When I'm using a UTF-8 encoding, why does an extra line appear at the top of my web page, and how do I remove it?

Answer
See the Background information.

This may be caused by the presence of a UTF-8 signature at the beginning of the file, which the user agent doesn't recognize. Note that a number of more recent browsers, such as the latest Windows-based versions of Internet Explorer, Mozilla (Netscape) and Opera, do not exhibit this behaviour.

You may not be able to see the cause of the extra line or space in your editor, if it interprets UTF-8 correctly. An editor which does not interpret UTF-8 correctly, displays the UTF-8 signature according to its own character encoding setting. With the Latin 1 ISO 8859-1 character encoding, the signature displays as extraneous characters 﫿. With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF. 

To remove the extra line or spaces that appear in the browser, remove the extraneous characters, which represent the UTF-8 signature. You can remove them manually or with a script. One of the benefits of using a script is that you can remove the extraneous characters from multiple files.

You should check thoroughly the result of removing the signature, bearing in mind that pages with a high proportion of Latin characters may look correct superficially, but that characters outside the ASCII compatibility range (U+0000 to U+007F) may be incorrectly encoded.

If there is no evidence of a UTF-8 signature at the beginning of the file, then your problem lies elsewhere.


Background

An editor that does not correctly interpret Unicode (encodings: UTF-8, UTF-16, UTF-32) recognises each byte as referring to one character (some editors may assume two-bytes per character); the character referred to by that byte value depends on the encoding assumed by the editor. An editor that does correctly interpret UTF-8 recognises that a character reference can require 1-4 bytes. In UTF-8 encoding, the number of bytes used to refer to a Unique Scalar Value in the Unicode repertoire is determined by the first byte.

	Unicode character				UTF-8 byte 1	UTF-8 byte 2	UTF-8 byte 3	UTF-8 byte 4

	0000 to 007F (ASCII)			01xxxxxx
	0080 to 07FF					110xxxxx			10xxxxxx
	0800 to FFFF					1110xxxx			10xxxxxx			10xxxxxx
	10000 to 10FFFF				11110xxx			10xxxxxx			10xxxxxx			10xxxxxx

All Unicode characters encoded in UTF-8, which fall outside the ASCII compatibility range (0 to 127 decimal), have byte values greater than 127 decimal, which explains why they display as 'strange' characters in non-UTF-8 compliant editors and browsers.

The UTF-8 signature is also known as the Unicode UTF-8 Byte Order Mark (BOM). UTF-8 is one encoding of the Unicode repertoire. Others include UTF-16 and UTF-32. All three encodings encode the same Unicode character repertoire, but they differ in the sequence of byte values which refer to that repertoire. 

- UTF-8 uses 1-4 bytes; the first byte of each encoded character determines how many subsequent bytes in the sequence are required.
- UTF-16 uses 2 or 4 bytes; if more than two bytes are required, then the first two bytes refer to a reserved value in the Unicode repertoire, which indicate that the next two bytes should be used to obtain the final value.
- UTF-32 always uses four bytes.

All three Unicode encodings (UTF-8, UTF-16, UTF-32) can use the signature or BOM (Byte Order Mark). Whilst with UTF-8, the BOM serves one purpose as a 'signature' of Unicode, for UTF-16 and UTF-32, the BOM has further purpose. That purpose is to indicate the order in which the bytes should be read. This order varies according to processor architecture, which can be 'big' or 'little' 'endian':

	- Macs - Motorola, PowerPC = big endian
	- PCs - Intel = little endian	
	- UNIX - different processors, therefore big or little endian
	
There are other Unicode encodings which do not require the presence of the BOM; these are "UTF-16LE", "UTF-16-BE", "UTF-32LE" and "UTF-32BE".

By the way
You will find that Windows Notepad and Helios Textpad will automatically add a UTF-8 signature to any file you save as UTF-8.

A UTF-8 signature at the beginning of a CSS file can sometimes cause the initial rules in the file to fail on certain user agents.

Cutting and pasting UTF-8 text between different applications can have unexpected results, even if both applications are nominally UTF-8 aware.

Useful links
Unicode FAQ about the Byte Order Mark: http://www.unicode.org/unicode/faq/utf_bom.html

Microsoft documentation about the Byte Order Markhttp://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp

Apache content negotiation documentation: http://httpd.apache.org/docs/content-negotiation.html



BBCi at http://www.bbc.co.uk/

This e-mail (and any attachments) is confidential and may contain
personal views which are not the views of the BBC unless specifically
stated.
If you have received it in error, please delete it from your system. 
Do not use, copy or disclose the information in any way nor act in
reliance on it and notify the sender immediately. Please note that the
BBC monitors e-mails sent or received. 
Further communication will signify your consent to this.

Received on Sunday, 16 November 2003 17:18:02 UTC