Re: New FAQ: Removing UTF-8 BOM from Tex Texin on 2003-11-05 (public-i18n-geo@w3.org from November 2003)

From: Tex Texin <tex@i18nguy.com>
Date: Wed, 05 Nov 2003 10:29:37 -0500
To: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
Cc: public-i18n-geo@w3.org
Message-ID: <3FA91761.597B4523@i18nguy.com>
My suggestions are embedded below, marked with "*tex-":

QUESTION

When I'm using UTF-8 encoding, how do I remove the extra line that sometimes
appears at the top of my server include or web page?

*tex- perhaps:
How do I remove the erroneous characters or lines that are sometimes shown at
the top of my utf-8 encoded files?

Do we want to focus on removing them, or just on explaining why they might
appear?

BACKGROUND

A particular combination of characters

*tex- bytes not chars. also sequence not combination.

is inserted by some applications to indicate that the text contained in the
file is Unicode. 

*tex- add "utf-8" after unicode.

This combination of characters 

*tex- bytes not chars

is known as a Byte Order Mark (BOM). Some applications - such as a text editor
or a browser - will interpret the BOM as an extra line in the file and will
render it accordingly,

*tex- delete "and will render it accordingly"
 
others will display ï»¿.

*tex- use a graphic and say how it might look if treated as iso 8859-1. maybe
add some other graphics for other encodings?
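
Until there is a graphic, a short Python sketch can show what the three BOM
bytes turn into when a tool assumes the wrong encoding. The list of encodings
below is an assumption chosen for illustration, not something taken from the
draft.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Hypothetical list of legacy encodings a tool might wrongly assume.
assumed_encodings = ["iso-8859-1", "cp1252", "cp437"]

# Decode the same three bytes under each wrong assumption.
renderings = {name: bom_bytes.decode(name) for name in assumed_encodings}

for name, text in renderings.items():
    # ascii() keeps the output printable on any terminal.
    print(name, ascii(text))
```

Under ISO 8859-1 the three bytes come out as the three characters ï»¿; other
single-byte code pages map the same bytes to different glyphs.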


The BOM is the encoded form of the Unicode scalar value U+FEFF (a scalar
value is the unique number assigned to a character in the Unicode
repertoire). U+FEFF is the character ZERO WIDTH NO-BREAK SPACE (ZWNBSP),
whose byte sequence is judged unlikely to occur at the beginning of any
'normal' text file. Additionally, using U+FEFF for its 'original' purpose as
a word joiner in new data is deprecated as of [Unicode3.2] in favour of
U+2060 WORD JOINER. Newer documents should therefore use U+FEFF only as a
BOM.

In UTF-16 and UTF-32 encodings, the BOM is essential to ensure correct
interpretation of the file's contents, because each code unit in the file is
made up of two (UTF-16) or four (UTF-32) bytes of data. The order in which
these bytes are stored in the file is significant; the BOM indicates this
order.

*tex- this is all good, but I wonder if it belongs in the background or perhaps
moved to a separate faq "what's a bom?" 

In UTF-8 encodings, the presence of the BOM is not essential because, unlike
the UTF-16 or UTF-32 encodings, each byte of the data contained in a UTF-8
file is 'atomic'. This means it contains meta data which indicates whether it
represents a whole character in itself, or whether it needs to be combined
with a number (1-3) of further bytes to derive an encoded character.

 
*tex- this is wrong. The BOM is not needed in UTF-8 because its code units are
single bytes, so there is no question of how to order the bytes. UTF-16 and
UTF-32 use code units larger than 1 byte, which are written in a different
order on big-endian and little-endian platforms, so they need an indicator of
which way they were written.

*tex- The meta data indicating the number of bytes per character is irrelevant
to this question.
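
The byte-order point can be illustrated with a minimal Python sketch. It
encodes U+FEFF and the letter "A" in each encoding form; the variable names
are hypothetical, and the codec names are the ones Python's standard library
provides.

```python
bom = "\ufeff"

# UTF-16 and UTF-32: the same code point yields different byte orders.
utf16_be = bom.encode("utf-16-be")  # b'\xfe\xff'
utf16_le = bom.encode("utf-16-le")  # b'\xff\xfe'
utf32_be = bom.encode("utf-32-be")  # b'\x00\x00\xfe\xff'
utf32_le = bom.encode("utf-32-le")  # b'\xff\xfe\x00\x00'

# UTF-8: code units are single bytes, so there is only one possible order.
utf8 = bom.encode("utf-8")          # b'\xef\xbb\xbf'

# An ordinary character shows the same effect: the bytes of "A" swap places
# between the two UTF-16 byte orders.
a_be = "A".encode("utf-16-be")      # b'\x00A'
a_le = "A".encode("utf-16-le")      # b'A\x00'
```

A reader that sees FE FF at the start of a UTF-16 file knows the data is
big-endian, and FF FE means little-endian. In UTF-8 the sequence EF BB BF
carries no such information; it can only act as an encoding signature.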
 

ANSWER 

The HTTP header charset declaration, or the HTML charset declaration (in the
absence of the HTTP header charset declaration, which takes precedence),
should normally be used to indicate the encoding. Therefore, if your UTF-8
encoded web page displays an unwanted blank line

*tex- add "or erroneous characters"
*tex- also maybe delete the first sentence, since we should also cover xml,
css, etc.

at the top, and you have an editor capable of displaying the Unicode BOM as
described above, you should remove from the beginning of the file the three
characters displayed as ï»¿.


*tex- delete "displayed as...". if you have a utf-8 editor you won't see that.
Actually if you have a bom knowledgeable editor you won't see anything. Instead
you (hopefully) have a choice to save with or without a bom. Let's say that
instead. 
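
For tools that expose this choice programmatically, here is a minimal Python
sketch of "save with or without a BOM". It relies on Python's "utf-8-sig"
codec, which writes the signature when encoding and skips it, if present, when
decoding. The sample text is hypothetical.

```python
# Hypothetical page content used only for illustration.
text = "<p>hello</p>"

# "utf-8-sig" prepends the BOM; plain "utf-8" does not.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

# Decoding with "utf-8-sig" drops a leading BOM if there is one,
# and leaves BOM-less input unchanged.
clean_a = with_bom.decode("utf-8-sig")
clean_b = without_bom.decode("utf-8-sig")

# Decoding BOM-prefixed data with plain "utf-8" keeps U+FEFF in the text.
kept = with_bom.decode("utf-8")
```

Reading a file with "utf-8-sig" and writing it back with "utf-8" is therefore
one way to save it without a BOM.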

*tex- if you have an editor that doesn't understand utf-8 or unicode, you may
be able to delete the BOM by removing the first few characters, but be careful
to remove exactly 3 bytes, no more and no less.

 

Alternatively, if you have a binary editor capable of displaying the
hexadecimal byte values in the file, you should remove the three bytes at the
beginning of the file whose values are displayed as EF BB BF.
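
The same byte-level removal can be sketched in Python. This is an
illustration under the assumption that the whole file fits in memory; the
function name strip_utf8_bom is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        # Remove exactly the three BOM bytes and nothing else.
        return data[len(codecs.BOM_UTF8):]
    # No BOM at the start: leave the data untouched.
    return data


# Example with in-memory bytes; a real file would be read and written in
# binary mode ("rb" / "wb") so that no other bytes are altered.
sample = b"\xef\xbb\xbf<html></html>"
cleaned = strip_utf8_bom(sample)
```

Checking for the exact sequence first means a file that has no BOM is never
shortened by mistake.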

 

Alternatively, you can use a Perl script to remove the characters. [Richard -
what extra benefit do you gain from this, rather than simply deleting?]

 
*tex- maybe remove the perl reference since we are not sure it works well for
all versions, platforms, etc.

 

REFERENCES
http://www.unicode.org/unicode/faq/utf_bom.html
http://www.dpawson.co.uk/xsl/sect2/N7702.html#d8771e16
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp
Unicode: A Primer, Tony Graham, M&T Books (IDG Books Worldwide), 2000
(http://www.mulberrytech.com/unicode/primer/)

*tex- I would remove the mulberry and pawson references.


-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Wednesday, 5 November 2003 10:30:30 UTC