- From: Tex Texin <tex@i18nguy.com>
- Date: Wed, 05 Nov 2003 10:29:37 -0500
- To: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
- Cc: public-i18n-geo@w3.org
Suggestions embedded by "*tex-":

QUESTION

When I'm using UTF-8 encoding, how do I remove the extra line that sometimes appears at the top of my server include or web page?

*tex- perhaps: How do I remove the erroneous characters or lines that are sometimes shown at the top of my utf-8 encoded files? Do we want to focus on removing, or just explaining why they might appear?

BACKGROUND

A particular combination of characters
*tex- bytes not chars. also sequence not combination.
is inserted by some applications to indicate that the text contained in the file is Unicode.
*tex- add "utf-8" after unicode.
This combination of characters
*tex- bytes not chars
is known as a Byte Order Mark (BOM). Some applications - such as a text editor or a browser - will interpret the BOM as an extra line in the file and will render it accordingly,
*tex- delete "and will render it accordingly"
others will display ï»¿.
*tex- use a graphic and say how it might look if treated as iso 8859-1. maybe add some other graphics for other encodings?

The BOM is the encoded Unicode Scalar Value (unique reference to a character defined within the Unicode repertoire), U+FEFF, corresponding to the Unicode character 'Zero Width Non-Breaking Space' (ZWNBSP), whose sequence of byte-values is judged to be unlikely to occur at the beginning of any 'normal' text file. Additionally, using U+FEFF for its 'original' purpose as a word joiner in new data is deprecated as of [Unicode3.2] in favour of U+2060 WORD JOINER. Therefore newer documents will only use U+FEFF as a BOM.

In UTF-16 and UTF-32 encodings, the BOM is essential to ensure correct interpretation of the file's contents, because each character in the file is composed of pairs of bytes of data. Also, the order in which these bytes are stored in the file is significant; the BOM indicates this order.

*tex- this is all good, but I wonder if it belongs in the background or perhaps should be moved to a separate faq "what's a bom?"
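To make the byte sequences concrete, here is a minimal sketch, assuming Python 3.8 or later and its standard `codecs` module. The helper names `bom_bytes` and `misread_utf8_bom` are made up for this illustration. It encodes U+FEFF in several encodings and shows what the UTF-8 form turns into when read as ISO 8859-1.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the code point used as the byte order mark


def bom_bytes(encoding: str) -> str:
    """Return U+FEFF encoded in `encoding`, as space-separated hex."""
    return BOM.encode(encoding).hex(" ").upper()


def misread_utf8_bom(encoding: str = "iso-8859-1") -> str:
    """Decode the UTF-8 BOM bytes as if they were a single-byte encoding."""
    return codecs.BOM_UTF8.decode(encoding)


if __name__ == "__main__":
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(f"{name:<10} {bom_bytes(name)}")
    # ascii() keeps the output printable on any terminal
    print(ascii(misread_utf8_bom()))
```

The UTF-8 form is always EF BB BF, whereas UTF-16 yields FE FF or FF FE depending on byte order. That difference is why the mark can signal byte order in the wider encodings.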
In UTF-8 encodings, the presence of the BOM is not essential because, unlike the UTF-16 or UTF-32 encodings, each byte of the data contained in a UTF-8 file is 'atomic'. This means it contains meta data which indicates whether it represents a whole character in itself, or whether it needs to be combined with a number (1-3) of further bytes to derive an encoded character.

*tex- this is wrong. The BOM is not needed in utf-8 because its code units are single bytes, so there is no question of how to order the bytes. utf-16 and utf-32, whose code units are larger than 1 byte, are ordered differently when written by big-endian or little-endian platforms, and need an indicator of which way they were written.
*tex- The meta data indicating the number of bytes per character is irrelevant to this question.

ANSWER

The HTTP header charset declaration, or HTML charset declaration (in the absence of the HTTP header charset declaration, which takes precedence), should normally be used to indicate the encoding. Therefore, if your UTF-8 encoded web page displays an unwanted blank line
*tex- add "or erroneous characters"
*tex- also maybe delete the first sentence, since we should also cover xml, css, etc.
at the top, and you have an editor capable of displaying the Unicode BOM as described above, you should remove from the beginning of the file the three characters displayed as ï»¿.

*tex- delete "displayed as...". if you have a utf-8 editor you won't see that. Actually if you have a bom-knowledgeable editor you won't see anything. Instead you (hopefully) have a choice to save with or without a bom. Let's say that instead.
*tex- if you have an editor that doesn't understand utf-8 or unicode, you may be able to delete the first few bytes by removing the first few characters, but be careful that you remove all 3 bytes, no more and no less.
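For anyone who would rather not hand-edit, removing exactly the three bytes and nothing else can be sketched roughly as follows. This assumes Python 3 and a file small enough to read into memory. The function names `strip_utf8_bom` and `remove_bom_from_file` are hypothetical, and the sketch has not been tried across platforms.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading EF BB BF if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def remove_bom_from_file(path: str) -> bool:
    """Rewrite the file without its UTF-8 BOM. Return True if one was removed."""
    p = Path(path)
    original = p.read_bytes()
    cleaned = strip_utf8_bom(original)
    if cleaned != original:
        p.write_bytes(cleaned)
        return True
    return False
```

Working on raw bytes means the file is never decoded, so its remaining content is left exactly as it was. When reading text rather than rewriting a file, Python's "utf-8-sig" codec skips a leading BOM on decode.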
Alternatively, if you have a binary editor capable of displaying the hexadecimal byte values in the file, you should remove the three bytes at the beginning of the file whose values are displayed as EF BB BF.

Alternatively, you can use a Perl script to remove the characters. [Richard - what extra benefit do you gain from this, rather than simply deleting?]

*tex- maybe remove the perl reference since we are not sure it works well for all versions, platforms, etc.

REFERENCES

http://www.unicode.org/unicode/faq/utf_bom.html
http://www.dpawson.co.uk/xsl/sect2/N7702.html#d8771e16
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp
Unicode: A Primer, Tony Graham, M&T Books (IDG Books Worldwide), 2000 (http://www.mulberrytech.com/unicode/primer/)

*tex- I would remove the mulberry and pawson references.

--
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Wednesday, 5 November 2003 10:30:30 UTC