Re: New FAQ: Removing UTF-8 BOM from Lloyd Honomichl on 2003-11-05 (public-i18n-geo@w3.org from November 2003)

From: Lloyd Honomichl <lloyd@honomichl.com>
Date: Wed, 5 Nov 2003 08:45:14 -0700
To: Tex Texin <tex@i18nguy.com>
Cc: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>, public-i18n-geo@w3.org
Message-Id: <0C0648E8-0FA7-11D8-A156-0050E43AB91A@honomichl.com>
Dang!  I printed the faq and went over it carefully.  But since Tex has  
a two-hour head start on me time-zone wise (plus he gets up before the  
rooster) he's already got all my points, except:

> Therefore newer documents will only use U+FEFF as a BOM.

Ought to be 'should', not 'will'.



On Wednesday, November 5, 2003, at 08:29  AM, Tex Texin wrote:

>
> suggestions embedded by "*tex-":
>
> QUESTION
>
> When I'm using UTF-8 encoding, how do I remove the extra line that  
> sometimes
> appears at the top of my server include
> or web page?
>
> *tex- perhaps:
> How do I remove the erroneous characters or lines that are sometimes  
> shown at
> the top of my utf-8 encoded files?
>
> do we want to focus on removing, or just explaining why they might  
> appear?
>
> BACKGROUND
>
> A particular combination of characters
>
> *tex- bytes not chars. also sequence not combination.
>
> is inserted by some applications to indicate that the text contained  
> in the
> file is Unicode.
>
> *tex- add "utf-8" after unicode.
>
> This combination of characters
>
> *tex- bytes not chars
>
> is known as a Byte Order Mark (BOM). Some applications - such as a text  
> editor or
> a browser - will interpret the BOM as an extra line in the file and  
> will render
> it accordingly,
>
> *tex- delete "and will render it accordingly"
>
> others will display ï»¿.
>
> *tex- use a graphic and say how it might look if treated as iso  
> 8859-1. maybe
> add some other graphics for other encodings?
>
>
> The BOM is the encoded Unicode Scalar Value (unique reference to a  
> character
> defined within the Unicode repertoire),
> U+FEFF, corresponding to the Unicode character 'Zero Width  
> Non-Breaking Space'
> (ZWNBSP), whose sequence of
> byte-values is judged to be unlikely to occur at the beginning of any  
> 'normal'
> text file. Additionally, using U+FEFF for its 'original' purpose as a  
> word
> joiner in new data is deprecated as of [Unicode3.2] in favour of  
> U+2060 WORD
> JOINER.
> Therefore newer documents will only use U+FEFF as a BOM.
>
> In UTF-16 and UTF-32 encodings, the BOM is essential to ensure correct
> interpretation of the file's contents, because
> each character in the file is composed of pairs of bytes of data.   
> Also, the
> order in which these bytes are stored in the file is significant; the  
> BOM
> indicates this order.
>
> *tex- this is all good, but I wonder if it belongs in the background  
> or perhaps
> moved to a separate faq "what's a bom?"
>
> In UTF-8 encodings, the presence of the BOM is not essential because,  
> unlike
> the UTF-16 or UTF-32 encodings, each
> byte of the data contained in a UTF-8 file is 'atomic'. This means it  
> contains
> meta data which indicates whether it
> represents a whole character in itself, or whether it needs to be  
> combined with
> a number (1-3) of further bytes to derive
> an encoded character.
>
>
> *tex- this is wrong. The BOM is not needed in utf-8 because being byte  
> values,
> there is no question of how to order the bytes. utf-16 and utf-32  
> being units
> greater than 1 byte, are ordered differently when written by  
> big-endian or
> little endian platforms, and need an indicator of which way they are  
> written.
>
> *tex- The meta data indicating the number of bytes per character is  
> irrelevant
> to this question.
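
To make the byte-order point concrete, here is a rough Python sketch (not part of the draft FAQ; the names BOM_CHAR, bom_bytes and as_latin1 are made up for illustration). It serializes U+FEFF in each encoding form. UTF-8 has only one possible byte sequence, while UTF-16 and UTF-32 each have two depending on endianness:

```python
# U+FEFF serialized in several Unicode encoding forms.
# The "-be"/"-le" codec names pin the byte order explicitly, so Python
# does not prepend a BOM of its own.
BOM_CHAR = "\ufeff"

encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
bom_bytes = {name: BOM_CHAR.encode(name) for name in encodings}

for name, raw in bom_bytes.items():
    # Print each byte as two uppercase hex digits, e.g. "EF BB BF".
    print(f"{name:10} {' '.join(f'{b:02X}' for b in raw)}")

# What a reader assuming ISO 8859-1 would show for the three UTF-8 bytes.
as_latin1 = bom_bytes["utf-8"].decode("iso-8859-1")
print(repr(as_latin1))
```

The last two lines also show why a non-UTF-8-aware tool displays three stray characters: each of the bytes EF BB BF maps to one ISO 8859-1 character.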
>
>
> ANSWER
>
> The HTTP header charset declaration, or HTML charset declaration (in  
> the
> absence of the HTTP header charset
> declaration, which takes precedence), should normally be used to  
> indicate the
> encoding. Therefore, if your UTF-8
> encoded web page displays an unwanted blank line
>
> *tex- add "or erroneous characters"
> *tex- also maybe delete the first sentence, since we should also cover  
> xml,
> css, etc.
>
> at the top, and you have an editor capable of displaying the Unicode  
> BOM as
> described above, you should remove from the beginning of the file the  
> three
> characters displayed as
> ï»¿.
>
>
> *tex- delete "displayed as...". if you have a utf-8 editor you won't  
> see that.
> Actually if you have a bom knowledgeable editor you won't see  
> anything. Instead
> you (hopefully) have a choice to save with or without a bom. Let's say  
> that
> instead.
>
> *tex- if you have an editor that doesn't understand utf-8 or unicode,  
> you may
> be able to delete the first few bytes by removing the first few  
> characters, but
> be careful you remove all 3 bytes and no more or less.
>
>
>
> Alternatively, if you have a binary editor capable of displaying the
> hexadecimal byte values in the file, you should remove the three bytes  
> at the
> beginning of the file whose values are displayed as EF BB BF.
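
For anyone without a binary editor, the same three-byte removal could be scripted. This is only a minimal Python sketch, not something the draft proposes; strip_utf8_bom and strip_bom_from_file are hypothetical helper names, and it assumes the file is small enough to read into memory. It works on raw bytes, so it removes exactly EF BB BF and nothing more or less:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a single leading EF BB BF, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(path: str) -> bool:
    """Rewrite the file at path without its UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    # Binary mode avoids any decoding, so no other bytes are altered.
    with open(path, "rb") as f:
        data = f.read()
    stripped = strip_utf8_bom(data)
    if len(stripped) == len(data):
        return False
    with open(path, "wb") as f:
        f.write(stripped)
    return True
```

Calling strip_bom_from_file on a file that has no BOM leaves it unchanged, so it should be safe to run over a whole directory of includes. Keep a backup, since the file is rewritten in place.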
>
>
>
> Alternatively, you can use a Perl script to remove the characters.  
> [Richard -
> what extra benefit do you gain from this,
> rather than simply deleting.]
>
>
> *tex- maybe remove the perl reference since we are not sure it works  
> well for
> all versions, platforms, etc.
>
>
>
> REFERENCES
> http://www.unicode.org/unicode/faq/utf_bom.html
> http://www.dpawson.co.uk/xsl/sect2/N7702.html#d8771e16
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp
> Unicode: A Primer, Tony Graham, M&T Books (IDG Books Worldwide), 2000
> (http://www.mulberrytech.com/unicode/primer/)
>
> *tex- I would remove the mulberry and pawson references.
>
>
> -- 
> -------------------------------------------------------------
> Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
> Xen Master                          http://www.i18nGuy.com
>
> XenCraft		            http://www.XenCraft.com
> Making e-Business Work Around the World
> -------------------------------------------------------------
>
>
Received on Wednesday, 5 November 2003 10:46:25 UTC