- From: Lloyd Honomichl <lloyd@honomichl.com>
- Date: Wed, 5 Nov 2003 08:45:14 -0700
- To: Tex Texin <tex@i18nguy.com>
- Cc: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>, public-i18n-geo@w3.org
Dang! I printed the FAQ and went over it carefully. But since Tex has a
two-hour head start on me time-zone-wise (plus he gets up before the rooster),
he's already got all my points, except:

> Therefore newer documents will only use U+FEFF as a BOM.

Ought to be 'should', not 'will'.

On Wednesday, November 5, 2003, at 08:29 AM, Tex Texin wrote:

>
> suggestions embedded by "*tex-":
>
> QUESTION
>
> When I'm using UTF-8 encoding, how do I remove the extra line that sometimes
> appears at the top of my server include or web page?
>
> *tex- perhaps:
> How do I remove the erroneous characters or lines that are sometimes shown at
> the top of my utf-8 encoded files?
>
> do we want to focus on removing, or just explaining why they might appear?
>
> BACKGROUND
>
> A particular combination of characters
>
> *tex- bytes not chars. also sequence not combination.
>
> is inserted by some applications to indicate that the text contained in the
> file is Unicode.
>
> *tex- add "utf-8" after unicode.
>
> This combination of characters
>
> *tex- bytes not chars
>
> is known a Byte Order Mark (BOM). Some applications - such as a text editor or
> a browser - will interpret the BOM as an extra line in the file and will render
> it accordingly,
>
> *tex- delete "and will render it accordingly"
>
> others will display .
>
> *tex- use a graphic and say how it might look if treated as iso 8859-1. maybe
> add some other graphics for other encodings?
>
>
> The BOM is the encoded Unicode Scalar Value (unique reference to a character
> defined within the Unicode repertoire), U+FEFF, corresponding to the Unicode
> character 'Zero Width Non-Breaking Space' (ZWNBSP), whose sequence of
> byte-values is judged to be unlikely to occur at the beginning of any 'normal'
> text file. Additionally, using U+FEFF for its 'original' purpose as a word
> joiner in new data is deprecated as of [Unicode3.2] in favour of
> U+2060 WORD JOINER.
> Therefore newer documents will only use U+FEFF as a BOM.
>
> In UTF-16 and UTF-32 encodings, the BOM is essential to ensure correct
> interpretation of the file's contents, because each character in the file is
> composed of pairs of bytes of data. Also, the order in which these bytes are
> stored in the file is significant; the BOM indicates this order.
>
> *tex- this is all good, but I wonder if it belongs in the background or perhaps
> moved to a separate faq "what's a bom?"
>
> In UTF-8 encodings, the presence of the BOM is not essential because, unlike
> the UTF-16 or UTF-32 encodings, each byte of the data contained in a UTF-8 file
> is 'atomic'. This means it contains meta data which indicates whether it
> represents a whole character in itself, or whether it needs to be combined with
> a number (1-3) of further bytes to derive an encoded character.
>
>
> *tex- this is wrong. The BOM is not needed in utf-8 because being byte values,
> there is no question of how to order the bytes. utf-16 and utf-32 being units
> greater than 1 byte, are ordered differently when written by big-endian or
> little endian platforms, and need an indicator of which way they are written.
>
> *tex- The meta data indicating the number of bytes per character is irrelevant
> to this question.
>
>
> ANSWER
>
> The HTTP header charset declaration, or HTML charset declaration (in the
> absence of the HTTP header charset declaration, which takes precedence), should
> normally be used to indicate the encoding. Therefore, if your UTF-8 encoded web
> page displays an unwanted blank line
>
> *tex- add "or erroneous characters"
> *tex- also maybe delete the first sentence, since we should also cover xml,
> css, etc.
>
> at the top, and you have an editor capable of displaying the Unicode BOM as
> described above, you should remove from the beginning of the file the three
> characters displayed as
> .
>
>
> *tex- delete "displayed as...".
> if you have a utf-8 editor you won't see that.
> Actually if you have a bom knowledgeable editor you won't see anything. Instead
> you (hopefully) have a choice to save with or without a bom. Let's say that
> instead.
>
> *tex- if you have an editor that doesn't understand utf-8 or unicode, you may
> be able to delete the first few bytes by removing the first few characters, but
> be careful you remove all 3 bytes and no more or less.
>
>
> Alternatively, if you have a binary editor capable of displaying the
> hexadecimal byte values in the file, you should remove the three bytes at the
> beginning of the file whose values are displayed as EF BB BF.
>
>
> Alternatively, you can use a Perl script to remove the characters. [Richard -
> what extra benefit do you gain from this, rather than simply deleting.]
>
>
> *tex- maybe remove the perl reference since we are not sure it works well for
> all versions, platforms, etc.
>
>
> REFERENCES
> http://www.unicode.org/unicode/faq/utf_bom.html
> http://www.dpawson.co.uk/xsl/sect2/N7702.html#d8771e16
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp
> Unicode: A Primer, Tony Graham, M&T Books (IDG Books Worldwide), 2000
> (http://www.mulberrytech.com/unicode/primer/)
>
> *tex- I would remove the mulberry and pawson references.
>
>
> --
> -------------------------------------------------------------
> Tex Texin cell: +1 781 789 1898 mailto:Tex@XenCraft.com
> Xen Master http://www.i18nGuy.com
>
> XenCraft http://www.XenCraft.com
> Making e-Business Work Around the World
> -------------------------------------------------------------
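
For anyone who would rather script the removal than hand-edit bytes, here is a minimal sketch in Python (not the Perl script mentioned in the draft, which is not reproduced in this thread). The function names strip_utf8_bom, bom_bytes and bom_as_latin1 are made up for illustration. It relies only on the standard codecs module, and it also shows Tex's point that the byte order of U+FEFF differs between big-endian and little-endian UTF-16 while UTF-8 has a single fixed byte sequence.

```python
import codecs

# codecs.BOM_UTF8 is the three-byte sequence EF BB BF discussed in the draft.
UTF8_BOM = codecs.BOM_UTF8


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; unchanged if none is present."""
    if data.startswith(UTF8_BOM):
        # Remove exactly three bytes, no more and no less.
        return data[len(UTF8_BOM):]
    return data


def bom_bytes(encoding: str) -> bytes:
    """Encode U+FEFF in the given encoding to show how its bytes are ordered."""
    return "\ufeff".encode(encoding)


def bom_as_latin1() -> str:
    """Show what the UTF-8 BOM looks like when misread as ISO 8859-1."""
    return UTF8_BOM.decode("iso-8859-1")


if __name__ == "__main__":
    sample = UTF8_BOM + "<p>hello</p>".encode("utf-8")
    print(strip_utf8_bom(sample))              # bytes without the BOM
    print(bom_bytes("utf-16-be").hex(" "))     # big-endian byte order
    print(bom_bytes("utf-16-le").hex(" "))     # little-endian byte order
    print(repr(bom_as_latin1()))               # the three stray characters
```

To clean a file, one would read it in binary mode, pass the bytes through strip_utf8_bom, and write the result back; this assumes the file really is UTF-8 and is small enough to hold in memory.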
Received on Wednesday, 5 November 2003 10:46:25 UTC