RE: UTF-8 BOM FAQ from Richard Ishida on 2003-11-17 (public-i18n-geo@w3.org from November 2003)

From: Richard Ishida <ishida@w3.org>
Date: Mon, 17 Nov 2003 08:01:28 -0000
To: <public-i18n-geo@w3.org>
Message-ID: <001001c3ace1$0506d9e0$0d00a8c0@w3c40upc3ma3j2>
Deborah's FAQ is at
http://www.w3.org/International/questions/qa-utf8-bom.html

RI

============
Richard Ishida
W3C

contact info: http://www.w3.org/People/Ishida/ 

http://www.w3.org/International/ 
http://www.w3.org/International/geo/ 

W3C Internationalization FAQs
http://www.w3.org/International/questions.html
RSS feed: http://www.w3.org/International/questions.rss



> -----Original Message-----
> From: public-i18n-geo-request@w3.org 
> [mailto:public-i18n-geo-request@w3.org] On Behalf Of Richard Ishida
> Sent: 17 November 2003 07:59
> To: 'Deborah Cawkwell'; public-i18n-geo@w3.org
> Subject: RE: UTF-8 BOM FAQ
> 
> 
> 
> Thanks Deborah !
> 
> I have updated the FAQ online with the suggested changes.  I 
> think this is quite a lot better.  I haven't added the 
> background stuff yet though. There are three main reasons: a. 
> it seems rather long, b. I'm not sure whether such detail is 
> appropriate in *this* FAQ, c. I haven't had the opportunity 
> to read it in detail yet.  Anyone else reading this, please 
> send in your thoughts!
> 
> I left in change marks so people can see what's different.
> 
> Wrt "which browsers display the extra line when encountering 
> the UTF-8 BOM" try Netscape 4.8.
> 
> RI
> 
> 
> > -----Original Message-----
> > From: public-i18n-geo-request@w3.org
> > [mailto:public-i18n-geo-request@w3.org] On Behalf Of Deborah Cawkwell
> > Sent: 16 November 2003 22:18
> > To: public-i18n-geo@w3.org
> > Subject: UTF-8 BOM FAQ
> > 
> > 
> > 
> > Hi All
> > 
> > Apologies that the FAQ is in text.
> > 
> > I have slanted the answer differently following feedback.
> > - Following the conference call, I have not yet identified
> > which browsers display the extra line when encountering the 
> > UTF-8 BOM.
> 
> Netscape 4.8 does
> 
> > - Re removing the BOM, we have found no problem re-opening
> > the file (in Notepad, which is the only text editor I know 
> > that displays readable text, eg Persian).
> > - I feel strongly about including the UTF-8 table because it 
> > clarifies so much (for me anyway).
> > 
> > For the next three days, I will be out of the office.
> > 
> > For those on the conference call (& anyone else), a bit more
> > localised information about UK Bonfire night and Guy Fawkes: 
> > http://www.bbc.co.uk/dna/h2g2/A199488.
> > 
> > Deborah
> > 
> > ------------------------------------
> > 
> > FAQ: Unexpected blank lines or characters with UTF-8 encoding
> > question - background - answer - by the way - useful links
> > 
> > Question
> > When I'm using a UTF-8 encoding, why does an extra line
> > appear at the top of my web page, and how do I remove it?
> > 
> > Answer
> > See the Background information.
> > 
> > This may be caused by the presence of a UTF-8 signature at
> > the beginning of the file, which the user agent doesn't 
> > recognize. Note that a number of more recent browsers, such 
> > as the latest Windows-based versions of Internet Explorer, 
> > Mozilla (Netscape) and Opera, do not exhibit this behaviour.
> > 
> > You may not be able to see the cause of the extra line or
> > space in your editor if it interprets UTF-8 correctly. An 
> > editor that does not interpret UTF-8 correctly displays the 
> > UTF-8 signature according to its own character encoding 
> > setting. With the Latin 1 (ISO 8859-1) character encoding, the 
> > signature displays as the extraneous characters ï»¿. With a 
> > binary editor capable of displaying the hexadecimal byte 
> > values in the file, the UTF-8 signature displays as EF BB BF. 
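
The two views of the signature described above can be reproduced with a minimal sketch, assuming Python 3. The variable names are illustrative only and do not come from the FAQ.

```python
# The three bytes of the UTF-8 signature (the UTF-8 encoded form of U+FEFF).
signature = b"\xef\xbb\xbf"

# Interpreted as UTF-8, the three bytes form one invisible character.
as_utf8 = signature.decode("utf-8")

# Interpreted as ISO 8859-1, each byte becomes its own visible character.
as_latin1 = signature.decode("iso-8859-1")

# The view a binary (hex) editor would give.
hex_view = " ".join(f"{byte:02X}" for byte in signature)

if __name__ == "__main__":
    print(len(as_utf8), len(as_latin1))  # 1 3
    print(as_latin1)                     # ï»¿
    print(hex_view)                      # EF BB BF
```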
> > 
> > To remove the extra line or spaces that appear in the
> > browser, remove the extraneous characters, which represent 
> > the UTF-8 signature. You can remove them manually or with a 
> > script. One of the benefits of using a script is that you can 
> > remove the extraneous characters from multiple files.
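
A script of the kind mentioned above might look like the following sketch, assuming Python 3. The function names and the default "*.html" pattern are hypothetical choices for illustration, not something the FAQ specifies.

```python
import codecs
from pathlib import Path


def strip_utf8_signature(path):
    """Remove a leading UTF-8 signature from one file.

    Returns True if a signature was found and removed, False otherwise.
    """
    path = Path(path)
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):  # BOM_UTF8 is b"\xef\xbb\xbf"
        return False
    # Only the first three bytes are dropped; the rest is written back as-is.
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def strip_all(directory, pattern="*.html"):
    """Strip the signature from every matching file below `directory`.

    Returns the list of files that were changed.
    """
    changed = []
    for path in sorted(Path(directory).rglob(pattern)):
        if strip_utf8_signature(path):
            changed.append(path)
    return changed
```

Because the file is handled as raw bytes, nothing after the signature is re-encoded, so characters outside the ASCII range are left exactly as they were. Try it on copies of the files first.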
> > 
> > You should check thoroughly the result of removing the
> > signature, bearing in mind that pages with a high proportion 
> > of Latin characters may look correct superficially, but that 
> > characters outside the ASCII compatibility range (U+0000 to 
> > U+007F) may be incorrectly encoded.
> > 
> > If there is no evidence of a UTF-8 signature at the beginning
> > of the file, then your problem lies elsewhere.
> > 
> > 
> > Background
> > 
> > An editor that does not correctly interpret Unicode
> > (encodings: UTF-8, UTF-16, UTF-32) treats each byte as 
> > referring to one character (some editors may assume two bytes 
> > per character); the character referred to by that byte value 
> > depends on the encoding assumed by the editor. An editor that 
> > does correctly interpret UTF-8 recognises that a character 
> > can require 1-4 bytes. In UTF-8 encoding, the 
> > number of bytes used to encode a Unicode scalar value 
> > is determined by the first byte.
> > 
> > Unicode character       UTF-8 byte 1  UTF-8 byte 2  UTF-8 byte 3  UTF-8 byte 4
> > 
> > 0000 to 007F (ASCII)    0xxxxxxx
> > 0080 to 07FF            110xxxxx      10xxxxxx
> > 0800 to FFFF            1110xxxx      10xxxxxx      10xxxxxx
> > 10000 to 10FFFF         11110xxx      10xxxxxx      10xxxxxx      10xxxxxx
> > 
> > All Unicode characters that fall outside the ASCII
> > compatibility range (0 to 127 decimal) are encoded in UTF-8 
> > using only byte values greater than 127 decimal, which explains 
> > why they display as 'strange' characters in non-UTF-8 compliant 
> > editors and browsers.
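
The rule that the first byte determines the sequence length can be sketched as follows, assuming Python 3. The function name is hypothetical, and the check looks only at the leading bit pattern; it does not validate the bytes that follow or reject lead bytes that well-formed UTF-8 never uses.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if 0xC0 <= first_byte < 0xE0:    # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= first_byte < 0xF0:    # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= first_byte < 0xF8:    # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never starts a sequence.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}", encoded.hex(), utf8_sequence_length(encoded[0]))
```

For any character outside the ASCII range, every byte of the sequence has its top bit set, which is the "greater than 127" property noted above.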
> > 
> > The UTF-8 signature is also known as the Unicode UTF-8 Byte
> > Order Mark (BOM). UTF-8 is one encoding of the Unicode 
> > repertoire. Others include UTF-16 and UTF-32. All three 
> > encodings encode the same Unicode character repertoire, but 
> > they differ in the sequence of byte values which refer to 
> > that repertoire. 
> > 
> > - UTF-8 uses 1-4 bytes; the first byte of each encoded
> > character determines how many subsequent bytes in the 
> > sequence are required.
> > - UTF-16 uses 2 or 4 bytes; if more than two bytes are 
> > required, then the first two bytes hold a reserved value 
> > in the Unicode repertoire (a surrogate), which indicates that 
> > the next two bytes must be combined with it to obtain the 
> > final value.
> > - UTF-32 always uses four bytes.
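
The byte counts in the list above can be checked with a small sketch, assuming Python 3. The helper name is hypothetical; the big-endian codec names are used only so that Python does not prepend a BOM to the output.

```python
def encoded_lengths(ch):
    """Return the number of bytes one character occupies in each encoding."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")  # BE variants add no BOM
    return {name: len(ch.encode(name)) for name in encodings}


if __name__ == "__main__":
    # U+0041 (ASCII letter A): 1, 2 and 4 bytes.
    print(encoded_lengths("A"))
    # U+20AC (euro sign): 3, 2 and 4 bytes.
    print(encoded_lengths("\u20ac"))
    # U+1D11E (outside the first 65,536 code points): 4, 4 and 4 bytes.
    print(encoded_lengths("\U0001D11E"))
    # In UTF-16 the last one is stored as the surrogate pair D834 DD1E.
    print("\U0001D11E".encode("utf-16-be").hex())  # d834dd1e
```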
> > 
> > All three Unicode encodings (UTF-8, UTF-16, UTF-32) can use
> > the signature or BOM (Byte Order Mark). With UTF-8, 
> > the BOM serves only as a 'signature' of Unicode; for 
> > UTF-16 and UTF-32, the BOM has a further purpose, which 
> > is to indicate the order in which the bytes should be read. 
> > This order varies according to processor architecture, which 
> > can be 'big' or 'little' 'endian':
> > 
> > 	- Macs - Motorola, PowerPC = big endian
> > 	- PCs - Intel = little endian	
> > 	- UNIX - different processors, therefore big or little endian
> > 	
> > There are other Unicode encodings which do not require the
> > presence of the BOM; these are "UTF-16LE", "UTF-16BE", 
> > "UTF-32LE" and "UTF-32BE".
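
How a reader of a file might use the BOM to tell these encodings apart can be sketched as follows, assuming Python 3. The function name and the returned labels are hypothetical. The UTF-32 marks are tested first because the little-endian UTF-32 BOM (FF FE 00 00) begins with the little-endian UTF-16 BOM (FF FE).

```python
import codecs

# Longer marks first, so a UTF-32LE BOM is not mistaken for a UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32, big endian"),     # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32, little endian"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),                      # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16, big endian"),     # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16, little endian"),  # FF FE
]


def sniff_bom(data):
    """Return a label for the BOM at the start of `data`, or None if absent."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


if __name__ == "__main__":
    # The same character in both byte orders, with no BOM added.
    print("A".encode("utf-16-be").hex())  # 0041
    print("A".encode("utf-16-le").hex())  # 4100
    print(sniff_bom(codecs.BOM_UTF16_LE + "A".encode("utf-16-le")))
    print(sniff_bom("A".encode("utf-16-le")))  # None: no BOM to go on
```

Without a BOM, as in the last line, the byte order has to be known some other way, which is what the "LE" and "BE" encoding names provide.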
> > 
> > By the way
> > You will find that Windows Notepad and Helios Textpad will
> > automatically add a UTF-8 signature to any file you save as UTF-8.
> > 
> > A UTF-8 signature at the beginning of a CSS file can
> > sometimes cause the initial rules in the file to fail on 
> > certain user agents.
> > 
> > Cutting and pasting UTF-8 text between different applications
> > can have unexpected results, even if both applications are 
> > nominally UTF-8 aware.
> > 
> > Useful links
> > Unicode FAQ about the Byte Order Mark:
> > http://www.unicode.org/unicode/faq/utf_bom.html
> > 
> > Microsoft documentation about the Byte Order Mark:
> > http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp
> > 
> > Apache content negotiation documentation:
> > http://httpd.apache.org/docs/content-negotiation.html
> > 
> > 
> > 
>
Received on Monday, 17 November 2003 03:02:11 UTC