- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 17 Nov 2003 23:21:37 -0500
- To: <ishida@w3.org>, <public-i18n-geo@w3.org>
Hello Richard, Deborah,

Very good work. Some comments below.

At 08:01 03/11/17 +0000, Richard Ishida wrote:
>Deborah's FAQ is at
>http://www.w3.org/International/questions/qa-utf8-bom.html

I would like to change the question from:
"When I'm using a UTF-8 encoding, why does an extra line appear at the top of my web page, and how do I remove it?"
to
"When I'm using a UTF-8 encoding, why may an extra line appear at the top of my web page, and how do I remove it?"
(i.e. change from 'does' to 'may')
I think it's important we don't give, in one way or another, the impression that there is something wrong with UTF-8.

Answer, second paragraph: It's not clear to me whether 'interpreting UTF-8 correctly' refers to "interpreting the file as being encoded in UTF-8", or "interpreting the BOM correctly (i.e. not showing it)".

"Latin 1 ISO 8859-1 character encoding": use clearer terminology

"hexadecimal byte values" -> "byte values in hexadecimal notation"

"remove the extraneous characters, which represent the UTF-8 signature" -> "remove the extraneous bytes, which represent the UTF-8 signature"

Background: the first paragraph says the BOM is a sequence of bytes. The second paragraph says it's a character.

"In UTF-16 and UTF-32 encodings": If the reader hasn't got the message that we are speaking about encodings, we probably need to fix something earlier on. Please change to "UTF-16 and UTF-32" (several similar cases).

"Each character in the file is composed of 2 to 4 bytes of data": probably better to replace 'composed of' with 'represented by'.

"You will find that Windows Notepad and Helios Textpad will": Are there others? Do we know that there are no others? Word it as "some text editors such as ...".

"Cutting and pasting UTF-8 text between different applications can have unexpected results, even if both applications are nominally UTF-8 aware." Can we be more specific here? Otherwise, that is more confusing than helpful.

Regards, Martin.
>RI
>
>============
>Richard Ishida
>W3C
>
>contact info: http://www.w3.org/People/Ishida/
>
>http://www.w3.org/International/
>http://www.w3.org/International/geo/
>
>W3C Internationalization FAQs
>http://www.w3.org/International/questions.html
>RSS feed: http://www.w3.org/International/questions.rss
>
> > -----Original Message-----
> > From: public-i18n-geo-request@w3.org
> > [mailto:public-i18n-geo-request@w3.org] On Behalf Of Richard Ishida
> > Sent: 17 November 2003 07:59
> > To: 'Deborah Cawkwell'; public-i18n-geo@w3.org
> > Subject: RE: UTF-8 BOM FAQ
> >
> > Thanks Deborah!
> >
> > I have updated the FAQ online with the suggested changes. I think this is quite a lot better. I haven't added the background stuff yet though. There are three main reasons: a. it seems rather long, b. I'm not sure whether such detail is appropriate in *this* FAQ, c. I haven't had the opportunity to read it in detail yet. Anyone else reading this, please send in your thoughts!
> >
> > I left in change marks so people can see what's different.
> >
> > Wrt "which browsers display the extra line when encountering the UTF-8 BOM" try Netscape 4.8.
> >
> > RI
> >
> > > -----Original Message-----
> > > From: public-i18n-geo-request@w3.org
> > > [mailto:public-i18n-geo-request@w3.org] On Behalf Of Deborah Cawkwell
> > > Sent: 16 November 2003 22:18
> > > To: public-i18n-geo@w3.org
> > > Subject: UTF-8 BOM FAQ
> > >
> > > Hi All
> > >
> > > Apologies that the FAQ is in text.
> > >
> > > I have slanted the answer differently following feedback.
> > > - Following the conference call, I have not yet identified which browsers display the extra line when encountering the UTF-8 BOM.
> >
> > Netscape 4.8 does
> >
> > > - Re removing the BOM, we have found no problem re-opening the file (in Notepad, which is the only text editor I know that displays readable text, eg Persian).
> > > - I feel strongly about including the UTF-8 table because it clarifies so much (for me anyway).
> > >
> > > For the next three days, I will be out of the office.
> > >
> > > For those on the conference call (& anyone else), a bit more localised information about UK Bonfire night and Guy Fawkes: http://www.bbc.co.uk/dna/h2g2/A199488.
> > >
> > > Deborah
> > >
> > > ------------------------------------
> > >
> > > FAQ: Unexpected blank lines or characters with UTF-8 encoding
> > > question - background - answer - by the way - useful links
> > >
> > > Question
> > > When I'm using a UTF-8 encoding, why does an extra line appear at the top of my web page, and how do I remove it?
> > >
> > > Answer
> > > See the Background information.
> > >
> > > This may be caused by the presence of a UTF-8 signature at the beginning of the file, which the user agent doesn't recognize. Note that a number of more recent browsers, such as the latest Windows-based versions of Internet Explorer, Mozilla (Netscape) and Opera, do not exhibit this behaviour.
> > >
> > > You may not be able to see the cause of the extra line or space in your editor, if it interprets UTF-8 correctly. An editor which does not interpret UTF-8 correctly, displays the UTF-8 signature according to its own character encoding setting. With the Latin 1 ISO 8859-1 character encoding, the signature displays as extraneous characters ï»¿. With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.
> > >
> > > To remove the extra line or spaces that appear in the browser, remove the extraneous characters, which represent the UTF-8 signature. You can remove them manually or with a script. One of the benefits of using a script is that you can remove the extraneous characters from multiple files.
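
[Editorial illustration, not part of the original thread.] The "remove them ... with a script" step could look roughly like the sketch below. It is only a minimal sketch under stated assumptions: the function names `strip_utf8_signature` and `strip_signature_from_files` are hypothetical, Python is simply the language chosen for illustration, and files are rewritten in place, so work on copies first.

```python
import codecs
from pathlib import Path


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_signature_from_files(paths):
    """Rewrite each file in place, but only if it starts with the signature.

    Returns the list of paths that were actually changed.
    (Hypothetical helper for illustration; it works on raw bytes and
    never decodes the rest of the file, so other bytes stay untouched.)
    """
    changed = []
    for path in map(Path, paths):
        original = path.read_bytes()
        stripped = strip_utf8_signature(original)
        if stripped != original:
            path.write_bytes(stripped)
            changed.append(path)
    return changed
```

Because the sketch operates on bytes rather than decoded text, characters outside the ASCII range should not be re-encoded by it. Decoding the three signature bytes as ISO 8859-1, e.g. `b"\xef\xbb\xbf".decode("latin-1")`, gives the three extraneous characters mentioned in the answer.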
> > >
> > > You should check thoroughly the result of removing the signature, bearing in mind that pages with a high proportion of Latin characters may look correct superficially, but that characters outside the ASCII compatibility range (U+0000 to U+007F) may be incorrectly encoded.
> > >
> > > If there is no evidence of a UTF-8 signature at the beginning of the file, then your problem lies elsewhere.
> > >
> > >
> > > Background
> > >
> > > An editor that does not correctly interpret Unicode (encodings: UTF-8, UTF-16, UTF-32) recognises each byte as referring to one character (some editors may assume two-bytes per character); the character referred to by that byte value depends on the encoding assumed by the editor. An editor that does correctly interpret UTF-8 recognises that a character reference can require 1-4 bytes. In UTF-8 encoding, the number of bytes used to refer to a Unicode Scalar Value in the Unicode repertoire is determined by the first byte.
> > >
> > > Unicode character       UTF-8 byte 1  UTF-8 byte 2  UTF-8 byte 3  UTF-8 byte 4
> > > 0000 to 007F (ASCII)    0xxxxxxx
> > > 0080 to 07FF            110xxxxx      10xxxxxx
> > > 0800 to FFFF            1110xxxx      10xxxxxx      10xxxxxx
> > > 10000 to 10FFFF         11110xxx      10xxxxxx      10xxxxxx      10xxxxxx
> > >
> > > All Unicode characters encoded in UTF-8, which fall outside the ASCII compatibility range (0 to 127 decimal), have byte values greater than 127 decimal, which explains why they display as 'strange' characters in non-UTF-8 compliant editors and browsers.
> > >
> > > The UTF-8 signature is also known as the Unicode UTF-8 Byte Order Mark (BOM). UTF-8 is one encoding of the Unicode repertoire. Others include UTF-16 and UTF-32. All three encodings encode the same Unicode character repertoire, but they differ in the sequence of byte values which refer to that repertoire.
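
[Editorial illustration, not part of the original thread.] The rule in the table above, that the first byte determines the length of a UTF-8 sequence, can be sketched as follows. The sketch takes the one-byte pattern to be 0xxxxxxx. The function name `utf8_sequence_length` and the four sample characters are assumptions chosen for illustration. The check is deliberately simplified: it looks only at the leading bit pattern and does not reject lead bytes that well-formed UTF-8 never uses (such as C0, C1, or F5 and above).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its first byte only."""
    if first_byte & 0b10000000 == 0b00000000:   # 0xxxxxxx: U+0000 to U+007F
        return 1
    if first_byte & 0b11100000 == 0b11000000:   # 110xxxxx: U+0080 to U+07FF
        return 2
    if first_byte & 0b11110000 == 0b11100000:   # 1110xxxx: U+0800 to U+FFFF
        return 3
    if first_byte & 0b11111000 == 0b11110000:   # 11110xxx: U+10000 to U+10FFFF
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError(f"not a UTF-8 lead byte: {first_byte:#04x}")


# One sample character per row of the table (samples are illustrative).
for ch in ("A", "\u00e9", "\u20ac", "\U00010000"):
    encoded = ch.encode("utf-8")
    # The length predicted from the first byte matches the encoder's output.
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    print(f"U+{ord(ch):04X}", encoded.hex(" "), len(encoded), "byte(s)")
```

Every byte of a multi-byte sequence has its high bit set, which is why such characters show up as values above 127 in an editor that assumes one byte per character.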
> > >
> > > - UTF-8 uses 1-4 bytes; the first byte of each encoded character determines how many subsequent bytes in the sequence are required.
> > > - UTF-16 uses 2 or 4 bytes; if more than two bytes are required, then the first two bytes refer to a reserved value in the Unicode repertoire, which indicates that the next two bytes should be used to obtain the final value.
> > > - UTF-32 always uses four bytes.
> > >
> > > All three Unicode encodings (UTF-8, UTF-16, UTF-32) can use the signature or BOM (Byte Order Mark). Whilst with UTF-8, the BOM serves one purpose as a 'signature' of Unicode, for UTF-16 and UTF-32, the BOM has a further purpose. That purpose is to indicate the order in which the bytes should be read. This order varies according to processor architecture, which can be 'big' or 'little' 'endian':
> > >
> > > - Macs - Motorola, PowerPC = big endian
> > > - PCs - Intel = little endian
> > > - UNIX - different processors, therefore big or little endian
> > >
> > > There are other Unicode encodings which do not require the presence of the BOM; these are "UTF-16LE", "UTF-16BE", "UTF-32LE" and "UTF-32BE".
> > >
> > > By the way
> > > You will find that Windows Notepad and Helios Textpad will automatically add a UTF-8 signature to any file you save as UTF-8.
> > >
> > > A UTF-8 signature at the beginning of a CSS file can sometimes cause the initial rules in the file to fail on certain user agents.
> > >
> > > Cutting and pasting UTF-8 text between different applications can have unexpected results, even if both applications are nominally UTF-8 aware.
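
[Editorial illustration, not part of the original thread.] Returning to the byte order point in the Background section: the sketch below shows how the same character is serialised in the two UTF-16 byte orders, and how a leading signature might be recognised. The function name `detect_signature` and its return strings are hypothetical; the BOM constants come from Python's standard `codecs` module. It is a sketch, not a complete encoding detector: data without a signature simply yields `None`, and UTF-16 little-endian text that begins with U+0000 right after its BOM would be misreported as UTF-32.

```python
import codecs
from typing import Optional

# The same character, serialised in each UTF-16 byte order (no BOM added
# by these two codecs): the two bytes are simply swapped.
assert "A".encode("utf-16-be") == b"\x00A"
assert "A".encode("utf-16-le") == b"A\x00"

# UTF-32 signatures are checked first, because the UTF-32 little-endian
# BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32, big endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, big endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little endian"),
]


def detect_signature(data: bytes) -> Optional[str]:
    """Name the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None
```

For UTF-8 the result says nothing about byte order, since UTF-8 is a byte sequence with a single fixed order; there the BOM acts only as a signature.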
> > >
> > > Useful links
> > > Unicode FAQ about the Byte Order Mark:
> > > http://www.unicode.org/unicode/faq/utf_bom.html
> > >
> > > Microsoft documentation about the Byte Order Mark:
> > > http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_42jv.asp
> > >
> > > Apache content negotiation documentation:
> > > http://httpd.apache.org/docs/content-negotiation.html
> > >
> > >
> > > BBCi at http://www.bbc.co.uk/
> > >
> > > This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated. If you have received it in error, please delete it from your system.
> > > Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately.
> > > Please note that the BBC monitors e-mails sent or received.
> > > Further communication will signify your consent to this.
Received on Monday, 17 November 2003 23:26:30 UTC