RE: UTF-8 BOM FAQ from Martin Duerst on 2003-11-18 (public-i18n-geo@w3.org from November 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 17 Nov 2003 23:21:37 -0500
To: <ishida@w3.org>, <public-i18n-geo@w3.org>
Message-Id: <4.2.0.58.J.20031117215517.05cc15b8@localhost>
Hello Richard, Deborah,

Very good work. Some comments below.

At 08:01 03/11/17 +0000, Richard Ishida wrote:

>Deborah's FAQ is at
>http://www.w3.org/International/questions/qa-utf8-bom.html

I would like to change the question from:
"When I'm using a UTF-8 encoding, why does an extra line appear at the top 
of my web page, and how do I remove it?"
to
"When I'm using a UTF-8 encoding, why may an extra line appear at the top 
of my web page, and how do I remove it?"

(i.e. change from 'does' to 'may')

I think it's important we don't give, in one way or another,
the impression that there is something wrong with UTF-8.

Answer, second paragraph: It's not clear to me whether 'interpreting
UTF-8 correctly' refers to "interpreting the file as being encoded
in UTF-8", or "interpreting the BOM correctly (i.e. not showing it)".

"Latin 1 ISO 8859-1 character encoding": use clearer terminology

"hexadecimal byte values" -> "byte values in hexadecimal notation"

"remove the extraneous characters, which represent the UTF-8 signature"
-> "remove the extraneous bytes, which represent the UTF-8 signature"

Background: the first paragraph says the BOM is a sequence of bytes.
the second paragraph says it's a character.

"In UTF-16 and UTF-32 encodings": If the reader hasn't got the message
that we are speaking about encodings, we probably need do fix something
earlier on. Please change to "UTF-16 and UTF-32" (several similar cases).

"Each character in the file is composed of 2 to 4 bytes of data":
probably better to replace 'composed of' with 'represented by'.

"You will find that Windows Notepad  and Helios Textpad will":
Are there others? Do we know that there are no others? Word as
"some text editors such as ...".

"Cutting and pasting UTF-8 text between different applications can have
unexpected results, even if both applications are nominally UTF-8 aware."
Can we be more specific here? Otherwise, that is more confusing than helpful.

Regards,    Martin.


>RI
>
>============
>Richard Ishida
>W3C
>
>contact info: http://www.w3.org/People/Ishida/
>
>http://www.w3.org/International/
>http://www.w3.org/International/geo/
>
>W3C Internationalization FAQs
>http://www.w3.org/International/questions.html
>RSS feed: http://www.w3.org/International/questions.rss
>
>
>
> > -----Original Message-----
> > From: public-i18n-geo-request@w3.org
> > [mailto:public-i18n-geo-request@w3.org] On Behalf Of Richard Ishida
> > Sent: 17 November 2003 07:59
> > To: 'Deborah Cawkwell'; public-i18n-geo@w3.org
> > Subject: RE: UTF-8 BOM FAQ
> >
> >
> >
> > Thanks Deborah !
> >
> > I have updated the FAQ online with the suggested changes.  I
> > think this is quite a lot better.  I haven't added the
> > background stuff yet though. There are three main reasons: a.
> > it seems rather long, b. I'm not sure whether such detail is
> > appropriate in *this* FAQ, c. I haven't had the opportunity
> > to read it in detail yet.  Anyone else reading this, please
> > send in your thoughts!
> >
> > I left in change marks so people can see what's different.
> >
> > Wrt "which browsers display the extra line when encountering
> > the UTF-8 BOM" try Netscape 4.8.
> >
> > RI
> >
> >
> > > -----Original Message-----
> > > From: public-i18n-geo-request@w3.org
> > > [mailto:public-i18n-geo-request@w3.org] On Behalf Of
> > Deborah Cawkwell
> > > Sent: 16 November 2003 22:18
> > > To: public-i18n-geo@w3.org
> > > Subject: UTF-8 BOM FAQ
> > >
> > >
> > >
> > > Hi All
> > >
> > > Apologies that the FAQ is in text.
> > >
> > > I have slanted the answer differently following feedback.
> > > - Following the conference call, I have not yet identified
> > > which browsers display the extra line when encountering the
> > UTF-8 BOM.
> >
> > Netscape 4.8 does
> >
> > > - Re removing the BOM, we have found no problem re-opening
> > > the file (in Notepad, which is the only text editor I know
> > > that displays readable text, eg Persian).
> > > - I feel strongly about including the UTF-8 table because it
> > > clarifies so much (for me anyway).
> > >
> > > For the next three days, I will be out of the office.
> > >
> > > For those on the conference call (& anyone else), a bit more
> > > localised information about UK Bonfire night and Guy Fawkes:
> > > http://www.bbc.co.uk/dna/h2g2/A199488.
> > >
> > > Deborah
> > >
> > > ------------------------------------
> > >
> > > FAQ: Unexpected blank lines or characters with UTF-8 encoding
> > > question - background - answer - by the way - useful links
> > >
> > > Question
> > > When I'm using a UTF-8 encoding, why does an extra line
> > > appear at the top of my web page, and how do I remove it?
> > >
> > > Answer
> > > See the Background information.
> > >
> > > This may be caused by the presence of a UTF-8 signature at
> > > the beginning of the file, which the user agent doesn't
> > > recognize. Note that a number of more recent browsers, such
> > > as the latest Windows-based versions of Internet Explorer,
> > > Mozilla (Netscape) and Opera, do not exhibit this behaviour.
> > >
> > > You may not be able to see the cause of the extra line or
> > > space in your editor, if it interprets UTF-8 correctly. An
> > > editor which does not interpret UTF-8 correctly, displays the
> > > UTF-8 signature according to its own character encoding
> > > setting. With the Latin 1 ISO 8859-1 character encoding, the
> > > signature displays as extraneous characters 鍮信 With a
> > > binary editor capable of displaying the hexadecimal byte
> > > values in the file, the UTF-8 signature displays as EF BB BF.
> > >
> > > To remove the extra line or spaces that appear in the
> > > browser, remove the extraneous characters, which represent
> > > the UTF-8 signature. You can remove them manually or with a
> > > script. One of the benefits of using a script is that you can
> > > remove the extraneous characters from multiple files.
> > >
> > > You should check thoroughly the result of removing the
> > > signature, bearing in mind that pages with a high proportion
> > > of Latin characters may look correct superficially, but that
> > > characters outside the ASCII compatibility range (U+0000 to
> > > U+007F) may be incorrectly encoded.
> > >
> > > If there is no evidence of a UTF-8 signature at the beginning
> > > of the file, then your problem lies elsewhere.
> > >
> > >
> > > Background
> > >
> > > An editor that does not correctly interpret Unicode
> > > (encodings: UTF-8, UTF-16, UTF-32) recognises each byte as
> > > referring to one character (some editors may assume two-bytes
> > > per character); the character referred to by that byte value
> > > depends on the encoding assumed by the editor. An editor that
> > > does correctly interpret UTF-8 recognises that a character
> > > reference can require 1-4 bytes. In UTF-8 encoding, the
> > > number of bytes used to refer to a Unique Scalar Value in the
> > > Unicode repertoire is determined by the first byte.
> > >
> > >     Unicode character                               UTF-8
> > > byte 1      UTF-8 byte 2    UTF-8 byte 3    UTF-8 byte 4
> > >
> > >     0000 to 007F (ASCII)                    01xxxxxx
> > >     0080 to 07FF
> > > 110xxxxx                    10xxxxxx
> > >     0800 to FFFF
> > > 1110xxxx                    10xxxxxx                        10xxxxxx
> > >     10000 to 10FFFF                         11110xxx
> > >             10xxxxxx                        10xxxxxx
> > >             10xxxxxx
> > >
> > > All Unicode characters encoded in UTF-8, which fall outside
> > > the ASCII compatibility range (0 to 127 decimal), have byte
> > > values greater than 127 decimal, which explains why they
> > > display as 'strange' characters in non-UTF-8 compliant
> > > editors and browsers.
> > >
> > > The UTF-8 signature is also known as the Unicode UTF-8 Byte
> > > Order Mark (BOM). UTF-8 is one encoding of the Unicode
> > > repertoire. Others include UTF-16 and UTF-32. All three
> > > encodings encode the same Unicode character repertoire, but
> > > they differ in the sequence of byte values which refer to
> > > that repertoire.
> > >
> > > - UTF-8 uses 1-4 bytes; the first byte of each encoded
> > > character determines how many subsequent bytes in the
> > > sequence are required.
> > > - UTF-16 uses 2 or 4 bytes; if more than two bytes are
> > > required, then the first two bytes refer to a reserved value
> > > in the Unicode repertoire, which indicate that the next two
> > > bytes should be used to obtain the final value.
> > > - UTF-32 always uses four bytes.
> > >
> > > All three Unicode encodings (UTF-8, UTF-16, UTF-32) can use
> > > the signature or BOM (Byte Order Mark). Whilst with UTF-8,
> > > the BOM serves one purpose as a 'signature' of Unicode, for
> > > UTF-16 and UTF-32, the BOM has further purpose. That purpose
> > > is to indicate the order in which the bytes should be read.
> > > This order varies according to processor architecture, which
> > > can be 'big' or 'little' 'endian':
> > >
> > >     - Macs - Motorola, PowerPC = big endian
> > >     - PCs - Intel = little endian
> > >     - UNIX - different processors, therefore big or little endian
> > >
> > > There are other Unicode encodings which do not require the
> > > presence of the BOM; these are "UTF-16LE", "UTF-16-BE",
> > > "UTF-32LE" and "UTF-32BE".
> > >
> > > By the way
> > > You will find that Windows Notepad and Helios Textpad will
> > > automatically add a UTF-8 signature to any file you save as UTF-8.
> > >
> > > A UTF-8 signature at the beginning of a CSS file can
> > > sometimes cause the initial rules in the file to fail on
> > > certain user agents.
> > >
> > > Cutting and pasting UTF-8 text between different applications
> > > can have unexpected results, even if both applications are
> > > nominally UTF-8 aware.
> > >
> > > Useful links
> > > Unicode FAQ about the Byte Order Mark:
> > > http://www.unicode.org/unicode/faq/utf_bom.html
> > >
> > > Microsoft
> > > documentation about the Byte Order
> > > Markhttp://msdn.microsoft.com/library/default.asp?url=/library
> > /en-us/intl/unicode_42jv.asp
> > >
> > > Apache content negotiation documentation:
> > > http://httpd.apache.org/docs/content-negotiation.html
> > >
> > >
> > >
> > > BBCi
> > > at http://www.bbc.co.uk/
> > >
> > > This e-mail (and any attachments) is confidential and may
> > > contain personal views which are not the views of the BBC
> > > unless specifically stated. If you have received it in error,
> > > please delete it from your system.
> > > Do not use, copy or disclose the information in any way nor
> > > act in reliance on it and notify the sender immediately.
> > > Please note that the BBC monitors e-mails sent or received.
> > > Further communication will signify your consent to this.
> > >
> >
Received on Monday, 17 November 2003 23:26:30 UTC