RE: New FAQ: Removing UTF-8 BOM from Deborah Cawkwell on 2003-11-05 (public-i18n-geo@w3.org from November 2003)

From: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
Date: Wed, 5 Nov 2003 15:35:44 -0000
To: "Jungshik Shin" <jshin@i18nl10n.com>
Cc: <public-i18n-geo@w3.org>
Message-ID: <418B7E44473AC34488C9E730D09FF3CF0127E9FF@bbcxue204.bu.bbc.co.uk>

Agreed "two or four bytes" is better.

Re the Perl script, batching makes sense. I was so locked into thinking about our process, which is mostly automated, where we only remove the BOM from a few hand-coded elements, such as navigation, that I hadn't considered multiple BOM-ed files.  

-----Original Message-----
From: Jungshik Shin [mailto:jshin@i18nl10n.com] 
Sent: 05 November 2003 14:33
To: Deborah Cawkwell
Cc: public-i18n-geo@w3.org
Subject: Re: New FAQ: Removing UTF-8 BOM

On Wed, 5 Nov 2003, Deborah Cawkwell wrote:

> Comments on the draft FAQ below are welcomed.

Thanks for a new FAQ. Here are my comments.

> interpretation of the file's contents, because each character in the file
> is composed of pairs of bytes of data.  Also, the order in which these

  Instead of 'pairs of bytes of data', 'two or four bytes' would be better.

> The HTTP header charset declaration, or HTML charset declaration (in the
> absence of the HTTP header charset declaration, which takes precedence),
> should normally be used to indicate the encoding. Therefore, if your UTF-8
> encoded web page displays an  unwanted blank line at the top, and you
> have an editor capable of displaying the Unicode BOM as described above,
> you should remove from the beginning of the file the three characters
> displayed as ï»¿.

 I'm not sure if you intentionally sent your email in ISO-8859-1 or
meant to send it in UTF-8 but forgot to set the character encoding.
How the UTF-8 BOM is displayed depends on several factors, and it's
displayed as 'ï»¿' only when your 'editor' assumes your UTF-8 file is
in ISO-8859-1.  Depending on the editor, it could be risky to claim that
a UTF-8 file is in ISO-8859-1, because content can be lost.
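
To make the point above concrete, here is a minimal sketch of why the BOM
shows up as 'ï»¿'. It assumes a POSIX shell with printf, od, tr and iconv
available; the function names bom_hex and bom_as_latin1 are hypothetical
and used only for illustration.

```shell
# The UTF-8 BOM is the three bytes EF BB BF (octal 357 273 277).
bom_hex() {
    printf '\357\273\277' | od -An -tx1 | tr -d ' \n'
}

# Treat those three bytes as ISO-8859-1 characters and re-encode them
# as UTF-8, which is what a viewer assuming ISO-8859-1 effectively does.
bom_as_latin1() {
    printf '\357\273\277' | iconv -f ISO-8859-1 -t UTF-8
}
```

Under those assumptions, bom_hex prints efbbbf, and bom_as_latin1 writes the
three characters U+00EF, U+00BB and U+00BF, which a UTF-8 terminal renders
as ï»¿. An editor that decodes the file as UTF-8 would normally not show
them as visible characters at all.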

> Alternatively, you can use a Perl script to remove the
> characters. [Richard - what extra benefit do you gain from this, rather
> than simply deleting.]

  Using Perl or any other scripting or command-line tool enables you
to remove the BOM in batch. If you have a lot of files to fix, fixing them
all manually is rather tedious. For instance, with 'sed', you can do

   sed '1 s/^...//' filename > filename.new ; mv filename.new filename

(some versions of sed have '-i'  - in-place editing - option so that you can just do
  sed -i -e '1 s/^...//' filename
)

This assumes that sed treats the file as bytes (that is, it is not yet
i18nized to understand UTF-8), so that '...' matches the three BOM bytes.
Obviously, you should run the above only on files you're sure have a BOM
at the beginning. To guard against that in sh, you can wrap the command
in a test like the following, inside a for loop (or a filter
like find .... | xargs):

 if [ "x`head -1 filename | hexdump |\
       egrep '^0000000 (bbef ..bf|efbb bf)'`" != 'x' ]; then
    .....
 fi
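
The check and the removal described above can be combined into one
function and called from a loop. This is only a sketch under stated
assumptions: it uses POSIX od and tail instead of hexdump and sed, the
function name strip_bom and the ".new" temporary suffix are hypothetical,
and replacing the file with mv may not preserve its original permissions.

```shell
# strip_bom FILE: remove a leading UTF-8 BOM (EF BB BF) from FILE,
# leaving files without a BOM untouched.
strip_bom() {
    # Read the first three bytes as a hex string, e.g. "efbbbf".
    first=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    if [ "$first" = "efbbbf" ]; then
        # Copy everything from the fourth byte onwards, then
        # replace the original file with the stripped copy.
        tail -c +4 "$1" > "$1.new" && mv "$1.new" "$1"
    fi
}
```

For batch use, something like

   for f in *.html; do strip_bom "$f"; done

would apply it to every matching file in the current directory. Because
the test compares bytes, it does not depend on whether the local tools
understand UTF-8.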

With Perl (5.6 or later), provided the input is decoded as UTF-8, it'd be

  perl -pi -e '(1 == $.) && s/^\x{FEFF}//' filename

With Perl earlier than 5.6 (or whenever the file is read as raw bytes), it'd be

  perl -pi -e '(1 == $.) && s/^\xEF\xBB\xBF//' filename

  Jungshik

Received on Wednesday, 5 November 2003 10:36:25 UTC