Re: New FAQ: Removing UTF-8 BOM from Jungshik Shin on 2003-11-05 (public-i18n-geo@w3.org from November 2003)

From: Jungshik Shin <jshin@i18nl10n.com>
Date: Wed, 5 Nov 2003 23:33:18 +0900 (KST)
To: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
Cc: public-i18n-geo@w3.org
Message-ID: <Pine.LNX.4.58.0311052210270.12721@jshin.net>

On Wed, 5 Nov 2003, Deborah Cawkwell wrote:

> Comments on the draft FAQ below are welcomed.

Thanks for a new FAQ. Here are my comments.

> interpretation of the file's contents, because each character in the file
> is composed of pairs of bytes of data.  Also, the order in which these

  Instead of 'pairs of bytes of data', 'two or four bytes' would be better.
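
  As a rough illustration of the 'two or four bytes' point, assuming the
quoted passage is about UTF-16 and that an iconv supporting UTF-16BE is
installed: the helper name utf16_bytes below is made up for this sketch.
It counts the bytes a character takes once converted to UTF-16.

```shell
# utf16_bytes (hypothetical helper): read UTF-8 on stdin and print the
# number of bytes the same text occupies in UTF-16BE (no BOM is emitted).
utf16_bytes() {
    iconv -f UTF-8 -t UTF-16BE | wc -c | tr -d '[:space:]'
}

# U+0041 'A' is in the Basic Multilingual Plane: one 16-bit code unit.
bmp=$(printf 'A' | utf16_bytes)

# U+10000 (UTF-8 bytes F0 90 80 80) needs a surrogate pair: two code units.
supp=$(printf '\360\220\200\200' | utf16_bytes)

echo "U+0041: $bmp bytes, U+10000: $supp bytes"
```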

> The HTTP header charset declaration, or HTML charset declaration (in the
> absence of the HTTP header charset declaration, which takes precedence),
> should normally be used to indicate the encoding. Therefore, if your UTF-8
> encoded web page displays an  unwanted blank line at the top, and you
> have an editor capable of displaying the Unicode BOM as described above,
> you should remove from the beginning of the file the three characters
> displayed as ï»¿.

 I'm not sure whether you intentionally sent your email in ISO-8859-1 or
meant to send it in UTF-8 but forgot to set the character encoding.
How the UTF-8 BOM is displayed depends on several factors, and it is
displayed as 'ï»¿' only when your editor assumes that your UTF-8 file is
in ISO-8859-1.  Depending on the editor, it can be risky to pretend that
a UTF-8 file is in ISO-8859-1, because content can be lost.
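
 To see where 'ï»¿' comes from, here is a small sketch, assuming iconv
is available and your terminal displays UTF-8. It takes the three BOM
bytes (EF BB BF), deliberately misreads them as ISO-8859-1, and converts
the result to UTF-8 for display. The variable name is only illustrative.

```shell
# The UTF-8 BOM is the byte sequence EF BB BF (octal 357 273 277).
# Read as ISO-8859-1, each byte becomes its own character:
#   EF -> U+00EF, BB -> U+00BB, BF -> U+00BF
bom_as_latin1=$(printf '\357\273\277' | iconv -f ISO-8859-1 -t UTF-8)

# On a UTF-8 terminal this shows the three characters from the draft FAQ.
printf '%s\n' "$bom_as_latin1"
```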

> Alternatively, you can use a Perl script to remove the
> characters. [Richard - what extra benefit do you gain from this, rather
> than simply deleting.]

  Using Perl or any other scripting or command-line tool enables you
to remove the BOM in batch. If you have a lot of files to fix, fixing them
all manually is rather tedious. For instance, with 'sed', you can do

   sed '1 s/^...//' filename > filename.new ; mv filename.new filename

(some versions of sed have an '-i' option for in-place editing, so that you can just do
  sed -i -e '1 s/^...//' filename
)

assuming that sed is not yet properly i18nized to understand UTF-8, so
that each '.' matches a single byte (with a UTF-8-aware sed, running it
under LC_ALL=C has the same effect).
Obviously, you should run the above only on files you're sure have a BOM
at the beginning. To work around that problem in sh, you can wrap
the above in a test like the following, and put it in a for loop (or a
filter like find .... | xargs) to handle many files

 # hexdump prints a 7-digit offset, then 16-bit words in host byte order
 if head -1 filename | hexdump |
       grep -Eq '^0000000 (bbef ..bf|efbb bf)'; then
    .....
 fi
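
A minimal sketch that combines the check and the removal and does not
depend on hexdump's byte order. It assumes a POSIX sh with od and tail.
The function names has_bom and strip_bom are invented for this example.

```shell
#!/bin/sh
# has_bom FILE: succeed only if FILE starts with the bytes EF BB BF.
has_bom() {
    [ "$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')" = "efbbbf" ]
}

# strip_bom FILE: drop the first three bytes, but only if they are a BOM.
# Files without a BOM are left untouched.
strip_bom() {
    if has_bom "$1"; then
        # tail -c +4 copies everything from the fourth byte onwards
        tail -c +4 "$1" > "$1.new" && mv "$1.new" "$1"
    fi
}

# Batch use: process every file named on the command line.
for f in "$@"; do
    strip_bom "$f"
done
```

Saved under a name of your choice (say strip_bom.sh, a hypothetical
name), it could be run as 'sh strip_bom.sh *.html'.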

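With Perl (5.6 or later), it'd be

  perl -pi -e '(1 == $.) && s/^\x{FEFF}//' filename

(In Perl 5.8 and later this matches only if perl decodes the input as
UTF-8, e.g. with the -CSD switch; otherwise use the byte form below.)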

With Perl (earlier than 5.6), it'd be

  perl -pi -e '(1 == $.) && s/^\xEF\xBB\xBF//' filename

  Jungshik

Received on Wednesday, 5 November 2003 09:33:31 UTC