Re: New FAQ: Removing UTF-8 BOM from Tex Texin on 2003-11-05 (public-i18n-geo@w3.org from November 2003)

From: Tex Texin <tex@i18nguy.com>
Date: Wed, 05 Nov 2003 09:56:25 -0500
To: Jungshik Shin <jshin@i18nl10n.com>
Cc: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>, public-i18n-geo@w3.org
Message-ID: <3FA90F99.B1F828FA@i18nguy.com>
Hi Jungshik,

1) Yes, UTF-16 uses pairs of bytes and UTF-32 uses quadruplets.
2) Yes, the characters will display differently, depending on the encoding and
font the editor uses.
Maybe we should use a graphic to show the mistreatment(s).
Also, although they are being mistreated as characters, we should refer to them
as bytes, since they do not represent characters.
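
As a rough illustration of the bytes-versus-characters point, the sketch below
prints the UTF-8 BOM once as raw hex and once the way a viewer assuming
ISO-8859-1 would render it. The function names are made up for this example,
and it assumes a POSIX sh with od, plus an iconv that accepts the names
ISO-8859-1 and UTF-8.

```shell
#!/bin/sh
# Hypothetical helper names, for illustration only.

# Emit the UTF-8 BOM as three raw bytes (octal escapes for EF BB BF).
bom_bytes() { printf '\357\273\277'; }

# Hex view: the bytes themselves, independent of any editor's assumptions.
bom_hex() { bom_bytes | od -An -tx1 | tr -d ' \t\n'; }

# What a viewer assuming ISO-8859-1 would show: each byte is treated as one
# character, then re-encoded as UTF-8 so a UTF-8 terminal can display it.
bom_as_latin1() { bom_bytes | iconv -f ISO-8859-1 -t UTF-8; }

bom_hex; echo          # prints efbbbf
bom_as_latin1; echo    # prints ï»¿ on a UTF-8 terminal
```

The hex form is the same everywhere, while the second form only appears when
the bytes are misread as ISO-8859-1.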

3) For the FAQ we shouldn't use scripts that look "something like..." or have
too many version dependencies, so we can't use the sed script.
Also, thanks for pointing out the problem with the Perl script in your other
mail.
If a script is not safe and reliable, we shouldn't put it in the FAQ at all.
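
For discussion only, here is a sketch of a check that sidesteps the sed and
Perl version questions by looking at the first three bytes directly and
removing them only when they are EF BB BF. The name strip_utf8_bom and the
.nobom temporary suffix are placeholders, and it assumes a POSIX sh with od,
tail and mv; it would need checking on each platform before going anywhere
near the FAQ.

```shell
#!/bin/sh
# strip_utf8_bom FILE
# Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) from FILE
# if, and only if, the file actually starts with one.
strip_utf8_bom() {
    bom_file=$1
    # Read at most the first three bytes and format them as hex, e.g. "efbbbf".
    bom_first=$(od -An -tx1 -N3 "$bom_file" | tr -d ' \t\n')
    if [ "$bom_first" = "efbbbf" ]; then
        # Copy everything from byte 4 onward, then replace the original file.
        tail -c +4 "$bom_file" > "$bom_file.nobom" &&
            mv "$bom_file.nobom" "$bom_file"
    fi
}
```

A batch run could then be a plain loop, e.g.
for f in *.html; do strip_utf8_bom "$f"; done
Note that the mv replaces the file, so permissions and hard links of the
original are not preserved.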

tex

Jungshik Shin wrote:
> 
> On Wed, 5 Nov 2003, Deborah Cawkwell wrote:
> 
> > Comments on the draft FAQ below are welcomed.
> 
> Thanks for a new FAQ. Here are my comments.
> 
> > interpretation of the file's contents, because each character in the file
> > is composed of pairs of bytes of data.  Also, the order in which these
> 
>   Instead of 'pairs of bytes of data', 'two or four bytes' would be better.
> 
> > The HTTP header charset declaration, or HTML charset declaration (in the
> > absence of the HTTP header charset declaration, which takes precedence),
> > should normally be used to indicate the encoding. Therefore, if your UTF-8
> > encoded web page displays an unwanted blank line at the top, and you
> > have an editor capable of displaying the Unicode BOM as described above,
> > you should remove from the beginning of the file the three characters
> > displayed as ï»¿.
> 
>  I'm not sure whether you intentionally sent your email in ISO-8859-1 or
> meant to send it in UTF-8 but forgot to set the character encoding.
> How the UTF-8 BOM is displayed depends on several different factors, and it's
> displayed as 'ï»¿' only when your 'editor' assumes your UTF-8 file is
> in ISO-8859-1.  Depending on the editor, it could be risky to claim that a
> UTF-8 file is in ISO-8859-1, because content can be lost.
> 
> > Alternatively, you can use a Perl script to remove the
> > characters. [Richard - what extra benefit do you gain from this, rather
> > than simply deleting.]
> 
>   Using Perl or any other scripting or command-line tool enables you
> to remove the BOM in batch. If you have a lot of files to fix, fixing them
> all manually is rather tedious. For instance, with 'sed', you can do
> 
>    sed '1 s/^...//' filename > filename.new ; mv filename.new filename
> 
> (some versions of sed have a '-i' (in-place editing) option, so that you
> can just do
>   sed -i -e '1 s/^...//' filename
> )
> 
> assuming that sed is not yet properly i18nized to understand UTF-8, so
> that each '.' matches one byte.
> Obviously, you should run the above only on files you're sure have a BOM
> at the beginning. To work around that problem in sh, you can enclose
> the above in something like the following, together with a for loop (or a
> filter like find .... | xargs)
> 
>  # true only if the first three bytes of the file are EF BB BF
>  if [ "`od -An -tx1 -N3 filename | tr -d ' \n'`" = 'efbbbf' ]; then
>     .....
>  fi
> 
> With Perl (5.6 or later), it'd be
> 
>   perl -pi -e '(1 == $.) && s/^\x{FEFF}//' filename
> 
> With Perl (earlier than 5.6), it'd be
> 
>   perl -pi -e '(1 == $.) && s/^\xEF\xBB\xBF//' filename
> 
>   Jungshik

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Wednesday, 5 November 2003 09:57:19 UTC