- From: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
- Date: Wed, 5 Nov 2003 15:35:44 -0000
- To: "Jungshik Shin" <jshin@i18nl10n.com>
- Cc: <public-i18n-geo@w3.org>
Agreed "two or four bytes" is better.

Re the Perl script, batching makes sense. I was so locked into thinking
about our process, which is mostly automated, where we only remove the
BOM from a few hand-coded elements, such as navigation, that I hadn't
considered multiple BOM-ed files.

-----Original Message-----
From: Jungshik Shin [mailto:jshin@i18nl10n.com]
Sent: 05 November 2003 14:33
To: Deborah Cawkwell
Cc: public-i18n-geo@w3.org
Subject: Re: New FAQ: Removing UTF-8 BOM

On Wed, 5 Nov 2003, Deborah Cawkwell wrote:

> Comments on the draft FAQ below are welcomed.

Thanks for a new FAQ. Here are my comments.

> interpretation of the file's contents, because each character in the file
> is composed of pairs of bytes of data. Also, the order in which these

Instead of 'pairs of bytes of data', 'two or four bytes' would be better.

> The HTTP header charset declaration, or HTML charset declaration (in the
> absence of the HTTP header charset declaration, which takes precedence),
> should normally be used to indicate the encoding. Therefore, if your UTF-8
> encoded web page displays an unwanted blank line at the top, and you
> have an editor capable of displaying the Unicode BOM as described above,
> you should remove from the beginning of the file the three characters
> displayed as ï»¿.

I'm not sure if you intentionally sent your email in ISO-8859-1 or meant
to send it in UTF-8 but forgot to set the character encoding. How the
UTF-8 BOM is displayed depends on several factors, and it's displayed as
'ï»¿' only when your 'editor' assumes your UTF-8 file is in ISO-8859-1.
Depending on the editor, it could be risky to pretend that a UTF-8 file
is in ISO-8859-1, because content can be lost.

> Alternatively, you can use a Perl script to remove the
> characters. [Richard - what extra benefit do you gain from this, rather
> than simply deleting.]

Using Perl or any other scripting or command-line tool lets you remove
the BOM in batch.
If you have a lot of files to fix, fixing them all manually is rather
tedious. For instance, with 'sed', you can do

  sed '1 s/^...//' filename > filename.new ; mv filename.new filename

(some versions of sed have an '-i' option, for in-place editing, so that
you can just do

  sed -i -e '1 s/^...//' filename

) assuming that sed is not yet properly i18nized to understand UTF-8,
so that each '.' matches one byte rather than one character.

Obviously, you should run the above only on files you're sure have a BOM
at the beginning. To work around that problem in sh, you can wrap the
above in something like the following, together with a for loop (or a
filter like find .... | xargs):

  if [ "x`head -1 filename | hexdump |\
       egrep '^0000000 (bbef ..bf|efbb bf)'`" != 'x' ]; then
     .....
  fi

With Perl (5.6 or later), where the input is being read as UTF-8, it'd be

  perl -pi -e '(1 == $.) && s/^\x{FEFF}//' filename

With Perl (earlier than 5.6), or whenever the file is read as raw bytes,
it'd be

  perl -pi -e '(1 == $.) && s/^\xEF\xBB\xBF//' filename

Jungshik
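
As a rough sketch of the check-then-strip idea above (not something taken from the thread), a POSIX sh version might look like the following. The function names has_bom and strip_bom and the .nobom temporary suffix are made up for illustration. It uses od -N3 and tail -c +4 rather than hexdump and sed, on the assumption that comparing raw bytes sidesteps both byte order and locale.

```shell
#!/bin/sh
# has_bom FILE: succeed only if FILE starts with the bytes EF BB BF.
has_bom() {
    # od prints the first three bytes as hex; tr removes the padding.
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \t\n')" = "efbbbf" ]
}

# strip_bom FILE: drop the first three bytes, but only when they are a BOM.
# Files without a BOM are left untouched.
strip_bom() {
    if has_bom "$1"; then
        # tail -c +4 copies the file starting at its fourth byte.
        tail -c +4 "$1" > "$1.nobom" && mv "$1.nobom" "$1"
    fi
}
```

For a batch run one could then write something like: for f in *.html; do strip_bom "$f"; done. As with the sed ... ; mv approach, the rewritten file is a new file, so it may not keep the original's permissions or ownership.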
Received on Wednesday, 5 November 2003 10:36:25 UTC