Re: New FAQ: Removing UTF-8 BOM

Hello Jungshik, others,

First, Happy New Year.

This issue has been bugging me for quite a while, but I didn't
get around to writing to you last year.

At 22:54 03/11/06 +0900, Jungshik Shin wrote:

>Martin Duerst wrote:

> >> > prompt>  perl -pi~ -0777 -e "s/^\xEF\xBB\xBF//s;" filewithbom.html

>Anyway, I knew there was a change made between 5.8 and 5.8.1,
>but it was different from what I thought it had been.  See
>
>http://dev.perl.org/perl5/news/2003/perl-5.8.1.html#Core_Enhancements
>
>for details.  In short, unless you explicitly ask for 'UTF-8 file I/O',
>Perl 5.8.1 (or later) does not use it.
>
>Your script should work except when Perl 5.8 is run
>under UTF-8-based locale

I have not been able to verify this. I ran that script under
Perl 5.8.0 and did not notice any problems, even when running
under a UTF-8 locale.


>and when Perl 5.8.1 is run with PERL_UNICODE
>environment variable is defined.

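That indeed didn't work, presumably because Perl then decodes the
input and sees the BOM as the single character U+FEFF rather than
as three bytes. It worked when replacing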
     s/^\xEF\xBB\xBF//s;
with
     s/^\x{FEFF}//s;

One way to do things might be to just leave both of these
statements in. But some older versions of Perl might not
grok the latter.

I'm writing to the perl-unicode list to check this out in
more detail, and I'm copying you (but not the list).


>The most version-independent/locale-independent recipe (on Unix) is to use
>the following 'one liner' ('\' is for the line continuation so that it can
>be removed if you type it in a single line. Most people on this
>list may be aware of that, but this is for FAQ....)
>
>prompt> env LC_ALL=C PERL_UNICODE= \
>         perl -pi~ -0777 -e "s/^\xEF\xBB\xBF//s;" filename.html
>
>LC_ALL=C is necessary because the majority of Linux distributions still
>have Perl 5.8.0 and many Linux users nowadays use  UTF-8-based
>locales.

I think that the one-liner aspect is not that important; having
a single program that works with all versions is probably more
important. We want to give people something that they can use
very easily.

Regards,    Martin.

Received on Friday, 2 January 2004 16:09:41 UTC