Re: New FAQ: Removing UTF-8 BOM

From: Jungshik Shin <jshin@i18nl10n.com>
Date: Thu, 6 Nov 2003 22:54:09 +0900 (KST)
To: duerst@w3.org
Cc: public-i18n-geo@w3.org
Message-ID: <Pine.LNX.4.58.0311062146290.12721@jshin.net>

Martin Duerst wrote:
> At 23:39 03/11/05 +0900, Jungshik Shin wrote:
>
>> On Wed, 5 Nov 2003, Martin Duerst wrote:
>>
>> > It can even be typed directly, as:
>> >
>> > prompt>  perl -pi~ -0777 -e "s/^\xEF\xBB\xBF//s;" filewithbom.html
>>
>>   Well, this doesn't work with Perl 5.6 or later because in Perl 5.6
>> or later, the native representation of characters is UTF-8.

  I stand corrected.  I did experiment with both my script and yours
under a UTF-8 locale and under the C/POSIX locale, but a subtle bug with
's//' in Perl 5.8.0 led me to the incorrect conclusion.

> It would very much surprise me if there were no way to say
> inside a perl program that input and output should be treated
> as binary.

Phew, it turned out to be quite complicated.  'use bytes' and
'binmode' are supposed to do the trick, but somehow I couldn't
make them work in Perl 5.8.0.

Anyway, I knew there was a change made between 5.8.0 and 5.8.1,
but it was not the change I thought it was.  See

http://dev.perl.org/perl5/news/2003/perl-5.8.1.html#Core_Enhancements

for details.  In short, unless you explicitly ask for 'UTF-8 file I/O',
Perl 5.8.1 (or later) does not use it.

Your script should work except in two cases: when Perl 5.8.0 is run
under a UTF-8-based locale, and when Perl 5.8.1 is run with the
PERL_UNICODE environment variable defined.

The most version-independent and locale-independent recipe (on Unix) is
the following 'one-liner'.  (The '\' is a line continuation, so it can
be removed if you type the command on a single line.  Most people on this
list will be aware of that, but this is for the FAQ.)

prompt> env LC_ALL=C PERL_UNICODE= \
        perl -pi~ -0777 -e "s/^\xEF\xBB\xBF//s;" filename.html

LC_ALL=C is necessary because the majority of Linux distributions still
ship Perl 5.8.0 and many Linux users nowadays use UTF-8-based
locales.


>> Even in
>> earlier Perl, it has a problem of removing U+FEFF at places other than
>> the very beginning of files.
>
>
> No, that's what the -0777 option is for, which makes the
> whole file being treated as a single line.

   Sorry, I didn't know that. That's nice to know. I gave a brief thought
to changing the line delimiter inside a script (to get the same effect
as -0777), but it seemed simpler to me to just use a '$. == 1'
condition. Still, '-0777' is certainly handy.

   Jungshik
Received on Thursday, 6 November 2003 08:54:11 UTC
