- From: Jungshik Shin <jshin@i18nl10n.com>
- Date: Thu, 6 Nov 2003 22:54:09 +0900 (KST)
- To: duerst@w3.org
- Cc: public-i18n-geo@w3.org
Martin Duerst wrote:
> At 23:39 03/11/05 +0900, Jungshik Shin wrote:
>
>> On Wed, 5 Nov 2003, Martin Duerst wrote:
>>
>> > It can even be typed directly, as:
>> >
>> > prompt> perl -pi~ -0777 -e "s/^\xEF\xBB\xBF//s;" filewithbom.html
>>
>> Well, this doesn't work with Perl 5.6 or later because in Perl 5.6
>> or later, the native representation of characters is UTF-8.
I stand corrected. I did experiment with both my script and yours
under a UTF-8 locale and under the C/POSIX locale, but a subtle bug with
's//' in Perl 5.8.0 led me to the wrong conclusion.
> It would very much surprise me if there were no way to say
> inside a perl program that input and output should be treated
> as binary.
Phew, it turned out to be quite complicated. 'use bytes' and
'binmode' are supposed to do the trick, but somehow I couldn't
make them work in Perl 5.8.0.
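
For the FAQ, here is a minimal sketch of what the 'binmode' route could
look like on a Perl where binmode behaves as documented. It has not been
checked against Perl 5.8.0, where I couldn't make it work. The file names
in.html and out.html are made up for the illustration.

```shell
# Create a small test file that starts with the UTF-8 BOM (bytes EF BB BF).
printf '\357\273\277<p>hi</p>\n' > in.html

# Read the file as raw bytes, drop a leading BOM, write raw bytes to stdout.
perl -e '
  open(my $in, "<", $ARGV[0]) or die "open: $!";
  binmode($in);       # treat input as bytes, whatever the locale says
  binmode(STDOUT);    # same for output
  undef $/;           # slurp mode: read the whole file in one go
  my $data = <$in>;
  $data =~ s/^\xEF\xBB\xBF//;   # only matches at the very start of the data
  print $data;
' in.html > out.html

# Dump the result in hex to confirm the first three bytes are gone.
od -An -tx1 out.html
```

Unlike the '-i' one-liners, this writes to a separate file and leaves the
input untouched.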
Anyway, I knew there was a change made between 5.8.0 and 5.8.1,
but it was different from what I had thought. See
http://dev.perl.org/perl5/news/2003/perl-5.8.1.html#Core_Enhancements
for details. In short, unless you explicitly ask for 'UTF-8 file I/O',
Perl 5.8.1 (or later) does not use it.
Your script should work except when Perl 5.8.0 is run
under a UTF-8-based locale, or when Perl 5.8.1 (or later) is run with the
PERL_UNICODE environment variable defined.
The most version-independent and locale-independent recipe (on Unix) is
the following 'one-liner'. (The '\' is a line continuation, so it can
be removed if you type the command on a single line. Most people on this
list may be aware of that, but this is for the FAQ.)
prompt> env LC_ALL=C PERL_UNICODE= \
perl -pi~ -0777 -e "s/^\xEF\xBB\xBF//s;" filename.html
LC_ALL=C is necessary because the majority of Linux distributions still
have Perl 5.8.0 and many Linux users nowadays use UTF-8-based
locales.
>> Even in
>> earlier Perl, it has a problem of removing U+FEFF at places other than
>> the very beginning of files.
>
>
> No, that's what the -0777 option is for, which makes the
> whole file being treated as a single line.
Sorry, I didn't know that. That's nice to know. I gave a brief thought
to changing the line delimiter inside a script (to get the same effect
as -0777), but it seemed simpler to just use a '$. == 1'
condition. However, '-0777' is certainly handy.
Jungshik
Received on Thursday, 6 November 2003 08:54:11 UTC