- From: Jungshik Shin <jshin@i18nl10n.com>
- Date: Thu, 6 Nov 2003 22:54:09 +0900 (KST)
- To: duerst@w3.org
- Cc: public-i18n-geo@w3.org
Martin Duerst wrote:

> At 23:39 03/11/05 +0900, Jungshik Shin wrote:
>
>> On Wed, 5 Nov 2003, Martin Duerst wrote:
>>
>> > It can even be typed directly, as:
>> >
>> > prompt> perl -pi~ -0777 -e "s/^\xEF\xBB\xBF//s;" filewithbom.html
>>
>> Well, this doesn't work with Perl 5.6 or later because in Perl 5.6
>> or later, the native representation of characters is UTF-8.

I stand corrected. I did experiment with both my script and yours under
a UTF-8 locale and the C/POSIX locale, but a subtle bug with 's//' in
Perl 5.8.0 led me to the incorrect conclusion.

> It would very much surprise me if there were no way to say
> inside a perl program that input and output should be treated
> as binary.

Phew, it turned out to be quite complicated. 'use bytes' and 'binmode'
are supposed to do the trick, but somehow I couldn't make them work in
Perl 5.8.0. Anyway, I knew there was a change made between 5.8.0 and
5.8.1, but it was different from what I thought it had been. See

  http://dev.perl.org/perl5/news/2003/perl-5.8.1.html#Core_Enhancements

for details. In short, unless you explicitly ask for 'UTF-8 file I/O',
Perl 5.8.1 (or later) does not use it. Your script should work except
in two cases: when Perl 5.8.0 is run under a UTF-8-based locale, or
when Perl 5.8.1 is run with the PERL_UNICODE environment variable
defined.

The most version-independent and locale-independent recipe (on Unix) is
the following 'one liner'. (The '\' is a line continuation, so it can be
removed if you type the command on a single line. Most people on this
list may be aware of that, but this is for the FAQ....)

  prompt> env LC_ALL=C PERL_UNICODE= \
          perl -pi~ -0777 -e "s/^\xEF\xBB\xBF//s;" filename.html

LC_ALL=C is necessary because the majority of Linux distributions still
ship Perl 5.8.0 and many Linux users nowadays use UTF-8-based locales.

>> Even in
>> earlier Perl, it has a problem of removing U+FEFF at places other than
>> the very beginning of files.
>
> No, that's what the -0777 option is for, which makes the
> whole file being treated as a single line.

Sorry, I didn't know that. That's nice to know. I gave a brief thought
to changing the line delimiter inside a script (to get the same effect
as -0777), but it seemed simpler to me to just use a '$. == 1'
condition. However, '-0777' is certainly handy.

Jungshik
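
For readers who want to cross-check the result without depending on
Perl's version or locale behaviour, the byte-level effect of the
one-liner can be sketched in Python. This is only an illustration, not
the method discussed in this thread: the function names
strip_leading_bom and strip_bom_in_place are made up for the example,
and unlike 'perl -i~' the sketch writes a backup only when a BOM is
actually found.

```python
import codecs


def strip_leading_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF).

    Only the very first three bytes are examined, so any EF BB BF
    sequence later in the data is left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_place(path: str, backup_suffix: str = "~") -> bool:
    """Remove a leading BOM from the file at path.

    The file is read and written in binary mode, so no locale or
    encoding layer is involved. Returns True if a BOM was removed.
    Hypothetical helper: the original bytes are saved to
    path + backup_suffix only when the file is actually changed.
    """
    with open(path, "rb") as f:
        data = f.read()

    stripped = strip_leading_bom(data)
    if len(stripped) == len(data):
        return False  # no leading BOM, nothing to do

    with open(path + backup_suffix, "wb") as f:
        f.write(data)  # keep the original bytes as a backup
    with open(path, "wb") as f:
        f.write(stripped)
    return True
```

Calling strip_bom_in_place("filename.html") leaves the original bytes
in "filename.html~" when a BOM was present at the start of the file.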
Received on Thursday, 6 November 2003 08:54:11 UTC