Modified normalize_newlines to work on PC and Mac, too. from Martin Duerst on 2001-05-23 (www-validator@w3.org from May 2001)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 23 May 2001 09:02:14 +0900
To: www-validator@w3.org
Message-Id: <4.2.0.58.J.20010522234046.039a2390@sh.w3.mag.keio.ac.jp>

Hello everybody,

[It seems I had terrible trouble getting this right.
Many thanks to Terje for catching this!]

I'm not sure the details in this mail are appropriate for this
list; please just tell me if you get bored :-).

In sub normalize_newlines, the following two lines

   $file =~ s(\015\012){\n}g; # Turn ASCII CRLF into native newline.
   $file =~ s(\015)    {\n}g; # Turn ASCII CR   into native newline.

were meant to turn various line endings into native-convention newlines,
and they did indeed do so on Unix systems. But they did not do so on PCs
or Macs. Here is what happened:

Start   Mac     PC      Unix
CRLF    CR      CRCRLF  LF
CR      CR      CRLF    LF
LF      LF      LF      LF

desired CR      CRLF    LF

The desired result can be obtained by replacing the two lines above with

   $file =~ s(\015\012?|\012){\n}g; # Turn CRLF/CR/LF into native newline.

I have checked this change in, together with some tweaks to the comments
at the start of the subroutine.
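
As a quick check of the single substitution above, here is a minimal
sketch. The sample string and the variable names are made up for
illustration and are not taken from the validator source.

```perl
use strict;
use warnings;

# Hypothetical sample mixing all three line-ending conventions:
# CRLF after "one", bare CR after "two", bare LF after "three".
my $file = "one\015\012two\015three\012four";

# Same substitution as above: a CR optionally followed by LF, or a lone LF.
(my $normalized = $file) =~ s(\015\012?|\012){\n}g;

# After normalization, every line break is a single native newline.
my @lines = split /\n/, $normalized;
print scalar(@lines), " lines: @lines\n";   # prints "4 lines: one two three four"
```

Each of the three conventions ends up as exactly one line break, so the
sample splits into four lines.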

The above regular expression may puzzle some, but it works. It could
also be written (\015\012|\015|\012) or (\015\012|\012|\015)
[but beware of (\015|\015\012|\012) and similar; if you want
to know why, please read Jeffrey Friedl's Mastering Regular Expressions].
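
In short, Perl tries the alternatives from left to right and takes the
first one that matches, so a bare \015 listed first wins over \015\012.
The following sketch (test string and variable names are hypothetical)
shows the effect on a single CRLF:

```perl
use strict;
use warnings;

my $crlf = "a\015\012b";   # exactly one CRLF between "a" and "b"

# Longer alternative first (behaves like \015\012?|\012).
(my $good = $crlf) =~ s(\015\012|\015|\012){\n}g;

# Bare \015 first: it matches the CR alone, and the leftover LF
# is then matched as a second, separate line ending.
(my $bad = $crlf) =~ s(\015|\015\012|\012){\n}g;

# Count the native newlines in each result.
my $good_count = () = $good =~ /\n/g;
my $bad_count  = () = $bad  =~ /\n/g;
print "good: $good_count newline(s), bad: $bad_count newline(s)\n";
# prints "good: 1 newline(s), bad: 2 newline(s)"
```

With the wrong ordering, every CRLF in the input turns into two line
breaks, i.e. a spurious empty line.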


Of course, the whole subroutine, currently reading

sub normalize_newlines {
   my $file = shift;

   $file =~ s(\015\012?|\012){\n}g; # Turn CRLF/CR/LF into native newline.

   return [split /\n/, $file];
}

can be further simplified to read

sub normalize_newlines {
   my $file = shift;

   return [split /\015\012?|\012/, $file];
}

and then I guess further to

sub normalize_newlines {
   return [split /\015\012?|\012/, shift];
}

but once we are here, we might be able to get rid of the subroutine
altogether.
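
As a rough sketch of that last step, the split can simply be written
inline at the point of use. The name normalize_two_step and the sample
string below are hypothetical; the comparison only checks that the
inline split agrees with the two-step version quoted above.

```perl
use strict;
use warnings;

# Two-step version as quoted above: normalize, then split on native newline.
sub normalize_two_step {
   my $file = shift;
   $file =~ s(\015\012?|\012){\n}g;
   return [split /\n/, $file];
}

# Hypothetical input with mixed line endings.
my $file = "one\015\012two\015three\012four";

# Inline split, with no subroutine at all.
my @lines = split /\015\012?|\012/, $file;

# Compare the two results element by element via their joined form.
my $same = "@{ normalize_two_step($file) }" eq "@lines";
print $same ? "same result\n" : "different result\n";   # prints "same result"
```

Note that both forms drop trailing empty lines, because split discards
trailing empty fields unless it is given a negative limit.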


Regards,   Martin.

Received on Tuesday, 22 May 2001 20:03:00 UTC