- From: <touch@isi.edu>
- Date: Wed, 18 Sep 1996 11:17:05 -0700
- To: mcurts@mail.telis.org
- Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
> Can you help me? I'm looking for a utility that will remove HTTP codes
> from documents and convert them to plain ASCII.
> --
> Mark Curts
Well, in case anyone else is interested, here is a Perl script I wrote
to remove HTML from text (HTTP is the protocol; I presume you meant
HTML):
#!/local/new/bin/perl
# J. Touch
# USC/ISI
# 8/96
# Removes HTML codes from text input
while ($line = <>) {
    # "eatfront" is set when a code spans multiple lines

    # if already inside HTML code...
    if ($eatfront == 1) {
        # delete through the terminator ">", if found
        if ($line =~ s/^[^>]*>//o) {
            $eatfront = 0;
        } else {
            # otherwise delete all and keep looking on next line
            $line = "";
        }
    }

    # eat everything inside "<>"'s
    $line =~ s/<[^>]*>//go;

    # if there is a < without a matching >, eat it and keep looking
    if ($line =~ s/<[^>]*$//o) {
        $eatfront = 1;
    }

    print $line;
}
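
For anyone who wants to reuse or test the logic above, the following is
a minimal sketch of the same line-by-line approach wrapped in a
subroutine. The name strip_tags, the $in_tag flag (standing in for
$eatfront), and the sample input are illustrative assumptions, not part
of the original script. Like the script, it treats every "<" as the
start of a tag, so a bare "<" in ordinary text is also eaten up to the
next ">".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags: hypothetical helper, not part of the original script.
# Takes a list of input lines and returns them with <...> tags removed.
sub strip_tags {
    my @lines  = @_;   # copy, so the caller's data is never modified
    my $in_tag = 0;    # true while inside a tag that spans lines
    my @out;

    for my $line (@lines) {
        if ($in_tag) {
            # delete through the closing ">", if this line has one
            if ($line =~ s/^[^>]*>//) {
                $in_tag = 0;
            } else {
                # still inside the tag: drop the whole line
                $line = "";
            }
        }

        # remove every complete <...> on this line
        $line =~ s/<[^>]*>//g;

        # an unclosed "<" opens a multi-line tag: drop the rest of the
        # line (including its newline) and keep looking on the next one
        if ($line =~ s/<[^>]*$//) {
            $in_tag = 1;
        }

        push @out, $line;
    }
    return @out;
}

# Sample input (assumed for illustration): an <a> tag split over two lines.
print strip_tags("<p>Hello <a\n", "href=\"x.html\">world</a></p>\n");
```

Run as written, the last statement prints "Hello world" followed by a
newline. Because the newline after an unclosed "<" is removed along with
the tag, text on either side of a multi-line tag ends up on one output
line, which matches the behavior of the script above.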
----------------------------------------------------------------------
Joe Touch - touch@isi.edu http://www.isi.edu/~touch/
ISI / Project Leader, ATOMIC-2, LSAM http://www.isi.edu/atomic2/
USC / Research Assistant Prof. http://www.isi.edu/lsam/
Received on Wednesday, 18 September 1996 11:23:35 UTC