
Re: Conversion Program

From: <touch@isi.edu>
Date: Wed, 18 Sep 1996 11:17:05 -0700
Message-Id: <199609181817.AA01838@ash.isi.edu>
To: mcurts@mail.telis.org
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
> Can you help me?  I'm looking for a utility that will remove HTTP codes 
> from documents and convert them to plain ASCII.
> -- 
> Mark Curts

Well, in case anyone else is interested, I wrote a Perl script to
strip HTML tags from text (HTTP is the protocol; I presume you meant
HTML):


#!/local/new/bin/perl

# J. Touch
# USC/ISI
# 8/96

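# Removes HTML tags from text input (files named on the command line, or stdin)

# "eatfront" is 1 while we are inside a tag that spans multiple lines
$eatfront = 0;

while ($line = <>) {
        # if already inside an HTML tag left open on an earlier line...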
        if ($eatfront == 1) {
                # delete through the terminator ">", if found
                if ($line =~ s/^[^>]*>//o) {
                        $eatfront = 0;
                } else {
                        # otherwise delete all and keep looking on next line
                        $line = "";
                }
        }
        # eat everything inside "<>"'s
        $line =~ s/<[^>]*>//go;
        # if there is a < without a matching >,  eat it and keep looking
        if ($line =~ s/<[^>]*$//o) {
                $eatfront = 1;
        }
        print $line;
}
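
To see how the multi-line case behaves, here is a rough sketch of the same
logic wrapped in a subroutine and run on a two-line sample. The name
strip_html_lines and the sample text are made up for illustration; they are
not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): same logic as the
# loop above, but it takes a list of lines and returns the stripped lines.
sub strip_html_lines {
    my @lines = @_;
    my $eatfront = 0;    # 1 while inside a tag left open on an earlier line
    my @out;
    for my $line (@lines) {
        if ($eatfront) {
            # delete through the closing ">", if this line has one
            if ($line =~ s/^[^>]*>//) {
                $eatfront = 0;
            } else {
                # still inside the tag: drop the whole line
                $line = "";
            }
        }
        # remove every complete <...> on this line
        $line =~ s/<[^>]*>//g;
        # an unmatched "<" starts a tag that continues on the next line
        if ($line =~ s/<[^>]*$//) {
            $eatfront = 1;
        }
        push @out, $line;
    }
    return @out;
}

# Made-up sample input: the <a ...> tag is split across two lines.
my @sample = (
    "<p>See the <a\n",
    "href=\"x.html\">notes</a> page.</p>\n",
);
print strip_html_lines(@sample);
```

On this sample the sketch should print "See the notes page." on a single
line: the newline that follows the unfinished "<a" is eaten along with the
tag, so the two input lines are joined. Note also that a ">" inside a quoted
attribute value would end the tag early, and character entities such as
&amp;amp; are left as they are.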
----------------------------------------------------------------------
Joe Touch - touch@isi.edu		    http://www.isi.edu/~touch/
ISI / Project Leader, ATOMIC-2, LSAM       http://www.isi.edu/atomic2/
USC / Research Assistant Prof.                http://www.isi.edu/lsam/
Received on Wednesday, 18 September 1996 11:23:35 EDT
