Re: HTML Strippers

John Labovitz (johnl@ora.com)
Tue, 25 Apr 1995 21:40:36 -0700


Message-Id: <199504260440.VAA02920@bohemia.west.ora.com>
From: John Labovitz <johnl@ora.com>
To: rmesa@best.com
Cc: Multiple recipients of list <www-html@www10.w3.org>
Subject: Re: HTML Strippers 
In-Reply-To: Your message of "Wed, 26 Apr 1995 01:06:25 +0500."
             <199504260458.VAA14137@shell1.best.com> 
Date: Tue, 25 Apr 1995 21:40:36 -0700

rmesa@best.com (Robert A. Mesa) said:

> Is there a utility to strip away HTML tags?

if you can't find anything else, the following
perl script (which i call 'unhtml') will work ok:

  #!/usr/bin/perl

  undef($/);		# no input record separator: <> slurps the whole file
  $_ = <>;		# read in entire file
  s/<[^>]+>//g;		# remove <...>'s in the entire string
  print;		# print the file

this would be run like:

  unhtml file.html >file.txt

it's not by any means perfect -- a '>' inside a quoted
attribute value ends the match early, so the rest of that
tag leaks into the output, and entities (like &amp;) are
left undecoded.
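
those two gaps can be narrowed a little. what follows is
only a rough sketch, not part of the unhtml script: the
names strip_html, decode_entity and %entity are made up
for illustration, and the entity table covers just a
handful of names. it skips over quoted attribute values
when looking for the end of a tag, drops comments, and
decodes entities in a single pass so that &amp;lt; comes
out as &lt; rather than <.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical, deliberately incomplete entity table; extend as needed
my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');

# turn the inside of one entity (e.g. "amp", "#65", "#x41") into text
sub decode_entity {
    my ($name) = @_;
    return chr($1)      if $name =~ /^#(\d+)$/;
    return chr(hex($1)) if $name =~ /^#[xX]([0-9a-fA-F]+)$/;
    # unknown names are passed through unchanged
    return exists $entity{$name} ? $entity{$name} : "&$name;";
}

sub strip_html {
    my ($html) = @_;

    # drop comments first so a '>' inside one cannot end a tag early
    $html =~ s/<!--.*?-->//gs;

    # a tag is '<', then any mix of "quoted", 'quoted', or other
    # non-'>' characters, then the closing '>'
    $html =~ s/<(?:"[^"]*"|'[^']*'|[^'">])*>//g;

    # decode entities after the tags are gone, in one pass
    $html =~ s/&(#\d+|#[xX][0-9a-fA-F]+|\w+);/decode_entity($1)/ge;

    return $html;
}

# prints: R&D <notes>
print strip_html(qq{<a href="a>b.html" title='x>y'>R&amp;D &lt;notes&gt;</a>\n});
```

to use it on a file the way unhtml does, slurp the input
with undef($/) and <> and print strip_html($_) instead of
the sample string. a tag with an unbalanced quote will
still confuse it, so for anything serious a real html
parser is the safer route.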

another option, especially if you want the text
laid out the way a browser would render it, is to
use the lynx browser in 'dump' mode:

  % lynx -dump file.html >file.txt

hope this helps.

--
John Labovitz
Technical Services Manager, Global Network Navigator <http://gnn.com/>
O'Reilly & Associates, Sebastopol, California, USA (+1 707 829 0515)