Re: HTML Strippers

rmesa@best.com (Robert A. Mesa) said:

> Is there a utility to strip away HTML tags. 

if you can't find anything else, the following
perl script (which i call 'unhtml') will work ok:

  #!/usr/bin/perl

  undef($/);		# no record separator: <> reads the whole file at once
  $_ = <>;		# read in entire file
  s/<[^>]+>//g;		# remove <...>'s; [^>] matches newlines too
  print;		# print the file

this would be run like:

  unhtml file.html >file.txt

it's not by any means perfect -- a '>' inside a quoted
attribute value ends the tag early, so the rest of that
tag is left behind as text, and nothing is done with
entities (like &amp;).
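
if those two gaps matter, here is a rough sketch of one way to
narrow them. the sub name and the short entity list are my own
assumptions for illustration, not a complete answer: the pattern
lets a quoted string inside a tag contain '>', and only four
common entities are decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: strip tags while allowing '>' inside quoted
# attribute values, then decode a small, assumed set of entities.
sub unhtml {
    my ($html) = @_;

    # A tag is '<', then any mix of double-quoted strings, single-quoted
    # strings, or single characters other than '>' and quotes, then '>'.
    $html =~ s/<(?:"[^"]*"|'[^']*'|[^>"'])+>//g;

    # Decode '&amp;' last so that '&amp;lt;' ends up as '&lt;', not '<'.
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&quot;/"/g;
    $html =~ s/&amp;/&/g;

    return $html;
}

my $sample = '<a title="x > y" href="/">Tom &amp; Jerry</a>';
print unhtml($sample), "\n";    # prints: Tom & Jerry
```

to use it as a filter, read the whole input first (as in the
script above) and pass the string to unhtml(). an unbalanced
quote inside a tag can still confuse it, and comments and
numeric entities are not handled.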

another option, especially if you want the text
laid out the way the html would be rendered, is to
use the lynx browser in 'dump' mode:

  % lynx -dump file.html >file.txt

hope this helps.

--
John Labovitz
Technical Services Manager, Global Network Navigator <http://gnn.com/>
O'Reilly & Associates, Sebastopol, California, USA (+1 707 829 0515)

Received on Wednesday, 26 April 1995 01:41:02 UTC