Re: HTML Strippers from John Labovitz on 1995-04-26 (www-html@w3.org from April 1995)

From: John Labovitz <johnl@ora.com>
Date: Tue, 25 Apr 1995 21:40:36 -0700
To: rmesa@best.com
Cc: Multiple recipients of list <www-html@www10.w3.org>
Message-Id: <199504260440.VAA02920@bohemia.west.ora.com>

rmesa@best.com (Robert A. Mesa) said:

> Is there a utility to strip away HTML tags. 

if you can't find anything else, the following
perl script (which i call 'unhtml') will work ok:

  #!/usr/bin/perl

  undef($/);		# slurp mode: read the whole input at once
  $_ = <>;		# read in entire file
  s/<[^>]+>//g;		# remove <...>'s in the entire string
  print;		# print the file

this would be run like:

  unhtml file.html >file.txt

it's not by any means perfect -- a '>' inside a quoted
attribute value ends the match early and leaves the rest
of the tag in the output, and nothing is done with
entities (like &amp;).
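
a rough sketch of how both gaps might be narrowed, in the
same spirit as the script above. the name strip_html and the
short entity table are made up for illustration, and this is
still a regex hack rather than a real html parser:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper for illustration; not part of the original 'unhtml'
sub strip_html {
    my ($text) = @_;

    # a tag is '<', then any run of double-quoted strings, single-quoted
    # strings, or characters that are neither quotes nor '>', then '>'.
    # a '>' inside a quoted attribute value no longer ends the tag.
    $text =~ s/<(?:"[^"]*"|'[^']*'|[^'">])*>//g;

    # decode a few common entities in a single pass, so that
    # '&amp;lt;' comes out as '&lt;' rather than '<'
    my %entity = (lt => '<', gt => '>', quot => '"', amp => '&');
    $text =~ s/&(lt|gt|quot|amp);/$entity{$1}/g;

    return $text;
}

print strip_html(qq{<a href="a>b.html">AT&amp;T</a> says 1 &lt; 2\n});
```

to use it on a file, read the input as the original script
does and print strip_html($_). comments that contain '>',
tags with an unbalanced quote, and entities outside the
table are still left alone.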

another option, especially if you want the resulting
text to be formatted, is to use the lynx browser
in 'dump' mode:

  % lynx -dump file.html >file.txt

hope this helps.

--
John Labovitz
Technical Services Manager, Global Network Navigator <http://gnn.com/>
O'Reilly & Associates, Sebastopol, California, USA (+1 707 829 0515)

Received on Wednesday, 26 April 1995 01:41:02 UTC