- From: John Labovitz <johnl@ora.com>
- Date: Tue, 25 Apr 1995 21:40:36 -0700
- To: rmesa@best.com
- Cc: Multiple recipients of list <www-html@www10.w3.org>
rmesa@best.com (Robert A. Mesa) said:

> Is there a utility to strip away HTML tags.

if you can't find anything else, the following perl script (which i
call 'unhtml') will work ok:

    #!/usr/bin/perl

    $* = 1;           # turn on multi-line string matching
    undef($/);        # no input record separator: read everything at once
    $_ = <>;          # read in entire file
    s/<[^>]+>//g;     # remove <...>'s in the entire string
    print;            # print the file

this would be run like:

    unhtml file.html >file.txt

it's not by any means perfect -- a '>' inside a quoted attribute value
ends the match early, so the rest of that tag is left in the output,
and nothing is done with entities (like &amp;).

another option, especially if you want the html rendered as formatted
text, is to use the lynx browser in 'dump' mode:

    % lynx -dump file.html >file.txt

hope this helps.

-- 
John Labovitz
Technical Services Manager, Global Network Navigator <http://gnn.com/>
O'Reilly & Associates, Sebastopol, California, USA (+1 707 829 0515)
Received on Wednesday, 26 April 1995 01:41:02 UTC
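
The two limitations named in the message (angle brackets inside quoted strings, and untouched entities) can be avoided by using a real HTML tokenizer instead of a regular expression. The following is only a sketch of that idea, not part of the original script: it is written in Python rather than perl so that it needs nothing beyond the standard library, and the names `TagStripper` and `unhtml` are chosen here for illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keep the text between tags and discard the tags themselves."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found outside of tags.
        self.parts.append(data)


def unhtml(markup):
    """Return the text content of an HTML string (hypothetical helper)."""
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()  # flush any text still buffered at end of input
    return "".join(stripper.parts)
```

Calling `unhtml(open("file.html").read())` would give roughly what the perl script prints, with entities decoded. Like the perl script, this sketch does no formatting, and it keeps the text inside `<script>` and `<style>` elements.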