Re: Need to strip all HTML tags from a rendered web page

From: Rijk van Geijtenbeek <rijk@opera.com>
Date: Thu, 06 Feb 2003 11:02:50 +0100
To: HTML-tidy list <html-tidy@w3.org>
Message-ID: <oprj53i0dcyoq9u9@localhost>

On Wed, 5 Feb 2003 15:43:01 -0500, Jamie Eagan <jamieeagan@agora-inc.com> 
wrote:

>> Is anyone aware of a utility to extract the content from a web page?  We
>> are converting a large amount of content from an existing web site to a CM
>> system.  In the past my company has always done this manually by copying
>> the site content from a rendered page into a text editor like Notepad
>> (thereby stripping all the HTML) and then copying it into the CM
>> editor.  We have the ability to load the information into the app if the
>> content is loaded as text.  Is anyone aware of a tool that can spider
>> through a site and create multiple text files....

If you install Lynx, you can easily run the page through Lynx and let it 
output a nicely formatted text file - but without support for tables.
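
As a rough sketch of what that could look like when scripted (assuming a
Lynx binary named "lynx" is on the PATH; the helper names below are made up
for illustration), Lynx's -dump option writes the rendered page to standard
output and -nolist leaves out the numbered list of links at the end:

```python
import subprocess


def lynx_dump_command(source, lynx="lynx"):
    """Build the Lynx command line for a plain-text dump of `source`.

    `source` may be a local file name or a URL. The executable name
    "lynx" is an assumption; adjust it to match your installation.
    """
    # -dump: print the rendered page to stdout and exit
    # -nolist: omit the numbered list of links at the end of the dump
    return [lynx, "-dump", "-nolist", source]


def dump_with_lynx(source, out_path):
    """Run Lynx on `source` and save the rendered text to `out_path`."""
    result = subprocess.run(
        lynx_dump_command(source),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Lynx reports a failure
    )
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
```

Calling dump_with_lynx("index.html", "index.txt") once per page would
produce one text file per HTML file.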

My favorite text editor, NoteTab, also has a good 'convert to text' function
and can be scripted to run through a complete directory of HTML files on
your hard disk.
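
If neither tool is at hand, the same kind of batch conversion can be
approximated with a short script. The following is only a minimal sketch
using Python's standard html.parser module, not a description of how NoteTab
works; the function names, the choice of block-level tags, and the UTF-8
encoding are all assumptions, and tables are simply flattened to one line
per row:

```python
from html.parser import HTMLParser
from pathlib import Path

# Tags whose content is never shown as page text.
SKIPPED_TAGS = {"script", "style"}
# Tags that start a new line in the text output (an illustrative choice).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect the text of a page, dropping tags, scripts and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of `html`, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def convert_directory(src_dir, dst_dir):
    """Write a .txt file to `dst_dir` for every *.html file in `src_dir`."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for page in sorted(src.glob("*.html")):
        html = page.read_text(encoding="utf-8", errors="replace")
        out = dst / (page.stem + ".txt")
        out.write_text(html_to_text(html), encoding="utf-8")
        written.append(out)
    return written
```

convert_directory("site", "site-text") would then leave one text file per
page, ready to be loaded into the CM system.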

Tidy, however, cannot do this, so the question is rather off-topic for this
list.

-- 
 If you don't like having choices    |  Rijk van Geijtenbeek
 made for you, you should start      |   Documentation & QA
 making your own. -  Neal Stephenson |  mailto:rijk@opera.com
Received on Thursday, 6 February 2003 05:04:17 GMT
